I write Ruby plugins for Google Sketchup.
Sketchup uses UTF-8 strings and passes them to Ruby (1.8), which
handles Strings as a simple series of bytes. This caused problems when I
tried to pass a String I got from Sketchup which contained a file path
with some Norwegian letters (æøåÆØÅ): Ruby raised an error saying the
file/path didn't exist.
This was because æøåÆØÅ lie outside the ASCII character set, so they
were returned as double-byte characters in UTF-8.
Searching the net I found some hacks that convert UTF-8 into
single-byte characters: str_utf8.unpack('U*').pack('C*')
The Norwegian characters lie outside the ASCII range, but they still
get packed into single-byte characters that the File classes can
handle.
Example:
'æøåÆØÅ'.length # <- all these characters cause the File class to fail
12
'æøåÆØÅ'.unpack('U*').pack('C*').length # <- the File class can now handle this
6
So it seems that the File class doesn't just handle ASCII, but maybe
ANSI (Windows-1252) or ISO-8859-1. Or does this depend on some system
setting?
My tests have been on a Norwegian Windows XP system with a Norwegian
locale. The default language for applications that don't support
Unicode is also set to Norwegian.
To sum up, what I'm trying to work out is how UTF-8 characters above
the ASCII range (0-127) are mapped to the 128-255 range. Does the
128-255 range refer to ANSI (1252) or ISO-8859-1, and is this due to
system settings?
--
Posted via http://www.ruby-forum.com/.
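A small sketch of what the two lengths in the example above are measuring, assuming the string really is UTF-8 encoded. The byte escapes spell out "æøå" so the result does not depend on the encoding the script file is saved in; the variable names are only illustrative.

```ruby
# "æøå" written as raw UTF-8 bytes: each letter takes two bytes.
utf8 = "\xC3\xA6\xC3\xB8\xC3\xA5"

bytes       = utf8.unpack('C*')  # the individual bytes of the string
code_points = utf8.unpack('U*')  # the Unicode code points decoded from UTF-8

puts bytes.length        # 6 bytes for 3 letters
puts code_points.length  # 3 code points
p code_points            # [230, 248, 229]
```

Ruby 1.8's String#length counts bytes, which is why six letters report a length of 12 there.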
Looking at ISO/IEC 8859-1 - Wikipedia
There's an extra character code besides the code that equals the ANSI
code. It consists of 3 digits ranging from 0-7. This code can be used
in Ruby in conjunction with the escape character:
    ANSI   ISO-8859-1
--------------------
æ - 230 - 230 / 346
ø - 248 - 248 / 370
å - 229 - 229 / 345
Æ - 198 - 198 / 306
Ø - 216 - 216 / 330
Å - 197 - 197 / 305
"\306".length # Code for Æ
1
Since this code doesn't exist in ANSI, it seems to me that Ruby
interprets ISO-8859 encoding. But I'm still wondering if this is system
dependent...
What you are doing there is transcoding from UTF-8 to Latin-1 (or ISO-8859-1). Here's the proof:
$ ruby -KU -r iconv -e 'utf8 = "æøåÆØÅ"; p utf8.unpack("U*").pack("C*") == Iconv.conv("ISO-8859-1", "UTF-8", utf8)'
true
James Edward Gray II
On Jul 7, 2009, at 7:28 AM, Thomas Thomassen wrote:
Searching the net I found some hacks that converted UTF-8 into single
byte characters: str_utf8.unpack('U*').pack('C*')
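Iconv was removed from the standard library in Ruby 2.0, so the one-liner above no longer runs on current interpreters. A rough equivalent for Ruby 1.9 and later, using String#encode (which is not available in the 1.8 that Sketchup bundles), might look like this. The variable names are illustrative.

```ruby
# "æøåÆØÅ" spelled with \u escapes so the source file encoding does not matter.
utf8 = "\u00E6\u00F8\u00E5\u00C6\u00D8\u00C5"

# Bytes produced by the pack/unpack trick.
hack_bytes = utf8.unpack("U*").pack("C*").unpack("C*")

# Bytes produced by a real transcoding to ISO-8859-1.
latin1_bytes = utf8.encode("ISO-8859-1").unpack("C*")

p hack_bytes == latin1_bytes  # true: both give the same six bytes
```

The comparison is done on byte arrays rather than on the strings themselves because the two strings carry different encoding tags.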
Hello,
How Windows interprets file paths depends on which API calls you use
and on the current system locale. There is one set of Windows API
functions that always uses UTF-16 and another one that always uses the
encoding associated with the current system locale.
I think Ruby indirectly accesses the latter API and doesn't do any
character set conversions before passing strings to the operating
system, but I'm not entirely sure about that.
cu,
Thomas
2009/7/7 Thomas Thomassen <thomas@thomthom.net>:
[...]
So it seems that the File class doesn't just handle ASCII, but maybe
ANSI (Windows-1252) or ISO-8859-1. Or does this depend on some system
setting?
[...]
--
When C++ is your hammer, every problem looks like your thumb.
Sorry, I just realised that the extra number was the octal variant.
So it could still be ANSI...
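The relation between the two columns of the table can be checked directly. A minimal sketch using only integer literals, unpack and to_s; the variable names are my own.

```ruby
# A leading 0 marks an octal integer literal in Ruby, so these are the same number.
puts 0306 == 198   # true
puts 0346 == 230   # true

# "\306" is an octal escape: a one-byte string holding the byte 198.
byte_values = "\306".unpack('C*')
p byte_values      # [198]

# Converting the decimal column of the table to octal reproduces the extra column.
decimals = [230, 248, 229, 198, 216, 197]
octals   = decimals.map { |n| n.to_s(8) }
p octals           # ["346", "370", "345", "306", "330", "305"]
```

Since the octal form carries no extra information, and Windows-1252 and ISO-8859-1 assign these six letters the same byte values, the table alone cannot tell the two encodings apart.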
Gregory Brown wrote:
Gray Soft / Not Found
I have seen that series. I still can't work out how Ruby determines what
UTF-8 character to map to the 128-255 spaces.
I missed why you wouldn't just set $KCODE="U" and stick w. UTF-8?
Anyway, I *think* chars.pack("C*") is going to give you ISO-8859-1 but
someone else will need to verify for you.
-greg
On Tue, Jul 7, 2009 at 10:21 AM, Thomas Thomassen<thomas@thomthom.net> wrote:
Gregory Brown wrote:
Gray Soft / Not Found
I have seen that series. I still can't work out how Ruby determines what
UTF-8 character to map to the 128-255 spaces.
Also, since you know the original encoding, you can use IConv to
explicitly convert to whatever you want.
-greg
On Tue, Jul 7, 2009 at 10:42 AM, Gregory Brown<gregory.t.brown@gmail.com> wrote:
On Tue, Jul 7, 2009 at 10:21 AM, Thomas Thomassen<thomas@thomthom.net> wrote:
Gregory Brown wrote:
Gray Soft / Not Found
I have seen that series. I still can't work out how Ruby determines what
UTF-8 character to map to the 128-255 spaces.
I missed why you wouldn't just set $KCODE="U" and stick w. UTF-8?
Anyway, I *think* chars.pack("C*") is going to give you ISO-8859-1 but
someone else will need to verify for you.
Gregory Brown wrote:
I missed why you wouldn't just set $KCODE="U" and stick w. UTF-8?
Because Sketchup uses Ruby to allow users to write plugins for the
application. That flag, as far as I understand, is global and would
affect all scripts, which might break a number of things. Also, the Ruby
1.8 version shipped with SU is not the whole package, so I'm not sure
it would be possible even if I wanted to.
Anyway, I *think* chars.pack("C*") is going to give you ISO-8859-1 but
someone else will need to verify for you.
-greg
I'm currently looking into whether the Unicode code points in the
128-255 range match ISO-8859-1 or ANSI. That might be the
answer.
I've been doing some testing on the 128-255 range, and from what I can
gather all byte values within the ISO 8859-1 range are identical to the
corresponding Unicode code points.
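One way to see why the trick lands on ISO-8859-1 rather than Windows-1252: unpack('U*') returns Unicode code point numbers and pack('C*') writes each number back as one byte, without consulting any character table. ISO-8859-1 happens to use the same numbers as Unicode for 0-255, while Windows-1252 differs in the 128-159 range. A sketch, where the euro sign is my own choice of test character:

```ruby
# Round trip: code point -> UTF-8 -> code point -> single byte.
# For every value in 128..255 the resulting byte equals the code point number.
identical = (128..255).all? do |cp|
  utf8 = [cp].pack('U')                       # UTF-8 encoding of the code point
  utf8.unpack('U*').pack('C*').unpack('C*') == [cp]
end
puts identical   # true

# The euro sign is U+20AC. Windows-1252 stores it as byte 0x80, but the
# trick knows nothing about Windows-1252: pack('C') keeps only the low
# 8 bits of 0x20AC, which is 0xAC.
euro = [0x20AC].pack('U')
euro_bytes = euro.unpack('U*').pack('C*').unpack('C*')
p euro_bytes     # [172], i.e. 0xAC, not 0x80
```

So for text limited to the 160-255 range, where the two encodings agree, the result is valid in both; anything above 255 is silently mangled.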
I checked the $KCODE variable and it returns "UTF8".
Now, what does that do to Ruby? Why does File.exist?('c:\Test æøå') fail
if it's UTF-8 encoded?
I checked the $KCODE variable and it returns "UTF8".
Now, what does that do to Ruby?
I answer that question in detail in this article:
Why does File.exist?('c:\Test æøå') fail if it's UTF-8 encoded?
The IO methods are not $KCODE aware. You will likely need to transcode the Strings you pass them.
James Edward Gray II
On Jul 8, 2009, at 11:30 AM, Thomas Thomassen wrote:
James Gray wrote:
I checked the $KCODE variable and it returns "UTF8".
Now, what does that do to Ruby?
I answer that question in detail in this article:
Gray Soft / Not Found
Why does File.exist?('c:\Test æøå') fail if it's UTF-8 encoded?
The IO methods are not $KCODE aware. You will likely need to
transcode the Strings you pass them.
James Edward Gray II
What does the IO method require?
That's a good question. I'm not sure what it does on Windows.
Is it the Ruby IO methods or the system methods they call that don't handle UTF-8?
I assume it's the underlying Windows API, though I'm just guessing there.
Windows' NTFS format supports UTF-16 encoding - would it work if I
transcoded the strings from UTF-8 to UTF-16?
I think it depends on which API methods you call, so I'm guessing you cannot do this. I think Ruby would need to be changed to use those methods first.
I'm trying to avoid transcoding to an 8-bit-only encoding as that'll just
cause grief when I encounter characters outside the range.
Have you had a look at Ruby 1.9 yet? I'm wondering if this issue has been improved there, using the new encoding support. I don't know that it has. I'm more just wondering out-loud…
James Edward Gray II
On Jul 8, 2009, at 11:42 AM, Thomas Thomassen wrote:
On Jul 8, 2009, at 11:30 AM, Thomas Thomassen wrote:
James Gray wrote:
Why does File.exist?('c:\Test æøå') fail if it's UTF-8
encoded?
The IO methods are not $KCODE aware. You will likely need to
transcode the Strings you pass them.
James Edward Gray II
What does the IO method require?
That's a good question. I'm not sure what it does on Windows.
Any clues what it does on OSX? The scripts will run on Macs as well.
Is it the Ruby IO methods or the system methods they call that
don't handle UTF-8?
I assume it's the underlying Windows API, though I'm just guessing
there.
Windows' NTFS format supports UTF-16 encoding - would it work if I
transcoded the strings from UTF-8 to UTF-16?
I think it depends on which API methods you call, so I'm guessing you
cannot do this. I think Ruby would need to be changed to use those
methods first.
Since NTFS supports UTF-16, I guess it's the Ruby API that calls the
wrong Win32 APIs?
Can I make my own API calls?
I'm trying to avoid transcoding to an 8-bit-only encoding as that'll just
cause grief when I encounter characters outside the range.
Have you had a look at Ruby 1.9 yet? I'm wondering if this issue has
been improved there, using the new encoding support. I don't know
that it has. I'm more just wondering out-loud…
James Edward Gray II
The scripts I write are plugins for Google Sketchup, so the Ruby version
I have at my disposal is the one Sketchup bundles: a partial 1.8 version.
While I've been searching for solutions I've noticed that v1.9 has
better support for various encodings, but unfortunately it's of no use
to me.
So my problem is that I have to deal with string data that comes from
Sketchup in UTF-8 format. I might even have to deal with files and
folders that include characters outside the Windows-1252 or ISO-8859-1
range (whichever the IO functions are using; I've not been able to
pinpoint this). If I get characters outside that range, it's impossible
to transcode.
And I also don't know what would happen for an Eastern user. I'm
wondering if the IO functions would assume a different 8-bit encoding...
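A hedged sketch of a guard for that situation: convert only when every character fits in a single byte, and report failure otherwise instead of producing a mangled path. `utf8_to_single_byte` is a hypothetical helper name, and returning nil for unconvertible input is my own design choice. Note that it yields ISO-8859-1 bytes, which coincide with Windows-1252 only outside the 128-159 range.

```ruby
# Hypothetical helper: returns the single-byte form of a UTF-8 string,
# or nil when any character has a code point above 255.
def utf8_to_single_byte(utf8)
  code_points = utf8.unpack('U*')
  return nil if code_points.any? { |cp| cp > 255 }
  code_points.pack('C*')
end

norwegian = "c:/Test \xC3\xA6\xC3\xB8\xC3\xA5"   # "c:/Test æøå" as UTF-8 bytes
chinese   = "c:/\xE4\xB8\xAD\xE6\x96\x87"        # "c:/中文" as UTF-8 bytes

p utf8_to_single_byte(norwegian).unpack('C*').last(3)  # [230, 248, 229]
p utf8_to_single_byte(chinese)                         # nil
```

A caller would pass the result to File.exist? only when it is not nil, and fall back to some other strategy (or an error message) when it is.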
On Jul 8, 2009, at 11:42 AM, Thomas Thomassen wrote:
> Windows' NTFS format supports UTF-16 encoding - would it work if I
> transcoded the strings from UTF-8 to UTF-16?
I think it depends on which API methods you call, so I'm guessing you cannot do this. I think Ruby would need to be changed to use those methods first.
> I'm trying to avoid transcoding to an 8-bit-only encoding as that'll just
> cause grief when I encounter characters outside the range.
Have you had a look at Ruby 1.9 yet? I'm wondering if this issue has been improved there, using the new encoding support. I don't know that it has. I'm more just wondering out-loud…
It's only begun to improve as of the ruby 1.9.2 development version.
(1.9.1 and earlier use the 8-bit windows file API routines.)
This ruby-core post provides a partial list of methods in 1.9.2dev
which now work with windows unicode paths, as of
1.9.2dev (2009-06-24) [i386-mswin32_71]
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/24010
Regards,
Bill
From: "James Gray" <james@grayproductions.net>
On Jul 8, 2009, at 11:42 AM, Thomas Thomassen wrote:
Bill Kelly wrote:
From: "James Gray" <james@grayproductions.net>
Have you had a look at Ruby 1.9 yet? I'm wondering if this issue has been improved there, using the new encoding support. I
don't know that it has. I'm more just wondering out-loud…
It's only begun to improve as of the ruby 1.9.2 development version.
(1.9.1 and earlier use the 8-bit windows file API routines.)
This ruby-core post provides a partial list of methods in 1.9.2dev
which now work with windows unicode paths, as of
1.9.2dev (2009-06-24) [i386-mswin32_71]
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/24010
Regards,
Bill
I see. So Ruby calls Win32 APIs that don't handle UTF-8. But what do
they use then? Windows-1252? (Or would that be system dependent?) If
it's not a fixed character set, is there any way of finding that out,
so that I have a chance to try to transcode correctly?
James Gray wrote:
That's a good question. I'm not sure what it does on Windows.
Any clues what it does on OSX? The scripts will run on Macs as well.
Unlike that other OS, both OS X and Linux have taken an approach
I like to refer to as, NOT MIND-NUMBINGLY STUPID.
In OS X and Linux, one can use the same API calls one has always
used, as they are now UTF-8 savvy.
Windows' NTFS format supports UTF-16 encoding - would it work if I
transcoded the strings from UTF-8 to UTF-16?
I think it depends on which API methods you call, so I'm guessing you
cannot do this. I think Ruby would need to be changed to use those
methods first.
Since NTFS supports UTF-16, I guess it's the Ruby API that calls the wrong Win32 APIs?
Can I make my own API calls?
In ruby 1.8 embedded into our C++ application, I've created hooks
so that I can call our unicode-savvy C++ routines from ruby.
I suppose it may be possible to do this without involving a
ruby C extension, assuming the ruby Win32API module can
be made to call routines like _wopen and such. I haven't tried that.
The scripts I write are plugins for Google Sketchup, so the Ruby version I have at my disposal is the one Sketchup bundles: a partial 1.8 version.
While I've been searching for solutions I've noticed that v1.9 has better support for various encodings, but unfortunately it's of no use to me.
So my problem is that I have to deal with string data that comes from Sketchup in UTF-8 format. I might even have to deal with files and folders that include characters outside the Windows-1252 or ISO-8859-1 range (whichever the IO functions are using; I've not been able to pinpoint this). If I get characters outside that range, it's impossible to transcode.
And I also don't know what would happen for an Eastern user. I'm wondering if the IO functions would assume a different 8-bit encoding...
For best 8-bit compatibility you'll want to encode to Windows1252.
But, this (of course) won't help at all with Chinese characters, etc.
Regards,
Bill
From: "Thomas Thomassen" <thomas@thomthom.net>
I just tried on a Mac. It worked fine with Norwegian letters there, so
it seems that Ruby 1.8 on OSX calls UTF-8 aware IO system calls.
Then there's the question of what encoding is used on Windows.
And can I call Unicode-aware Windows IO API methods myself, bypassing
the built-in Ruby ones?
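Whatever mechanism ends up making the call, the wide-character Windows routines expect UTF-16LE strings terminated by a 16-bit zero. A sketch of building such a buffer with pack/unpack alone follows. `utf8_to_utf16le` is a hypothetical helper, it handles only characters in the Basic Multilingual Plane (no surrogate pairs), and the actual API call is deliberately left out because how it would be wired up in Sketchup's Ruby is not established in this thread.

```ruby
# Hypothetical helper: UTF-8 in, null-terminated UTF-16LE bytes out.
# Only code points up to 0xFFFF are supported; others raise an error.
def utf8_to_utf16le(utf8)
  code_points = utf8.unpack('U*')
  if code_points.any? { |cp| cp > 0xFFFF }
    raise ArgumentError, 'characters outside the BMP are not supported'
  end
  (code_points + [0]).pack('v*')   # 'v' = 16-bit unsigned, little-endian
end

wide = utf8_to_utf16le("\xC3\xA6\xC3\xB8")   # "æø" as UTF-8 bytes
p wide.unpack('C*')   # [230, 0, 248, 0, 0, 0]
```

Unlike the single-byte trick, this form can also carry characters such as Chinese ones, which is the case an 8-bit encoding cannot cover.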