[Q] From Windows internal format to UTF-8?

Hello,
I’m a newbie in Ruby and I have a problem with character encoding.

I use Ruby to list the files contained in a directory on Windows.
Those filenames contain non-ASCII characters (French accents, etc.)…

I’d like to put the names of those files in an XHTML web page.
I have two possibilities:

  • I transcode from the ‘native Windows encoding’ to UTF-8.
  • I declare the XHTML file to be in the encoding used by Ruby on
    Windows.

My preferred solution would be to do the transcoding, but I don’t know which
encoding Dir.each returns…
What do people do in Ruby to do the transcoding?

Or, if the transcoding is too complicated, how do I find the encoding
corresponding to the locale (French) so that my web page shows the
accents correctly?

Google and the online documentation weren’t helpful, which I find strange
because this problem must be quite common…

I’m currently using Ruby 1.6.8, but I can upgrade to a newer version if
needed.

Thanks for your help, and have a nice day.

Renox.
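
Whichever option is chosen, the page has to state its encoding. Here is a minimal sketch of the UTF-8 variant, assuming the filenames have already been converted to UTF-8 and a Ruby recent enough to support the `\u` string escape. The helper names `xml_escape` and `xhtml_listing` are made up for illustration and are not part of any library.

```ruby
# Hypothetical helpers: build a small XHTML page that lists filenames.
# Assumption: every name passed in is already a UTF-8 string.
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Escape the characters that are special in XML text and attribute values.
def xml_escape(text)
  text.gsub(/[&<>"]/, XML_ESCAPES)
end

def xhtml_listing(utf8_names)
  lines = []
  # The XML declaration and the meta element both announce UTF-8.
  lines << '<?xml version="1.0" encoding="UTF-8"?>'
  lines << '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"'
  lines << '  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
  lines << '<html xmlns="http://www.w3.org/1999/xhtml">'
  lines << '<head>'
  lines << '  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'
  lines << '  <title>Files</title>'
  lines << '</head>'
  lines << '<body>'
  lines << '  <ul>'
  utf8_names.each { |name| lines << "    <li>#{xml_escape(name)}</li>" }
  lines << '  </ul>'
  lines << '</body>'
  lines << '</html>'
  lines.join("\n") + "\n"
end

puts xhtml_listing(["\u00E9t\u00E9.txt", "R&D notes.txt"])
```

The declared encoding only helps if the bytes written to the file really are UTF-8, so the transcoding question still has to be settled first.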

This is a can of worms, and it goes far beyond the Ruby language. To
handle character encodings correctly, each character string, as well as
each source and each sink of characters, needs an :encoding attribute. This
is planned for Ruby 1.9/2.0.

Since you work on Windows, you might be able to get the necessary
information (encoding of filesystem names) from the OS (but don’t ask me
how), since, AFAIK, NTFS stores file names in Unicode (UTF-16 or UCS-2, I
think) and the Windows API transcodes filenames from/to the default
character encoding for your application. FAT filesystems store filenames
in some default encoding, and Windows might know which one and translate as
needed (I am not sure).

On Linux, the situation is far worse, since in ext2/ext3, filenames are
just strings of bytes (only 0x0 and 0x2f are disallowed).
I have no clue how Ruby (or any other program) could determine the
encoding of filenames there.

Tobias
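
To illustrate the "just strings of bytes" point: a program can test what a name is compatible with, but it cannot ask the filesystem what the name was meant to be. The sketch below assumes Ruby 1.9 or later, where strings carry an encoding; `describe_name_bytes` is a hypothetical helper, not a library call.

```ruby
# Hypothetical helper: classify raw filename bytes.
# It only reports what the bytes are compatible with; it cannot prove
# which encoding the person who created the file actually used.
def describe_name_bytes(raw)
  bytes = raw.dup.force_encoding(Encoding::BINARY)
  if bytes.ascii_only?
    :ascii                # safe under almost any 8-bit encoding
  elsif bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
    :utf8_compatible      # well-formed UTF-8, very likely intended as such
  else
    :unknown_8bit         # some legacy 8-bit encoding; the program must be told which
  end
end

latin1_name = "caf\xE9.txt".b      # "café.txt" as ISO-8859-1 bytes
utf8_name   = "caf\xC3\xA9.txt".b  # the same name as UTF-8 bytes

p describe_name_bytes("readme.txt")
p describe_name_bytes(latin1_name)
p describe_name_bytes(utf8_name)
```

A name that falls in the last category still needs outside knowledge (locale, convention, or the user) before it can be transcoded.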

“Tobias Peters” tpeters@uni-oldenburg.de wrote in message
news:Pine.LNX.4.44.0307221251240.3017-100000@localhost.localdomain…

> This is a can of worms, and it goes far beyond the ruby language. To
> handle character encodings correctly, each character string as well as
> each source and each sink of characters needs an :encoding attribute. This
> is planned for ruby 1.9/2.0.

This explains why I didn’t find any information on the subject…

> Since you work on windows, you might be able to get the necessary
> information (encoding of filesystem names) from the OS (but don’t ask me
> how), since, AFAIK, NTFS stores file names in Unicode (UTF16 or UCS2, I
> think) and Windows API transcodes filenames from / to the default
> character encoding for your application. FAT filesystems store filenames
> in some default encoding, and windows might know which and translate as
> needed (i am not sure).

I’ll try UTF-16 or UCS-2, and if that doesn’t work, I’ll look at the source of
Dir.each to see which system call it uses. A Windows programming group
should then be able to tell me which format that call returns. Apparently the
library documentation is a “bit high level” (at least on Windows) about these
“details”.

> On Linux, the situation is far worse, since in EXT2 / EXT3, filenames are
> just strings of bytes (only 0x0 and 0x2f are disallowed).
> I have no clue how ruby (or any other program) could determine the
> encoding of filenames there.

Well, I think that Ruby (or any other program) cannot know; it’s up to
the programmer to know…
The strange thing (for me) is that apparently there is no library in Ruby to
do the transcoding.
Of course it will be nice when this is incorporated into the language,
but I’m surprised that in the meantime there is no library serving the same
purpose.
I’ll check the RAA again to see if I missed it the first time.

Thanks a lot for your help.
RenoX
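
One way to find out which encoding a name is in, without reading the interpreter source, is to compare the bytes actually returned with the bytes the same name would have in each candidate encoding. The sketch below assumes Ruby 1.9 or later; the sample name and the candidate list are illustrative choices, and `hex_bytes` is a hypothetical helper.

```ruby
# Print the byte sequence of one accented name in several candidate encodings.
# Comparing this table with the bytes of a real directory entry narrows down
# which encoding is in use.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

sample     = "\u00E9t\u00E9.txt"   # "été.txt", written with escapes
candidates = ["UTF-8", "ISO-8859-1", "Windows-1252", "UTF-16LE"]

candidates.each do |name|
  puts "#{name.ljust(12)} #{hex_bytes(sample.encode(name))}"
end
```

For a name from a real directory listing, `hex_bytes(entry)` gives the line to compare against. ISO-8859-1 and Windows-1252 produce identical bytes for this sample, so the comparison alone cannot tell those two apart.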

An error message made me believe that the message hadn’t been posted, so of
course I retried :(
OK, I found what Dir.each returns on Windows XP: it is ISO-8859-1 or
windows-1252 (the two are very similar).

So what I must find now is how to “convert” this to ASCII (remove the accents
from the letters)…
I think some mapping table/function must already exist; I just need to
find it!

Best regards, and sorry again for the duplicates.
RenoX
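
A rough sketch of the accent removal described above, assuming a much later Ruby than the 1.6.8 in use here (String#unicode_normalize needs Ruby 2.2 or newer) and assuming the names really are Windows-1252 bytes. `ascii_fold` is a hypothetical helper name, and this is one possible approach rather than an established mapping table.

```ruby
# Hypothetical helper: reduce a Windows-1252 filename to plain ASCII.
# Steps: transcode to UTF-8, decompose accented letters into base letter plus
# combining mark (NFD), drop the marks, then replace whatever is still
# non-ASCII with "?".
def ascii_fold(raw, source_encoding = "Windows-1252")
  utf8 = raw.dup.force_encoding(source_encoding)
            .encode("UTF-8", invalid: :replace, undef: :replace)
  without_marks = utf8.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
  without_marks.encode("US-ASCII", undef: :replace, replace: "?")
end

puts ascii_fold("\xE9t\xE9.txt".b)   # "été.txt" in Windows-1252
puts ascii_fold("gar\xE7on.txt".b)   # "garçon.txt" in Windows-1252
```

Letters without a Unicode decomposition, such as the French "œ", are not turned into "oe" by this approach and end up as "?", so a small hand-written table would still be needed for them.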

Hi,

In message “Re: [Q] From Windows internal format to UTF-8?” on 03/07/23, “renoX” renZYX@hotmail.com writes:

> The strange thing (for me) is that apparently there is no library in Ruby to
> do the transcoding.

How about iconv? Isn’t it available on Windows?

matz.

Hi,

At Wed, 23 Jul 2003 11:50:50 +0900, Yukihiro Matsumoto wrote:

> > The strange thing (for me) is that apparently there is no library in Ruby to
> > do the transcoding.
>
> How about iconv? Isn’t it available on Windows?

libiconv is ported to Windows.

LibIconv for Windows

Nobu Nakada

Thanks a lot. I was looking in the RAA, but as long as I can do the
conversion, iconv will be fine.

RenoX
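
For readers on later Ruby versions: from 1.9 onward the same kind of conversion is available on String itself, without iconv. A minimal sketch, assuming the names arrive as Windows-1252 bytes as reported earlier in the thread for Windows XP; `names_to_utf8` is a hypothetical helper name.

```ruby
# Hypothetical helper: transcode a list of raw filenames to UTF-8.
# Assumption: the source bytes are Windows-1252; pass another encoding name
# as the second argument if that turns out to be wrong.
def names_to_utf8(raw_names, source_encoding = "Windows-1252")
  raw_names.map do |raw|
    # force_encoding only relabels the bytes; encode does the real conversion.
    # Bytes with no mapping are replaced instead of raising an error.
    raw.dup.force_encoding(source_encoding)
       .encode("UTF-8", invalid: :replace, undef: :replace)
  end
end

raw = ["No\xEBl.txt".b, "\xE9t\xE9.txt".b]
names_to_utf8(raw).each { |name| puts name }
```

The two-step shape matters: relabeling alone does not change any byte, so calling only force_encoding("UTF-8") on these names would leave them as invalid UTF-8.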
