Unicode

I hate to discuss something related to the development timeline, I know it's tenable, but when will it be reasonable to expect Unicode support from Ruby?

Ruby has some UTF-8 support today. That support will grow with the m17n (multilingualization) work, though.

See last question and answer here:

James Edward Gray II

···

On Sep 14, 2007, at 9:05 PM, Zephyr Pellerin wrote:

I hate to discuss something related to the development timeline, I know it's tenable, but when will it be reasonable to expect Unicode support from Ruby?

Zephyr Pellerin wrote:

I hate to discuss something related to the development timeline, I know it's tenable, but when will it be reasonable to expect Unicode support from Ruby?

"Unicode" is not an encoding. Are you asking for UTF-8, UTF-16, or something else?

···

--
  Phlip

Zephyr Pellerin wrote:

I hate to discuss something related to the development timeline, I know
it's tenable, but when will it be reasonable to expect Unicode support
from Ruby?

I was just looking at the source code for 1.8.6 this weekend. The C
syntax that's being used is pre-ANSI-C (which means in 1988, it was
"old" syntax).

Rotsa Ruck.

Todd

···

--
Posted via http://www.ruby-forum.com/.

Ruby has unicode support. Sort of. Regexes work in UTF-8 when $KCODE
is set to "U" (and the default is "N" even in UTF-8 locales, and if
you specify the -K option in the .rb file it overrides the option
specified on the command line, heh).
The non-regex methods do not work, but you can convert the string with
str.scan(/./)[0] or str.unpack "U*", and use stuff like each, reverse, ...
You have to remember to convert the string back, though.

Thanks

Michal
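A minimal sketch of the unpack/pack round trip described above. The helper name reverse_by_codepoint is made up for illustration, and the sample string is written with raw UTF-8 byte escapes so it does not depend on the $KCODE setting:

```ruby
# Reverse a UTF-8 string by code point rather than by byte.
# reverse_by_codepoint is a hypothetical helper, not part of Ruby.
def reverse_by_codepoint(str)
  codepoints = str.unpack("U*")   # UTF-8 bytes -> array of integer code points
  codepoints.reverse.pack("U*")   # array of code points -> UTF-8 string again
end

word = "caf\xC3\xA9"               # "café"; the last character is two bytes in UTF-8
puts word.unpack("U*").length      # counts characters, not bytes
puts reverse_by_codepoint(word)
```

The pack("U*") call is the "convert the string back" step; forgetting it leaves you with an array of integers instead of a string.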

···

On 15/09/2007, Zephyr Pellerin <ztz@nxvr.org> wrote:

I hate to discuss something related to the development timeline, I know
it's tenable, but when will it be reasonable to expect Unicode support
from Ruby?

I was just looking at the source code for 1.8.6 this weekend. The C
syntax that's being used is pre-ANSI-C (which means in 1988, it was
"old" syntax).

Apples and oranges. Unicode libraries like iconv use C linkage, so they can bond with most C implementations regardless of their compliance. (C linkage is very weak and simplistic.) All Cs can handle 8-bit strings, and can be programmed to use 16-bit strings, which are the requirements for UTF-8 and UTF-16.

As with most languages, Ruby's source is in a primitive form of C to maximize the number of compilers, and hence the number of platforms and hardware, that it runs on. I would suspect - unless Matz is an even greater genius than average - that Ruby's C style has been carefully retrofitted, after the language passed its first few version ticks.

Rotsa Ruck.

Racial slur noted.

···

--
  Phlip

Michal Suchanek wrote:

I hate to discuss something related to the development timeline, I know
it's tenable, but when will it be reasonable to expect Unicode support
from Ruby?

Ruby has unicode support. Sort of. Regexes work in UTF-8 when $KCODE
is set to "U" (and the default is "N" even in UTF-8 locales, and if
you specify the -K option in the .rb file it overrides the option
specified on the command line, heh).
The non-regex methods do not work, but you can convert the string with
str.scan(/./)[0] or str.unpack "U*", and use stuff like each, reverse, ...
You have to remember to convert the string back, though.

Thanks

Michal

... or you may use the /re/u regex option to handle UTF-8 encoded
strings (cf. http://snippets.dzone.com/posts/show/4527 ).

Cheers,

j.k.
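A short sketch of the /re/u approach, assuming the input really is valid UTF-8. The helper name utf8_chars is hypothetical:

```ruby
# Split a UTF-8 string into characters using the /u regex flag,
# which tells the regex engine to read the input as UTF-8.
# utf8_chars is a made-up helper name for this example.
def utf8_chars(str)
  str.scan(/./u)                  # one array element per character
end

word = "caf\xC3\xA9"               # "café" written as raw UTF-8 bytes
chars = utf8_chars(word)
puts chars.length                  # character count rather than byte count
puts chars.first(3).join           # the first three characters
```

Note that . does not match newlines by default, so multi-line input would need the /m flag as well.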

···

On 15/09/2007, Zephyr Pellerin <ztz@nxvr.org> wrote:


What about UTF-16?

http://blogs.gnome.org/sudaltsov/2007/09/22/ruby-part2/

···

On 9/21/07, Michal Suchanek <hramrach@centrum.cz> wrote:

On 15/09/2007, Zephyr Pellerin <ztz@nxvr.org> wrote:
> I hate to discuss something related to the development timeline, I know
> it's tenable, but when will it be reasonable to expect Unicode support
> from Ruby?

Ruby has unicode support. Sort of. Regexes work in UTF-8 when $KCODE
is set to "U" (and the default is "N" even in UTF-8 locales, and if
you specify the -K option in the .rb file it overrides the option
specified on the command line, heh).
The non-regex methods do not work, but you can convert the string with
str.scan(/./)[0] or str.unpack "U*", and use stuff like each, reverse, ...
You have to remember to convert the string back, though.

--
Felipe Contreras

Phlip wrote:

Racial slur noted.

You got a problem with Scooby Doo?

For the record, this was NOT intended to slur anything. It was not my
intent, nor is my nature, to slur. However, reading this in hindsight,
it certainly could be taken this way. Please accept my apologies.

Now, I'll rephrase.

Lotsa luck getting something like Unicode implemented when the
underlying C constructs are using such an outdated syntax as Ruby's.

But, as Phlip implies, it's just a simple matter of programming.

Todd

···


Go to unicode.org.
There you can read a full explanation (or a brief one) about why you don't need to worry about UTF-16.
UTF-8 is all you need.
Unicode is something everyone needs to read up on at some point.
I have to read up on it every now and then because my brain leaks.

···

On Sep 28, 2007, at 4:49 PM, Felipe Contreras wrote:

What about UTF-16?

http://blogs.gnome.org/sudaltsov/2007/09/22/ruby-part2/

--
Felipe Contreras

oh, and Mr. Contreras,
I did not mean to say RTFM to you. Sorry if it seemed like that.

Todd Burch wrote:

For the record, this was NOT intended to slur anything. It was not my
intent, nor is my nature, to slur. However, reading this in hindsight,
it certainly could be taken this way. Please accept my apologies.

Oh my apologies too - Scooby Doo is quite over my head. All I could
imagine was Matz in a kimono serving Sake.

···

--
Phlip

Hi,

Lotsa luck getting something like Unicode implemented when the
underlying C constructs are using such an outdated syntax as Ruby's.

The old K&R style has nothing to do with the language's Unicode support.
If you think it does, please elaborate.

It just reflects the history of the language. When I started
developing Ruby, the old Sun CC compiler did not understand the new style,
and I wanted Ruby to run on that platform, which I was using then.

For your information, the next release (1.9) has finally abandoned the old
style.

              matz.

···

In message "Re: Unicode" on Mon, 17 Sep 2007 22:50:12 +0900, Todd Burch <promos@burchwoodusa.com> writes:

Yes but what about stuff already encoded in UTF-16?

···

On 9/29/07, John Joyce <dangerwillrobinsondanger@gmail.com> wrote:

Go to unicode.org.
There you can read a full explanation (or a brief one) about why you
don't need to worry about UTF-16.
UTF-8 is all you need.
Unicode is something everyone needs to read up on at some point.
I have to read up on it every now and then because my brain leaks.

--
Felipe Contreras

Yukihiro Matsumoto wrote:

Hi,

The old K&R style has nothing to do with the language's Unicode support.
If you think it does, please elaborate.

It just reflects the history of the language. When I started
developing Ruby, the old Sun CC compiler did not understand the new style,
and I wanted Ruby to run on that platform, which I was using then.

For your information, the next release (1.9) has finally abandoned the old
style.

              matz.

Thanks Matz.

I'm new to C programming, but not new to programming. Therefore, my
assumption (yes, assumption) was that using whatever compiler switches
were necessary to accept the old-style syntax would rule out the
opportunity to bring in "modern" libraries with Unicode support, and/or
prohibit those aspects of the language that would enable the use of
Unicode features.

So, apparently, they ("they" being Unicode support and the
syntax/compiler switches) are not related, and that's great.

By the way, as an aside, I really like the language you developed and
have made available. I primarily use Ruby with SketchUp (a 3D modeling
program - http://www.sketchup.com) for extending the functionality of
the product. (SketchUp has a Ruby API.) I was looking at the source to
see what it would take to implement a debugger that would work with Ruby
while running under SketchUp. I would like to step through expression
evaluation as the script runs.

(Big aspirations for a new C programmer like myself!)

Todd

···


Yes but what about stuff already encoded in UTF-16?

That's why I said read up on unicode!
After you read that stuff you'll understand why it's no problem.
I'm not going to explain it. Many people understand it, but might make mistakes when explaining it.
Read the unicode stuff carefully. It's vital for many things.

The only thing you might run into is BOM or Endian-ness, but it's doubtful it will be an issue in most cases.
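To make the BOM and endianness point concrete, here is a small sketch. It assumes Ruby 1.9 or later, where String#encode exists (it is not available on the 1.8 series discussed in this thread), and hex_bytes is a made-up helper:

```ruby
# Show the raw bytes of one character under the UTF-16 variants.
# Assumes Ruby 1.9+ for String#encode; hex_bytes is a hypothetical helper.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

text = "A"
puts hex_bytes(text.encode("UTF-16BE"))   # big-endian: high byte first
puts hex_bytes(text.encode("UTF-16LE"))   # little-endian: low byte first
puts hex_bytes(text.encode("UTF-16"))     # generic UTF-16: a byte order mark comes first
```

The same character gives a different byte sequence in each variant, which is why a reader of UTF-16 data has to know the byte order or find it from the BOM.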

This might get you started:
FAQ - UTF-8, UTF-16, UTF-32 & BOM

Even Joel Spolsky wrote a brief bit on Unicode... mostly trumpeting how programmers need to know it and how few actually do.
The short version is that UTF-16 is basically wasteful. It uses 2 bytes for lower-level code points (the stuff also known as the ASCII range) where UTF-8 uses 1.

You really need to spend an afternoon reading about Unicode. It should be required in any computer science program as part of an encoding course. Americans in particular are often the ones who know the least about it....

Hi,

···

In message "Re: Unicode" on Tue, 18 Sep 2007 00:50:36 +0900, Todd Burch <promos@burchwoodusa.com> writes:

I'm new to C programming, but not new to programming. Therefore, my
assumption (yes, assumption) was that using whatever compiler switches
were necessary to accept the old-style syntax would obviate the
opportunity to bring in "modern" libraries with unicode support, and/or
prohibit those aspects of the language that would enable the use of
unicode features.

Even though the old style has some drawbacks (fewer type checks, for
example), it does not have any of the linkage problems you were worried about.

              matz.

That's not always accurate:

$ iconv -f utf-8 -t utf-16 japanese_prose_in_utf8.txt > japanese_prose_in_utf16.txt
Firefly:~/Desktop$ wc japanese_prose_in_utf8.txt
       14 66 5921 japanese_prose_in_utf8.txt
Firefly:~/Desktop$ wc japanese_prose_in_utf16.txt
       16 45 3968 japanese_prose_in_utf16.txt

James Edward Gray II
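The same effect can be reproduced without a test file. This is only a sketch, assuming Ruby 1.9 or later for String#encode; the helper name sizes and both sample strings are made up for illustration:

```ruby
# Compare how many bytes the same text needs in UTF-8 and UTF-16.
# Assumes Ruby 1.9+; sizes is a hypothetical helper for this example.
def sizes(str)
  { :utf8  => str.encode("UTF-8").bytesize,
    :utf16 => str.encode("UTF-16LE").bytesize }   # LE variant, so no BOM is counted
end

ascii    = "Hello, Ruby"             # 11 ASCII characters
japanese = "\u65E5\u672C\u8A9E"      # three kanji: 日本語

p sizes(ascii)      # UTF-16 needs twice the bytes here
p sizes(japanese)   # UTF-8 needs three bytes per kanji, UTF-16 only two
```

The roughly 3:2 ratio for the kanji is close to the 5921 versus 3968 bytes that wc reported above.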

···

On Sep 29, 2007, at 2:13 PM, John Joyce wrote:

The short version is that UTF-16 is basically wasteful.

Hi,

>
> Yes but what about stuff already encoded in UTF-16?

That's why I said read up on unicode!
After you read that stuff you'll understand why it's no problem.
I'm not going to explain it. Many people understand it, but might
make mistakes when explaining it.
Read the unicode stuff carefully. It's vital for many things.

The only thing you might run into is BOM or Endian-ness, but it's
doubtful it will be an issue in most cases.

This might get you started.
FAQ - UTF-8, UTF-16, UTF-32 & BOM

Even Joel Spolsky wrote a brief bit on Unicode... mostly trumpeting
how programmers need to know it and how few actually do.
The short version is that UTF-16 is basically wasteful. It uses 2
bytes for lower-level code-points (the stuff also known as ASCII
range) where UTF-8 does not.

As you suggested I read the article:

I didn't find anything new. It's just explaining character sets in a
rather non-specific way. ASCII uses 7 bits, so it can store 128
characters, so it can't store all the characters in the world, so
other character sets are needed (really? I would have never guessed
that). UTF-16 basically stores characters in 2 bytes (that means more
characters in the world), with 4 bytes for the rarer ones. UTF-8 also
allows more characters, but it doesn't necessarily need 2 bytes: it
uses 1, and if the character is beyond 127 then it will use 2 or more.
This whole thing was originally designed to extend up to 6 bytes,
though the current standard stops at 4.
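The byte counts in that summary are easy to check. A minimal sketch, assuming Ruby 1.9 or later for String#encode; encoded_lengths is a made-up helper and the four code points are arbitrary samples:

```ruby
# Report how many bytes one code point takes in UTF-8 and in UTF-16.
# Assumes Ruby 1.9+; encoded_lengths is a hypothetical helper.
def encoded_lengths(codepoint)
  char = [codepoint].pack("U")                  # build the character as UTF-8
  [char.bytesize, char.encode("UTF-16LE").bytesize]
end

# ASCII letter, accented Latin letter, kanji, and a code point above U+FFFF.
[0x41, 0xE9, 0x65E5, 0x1F600].each do |cp|
  utf8, utf16 = encoded_lengths(cp)
  puts format("U+%04X  UTF-8: %d bytes  UTF-16: %d bytes", cp, utf8, utf16)
end
```

The last sample shows that UTF-16 is a variable-width encoding too: code points above U+FFFF are stored as a pair of 2-byte units.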

So what exactly am I looking for here?

You really need to spend an afternoon reading about unicode. It
should be required in any computer science program as part of an
encoding course, Americans in particular are often the ones who know
the least about it....

What is there to know about Unicode? There's a couple of character
sets, use UTF-8, and remember that one character != one byte. Is there
anything else for practical purposes?

I'm sorry if I'm being rude, but I really don't like when people tell
me to read stuff I already know.

My question is still there:

Let's say I want to rename a file "fooobar", and remove the third "o",
but it's UTF-16, and Ruby only supports UTF-8, so I remove the "o" and
of course there will still be a 0x00 in there. That's if the string is
recognized at all.

Why is there no issue with UTF-16 if only UTF-8 is supported?

I don't mind reading some more if I can actually find the answer.

Best regards.
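One way to handle the "fooobar" case is to transcode, edit, and transcode back. This is only a sketch under stated assumptions: it uses String#encode and force_encoding from Ruby 1.9 or later (on 1.8 a conversion library such as Iconv would have to do the transcoding), it assumes the name is little-endian UTF-16 without a BOM, and edit_utf16 is a made-up helper:

```ruby
# Edit UTF-16 data by converting to UTF-8, changing it, and converting back.
# Assumes Ruby 1.9+; edit_utf16 is a hypothetical helper, not a Ruby API.
def edit_utf16(bytes, source_encoding = "UTF-16LE")
  text = bytes.dup.force_encoding(source_encoding).encode("UTF-8")
  edited = yield(text)               # the caller edits ordinary UTF-8 text
  edited.encode(source_encoding)     # back to the original encoding
end

name16 = "fooobar".encode("UTF-16LE")            # stand-in for a UTF-16 file name
fixed  = edit_utf16(name16) { |s| s.sub("ooo", "oo") }

puts fixed.encode("UTF-8")    # the edited name, readable again
puts fixed.bytesize           # two bytes per remaining character
```

Because the "o" is removed while the text is in UTF-8, its 0x00 companion byte never gets left behind in the UTF-16 result.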

···

On 9/29/07, John Joyce <dangerwillrobinsondanger@gmail.com> wrote:

--
Felipe Contreras

Interesting that you would generate more lines, fewer words, and fewer bytes (probably explained by fewer words...).
wc defines words as whitespace-delimited, which is extremely interesting considering that Japanese uses no whitespace except in page layout. Grammar does not dictate any whitespace at all. At most, Japanese prose might have one whitespace between sentences, perhaps only between "paragraphs".

I don't know how iconv handles things. man iconv says it uses iswspace(3), which is in wctype.h, but I always hate reading those headers.
I tried using iconv on a file in utf-8 to utf-16 and then back again. Results are similar, but interestingly, it's no indication of file size. Files are the same size.
I then tried the same with some code in C++ and similar results occurred.
It would seem to be a whitespace issue. I didn't realize this, but it does look like utf-8 is generating fewer whitespace characters while generating a bigger file...?
I'm curious what the deal is there.

In theory utf-8 should do better than utf-16 for characters in the ASCII range...
at least that was my understanding. And assuming code files are largely ASCII character sets...
hmm...!?
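One possible explanation for the odd line counts, offered as a guess rather than something checked against that file: a tool that scans raw bytes will see newline bytes inside UTF-16 characters. The sketch below assumes Ruby 1.9 or later; raw_counts is a made-up helper and the sample text is invented:

```ruby
# Count the bytes a byte-oriented tool would take for newlines or spaces.
# Assumes Ruby 1.9+; raw_counts is a hypothetical helper for this example.
def raw_counts(str)
  bytes = str.bytes.to_a
  { :bytes         => bytes.length,
    :newline_bytes => bytes.count(0x0A),
    :space_bytes   => bytes.count(0x20) }
end

# U+300A, U+672C, U+300B (《本》) followed by one real newline.
sample = "\u300A\u672C\u300B\n"

p raw_counts(sample.encode("UTF-8"))      # one newline byte, from the real newline
p raw_counts(sample.encode("UTF-16BE"))   # U+300A is stored as 30 0A, so 0x0A shows up twice
```

Fewer bytes but more apparent lines is the same pattern as in the wc output above, although the word counts there would also depend on the locale wc ran under.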

···

On Sep 29, 2007, at 2:29 PM, James Edward Gray II wrote:

On Sep 29, 2007, at 2:13 PM, John Joyce wrote:

The short version is that UTF-16 is basically wasteful.

That's not always accurate:

$ iconv -f utf-8 -t utf-16 japanese_prose_in_utf8.txt > japanese_prose_in_utf16.txt
Firefly:~/Desktop$ wc japanese_prose_in_utf8.txt
      14 66 5921 japanese_prose_in_utf8.txt
Firefly:~/Desktop$ wc japanese_prose_in_utf16.txt
      16 45 3968 japanese_prose_in_utf16.txt

James Edward Gray II