Strange behaviour of Strings in Range

Hi,

r1 = ("\000" .. "\377") # all characters?

r1.to_a

=> … "6", "7", "8", "9"]

r1.to_a.size

=> 58

Hm, I guess this is because "9".succ gives "10", and "10" has a size
of two.

But why does "9".succ result in "10"?

Regards,

Michael

“Michael Neumann” mneumann@ntecs.de wrote in message
news:20040501112120.GC794@miya.intranet.ntecs.de

> Hi,
>
> r1 = ("\000" .. "\377") # all characters?
>
> r1.to_a
>
> => … "6", "7", "8", "9"]
>
> r1.to_a.size
>
> => 58
>
> Hm, I guess this is because "9".succ gives "10", and "10" has a size
> of two.
>
> But why does "9".succ result in "10"?

IMHO this is a Perlism so you can count with strings:

irb(main):010:0> ("0".."20").to_a
=> ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12",
"13", "14", "15", "16", "17", "18", "19", "20"]

robert
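
As an illustration of the counting behaviour described above, here is a minimal sketch of how String#succ carries, which is why "9".succ yields a two-character string. The method names succ_examples and count_with_strings are made up for this example; only String#succ and Range#to_a come from Ruby itself.

```ruby
# String#succ increments the rightmost alphanumeric character and carries
# to the left, growing the string when the carry runs off the front.
def succ_examples
  {
    "8"  => "8".succ,   # plain increment
    "9"  => "9".succ,   # carry: the result is two characters long
    "az" => "az".succ,  # carry within letters
    "zz" => "zz".succ,  # carry that grows the string
  }
end

# Counting with strings, as in the ("0".."20") example above.
def count_with_strings(first, last)
  (first..last).to_a
end

succ_examples.each { |from, to| puts "#{from.inspect}.succ => #{to.inspect}" }
puts count_with_strings("0", "20").size
```

Because the successor of "9" is longer than any single-character end point, iterating a range of one-character strings cannot continue past "9" by way of succ alone.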

Hi,

In message “Strange behaviour of Strings in Range” on 04/05/01, Michael Neumann mneumann@ntecs.de writes:

> r1 = ("\000" .. "\377") # all characters?
>
> r1.to_a
>
> => … "6", "7", "8", "9"]
>
> r1.to_a.size
>
> => 58
>
> Hm, I guess this is because "9".succ gives "10", and "10" has a size
> of two.
>
> But why does "9".succ result in "10"?

It’s caused by “succ” magic. Let me think about either subtracting
magic, or adding more magic.

						matz.

Yukihiro Matsumoto wrote:

> Hi,
>
> > r1 = ("\000" .. "\377") # all characters?
> >
> > r1.to_a
> >
> > => … "6", "7", "8", "9"]
> >
> > r1.to_a.size
> >
> > => 58
> >
> > Hm, I guess this is because "9".succ gives "10", and "10" has a size
> > of two.
> >
> > But why does "9".succ result in "10"?
>
> It’s caused by “succ” magic. Let me think about either subtracting
> magic, or adding more magic.

“9” is not really a character anyway, but a string consisting of
one character.

In current Ruby, 0..0377 would work, since a character is essentially
a Fixnum.

Will Rite have a better-defined notion of “character”? Perhaps including
Unicode and such?

Hal
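
The integer-range suggestion above can be sketched as follows. This assumes Integer#chr to turn each value into a one-byte string; the helper name all_byte_strings is hypothetical and not something from the thread.

```ruby
# Enumerate byte values as integers, then convert each to a one-byte string.
# 0377 is octal for 255.
def all_byte_strings
  (0..0377).map { |byte| byte.chr }
end

bytes = all_byte_strings
puts bytes.size          # one entry per byte value
puts bytes[57].inspect   # the byte for "9"
puts bytes[58].inspect   # the next byte value, ":", rather than "10"
```

Since the iteration runs over integers, no string successor is ever computed and the sequence does not stop at "9".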

Hi,

In message “Re: Strange behaviour of Strings in Range” on 04/05/03, Hal Fulton hal9000@hypermetrics.com writes:

> Will Rite have a better-defined notion of “character”? Perhaps including
> Unicode and such?

No. The definition of “character” should belong to the application
domain, I believe. Considering internationalization, any particular
definition of character can not satisfy all.

						matz.

Yukihiro Matsumoto wrote:

> Hi,
>
> > Will Rite have a better-defined notion of “character”? Perhaps including
> > Unicode and such?
>
> No. The definition of “character” should belong to the application
> domain, I believe. Considering internationalization, any particular
> definition of character can not satisfy all.

These are two or three separate issues, I believe.

I know that no one encoding scheme will suffice for Asian languages
as well as European. Unicode in that sense is largely a dream, as I
understand it.

And I do not favor a Char class, which seems unnecessary to me.

But here are some related questions, to get more specific:

  1. Will str[0] always be a Fixnum?

  2. Will ?x always be a Fixnum?

  3. In addition to each_byte, would each_char make sense? As I see it,
    it would default to be the same as each_byte, but would be replaced
    for a wide-char or multibyte variable-length encoding.

But I18N is one of the areas of my greatest ignorance in Ruby.

Thanks,
Hal

“Yukihiro Matsumoto” matz@ruby-lang.org wrote in message
news:1083540203.859920.6241.nullmailer@picachu.netlab.jp…

> Hi,
>
> > Will Rite have a better-defined notion of “character”? Perhaps including
> > Unicode and such?
>
> No. The definition of “character” should belong to the application
> domain, I believe. Considering internationalization, any particular
> definition of character can not satisfy all.

So then what’s Unicode for in the first place? I thought the aim was to
have a universal encoding for all chars. Did I miss something?

IMHO Ruby as it is today determines the notion of “character” by the way
strings and regexps are handled and thus a char is a byte. IMHO
characters are so basic that you can’t delegate that to the application
domain. You can delegate transformations but not having an internal
standard representation strikes me as difficult.

Maybe I’m overlooking something, if so, please let me know.

Regards

robert

Hi,

In message “Re: Strange behaviour of Strings in Range” on 04/05/03, Hal Fulton hal9000@hypermetrics.com writes:

> But here are some related questions, to get more specific:
>
>   1. Will str[0] always be a Fixnum?

Rite gives a 1-char string for str[0].

>   2. Will ?x always be a Fixnum?

It will be a 1-char string.

>   3. In addition to each_byte, would each_char make sense? As I see it,
>     it would default to be the same as each_byte, but would be replaced
>     for a wide-char or multibyte variable-length encoding.

It makes sense, but I have not yet decided whether to add it.

						matz.
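
For comparison, Ruby 1.9 and later behave the way the answers above describe for Rite, and those versions also provide String#each_char. The following is only a sketch that assumes such an interpreter; the helper names first_char, char_literal and bytes_and_chars are illustrative.

```ruby
# Behaviour in Ruby 1.9 and later (an assumption about the interpreter in use).
def first_char(str)
  str[0]   # a one-character String, not an Integer
end

def char_literal
  ?x       # the ?x literal is also a one-character String
end

# each_byte walks raw bytes; each_char walks characters in the string's encoding.
def bytes_and_chars(str)
  [str.each_byte.to_a, str.each_char.to_a]
end

word = "na\u00EFve"   # "naive" with a diaeresis; the \u escape yields UTF-8
bytes, chars = bytes_and_chars(word)
puts first_char(word).inspect
puts char_literal.inspect
puts first_char(word).ord   # String#ord recovers the integer value
puts "#{bytes.size} bytes, #{chars.size} characters"
```

For a pure ASCII string the two iterations yield the same number of elements; they diverge only once a multibyte character appears.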

On Mon, 3 May 2004 10:26:50 +0200, “Robert Klemme” bob.news@gmx.net wrote:

> So then what’s Unicode for in the first place? I thought the aim was to
> have a universal encoding for all chars. Did I miss something?

I found this article really interesting, maybe it can help you too.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software

Hi,

In message “Re: Strange behaviour of Strings in Range” on 04/05/03, “Robert Klemme” bob.news@gmx.net writes:

> So then what’s Unicode for in the first place? I thought the aim was to
> have a universal encoding for all chars. Did I miss something?

It’s their intention. Whether it succeeds or not is another story.
I think they tried their best, but it is virtually impossible to
satisfy all requirements for internationalization.

						matz.

“gabriele renzi” surrender_it@remove.yahoo.it wrote in message
news:879c90l0qvtoqf3182l83nhjpb1ih6fisb@4ax.com

> > So then what’s Unicode for in the first place? I thought the aim was to
> > have a universal encoding for all chars. Did I miss something?
>
> I found this article really interesting, maybe it can help you too.
> The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software

Nicely written, but nothing I didn’t know already. Still, the question
remains what Ruby does about handling mixed content internally. IMHO the
most efficient way is to store code points internally. An alternative
would be to store a raw binary stream together with its encoding, but that
would make comparisons (which happen all the time, just think of hash
lookups) slow for strings with different encodings.

IMHO the Java approach* (although it burns memory by using 16 bits per char)
is the most practical among current programming languages. And it wouldn’t
bother me if Ruby borrowed that, especially when considering attempts to use
Java bytecode and a JVM as the runtime system.

Regards

robert

* Characters are stored internally with 16 bits, thus allowing a lot
  (although not all) of the Unicode code points to be representable. Input
  and output always use an encoding (either explicit or, implicitly, the
  platform’s default encoding). There’s built-in support for a number of
  well-known encodings, including UTF-8, UTF-16, ISO-8859-1 etc.
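
The comparison cost raised in this post can be illustrated with Ruby 1.9 or later, where each string carries its own encoding. This is only a sketch under that assumption; same_text_two_encodings and same_code_points? are hypothetical helpers written for the example.

```ruby
# The same character held as raw bytes in two different encodings.
def same_text_two_encodings
  utf8   = "\u00E9"                    # e with acute accent as UTF-8 (two bytes)
  latin1 = utf8.encode("ISO-8859-1")   # the same character as one byte
  [utf8, latin1]
end

# Comparing by code points requires decoding both strings first.
def same_code_points?(a, b)
  a.codepoints == b.codepoints
end

utf8, latin1 = same_text_two_encodings
puts utf8 == latin1                    # raw comparison across encodings
puts same_code_points?(utf8, latin1)   # comparison after decoding
```

The raw comparison treats the two strings as different, so an equality test that respects the text has to decode or transcode at least one side, which is the extra work the post is concerned about.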

“Yukihiro Matsumoto” matz@ruby-lang.org wrote in message
news:1083589915.224825.7492.nullmailer@picachu.netlab.jp…

> Hi,
>
> > So then what’s Unicode for in the first place? I thought the aim was to
> > have a universal encoding for all chars. Did I miss something?
>
> It’s their intention. Whether it succeeds or not is another story.
> I think they tried their best, but it is virtually impossible to
> satisfy all requirements for internationalization.

But does that mean one shouldn’t try? I mean, Java shows that it can work
quite well (though I don’t know about using Japanese “characters” with
Java). I know, it’s a difficult topic especially since people stuck
with ASCII for such a long time, but I’ve always felt that encodings are a
weak spot of Ruby. But then, maybe I’m overlooking something or some
feature…

Kind regards

robert

On Monday, May 3, 2004, 11:12:20 PM, Yukihiro wrote:

> Hi,
>
> In message “Re: Strange behaviour of Strings in Range” on 04/05/03, “Robert Klemme” bob.news@gmx.net writes:
>
> > So then what’s Unicode for in the first place? I thought the aim was to
> > have a universal encoding for all chars. Did I miss something?
>
> It’s their intention. Whether it succeeds or not is another story.
> I think they tried their best, but it is virtually impossible to
> satisfy all requirements for internationalization.

Why is that? Is there not enough room for every character known to
man, or is there some other problem?

Gavin

> * Characters are stored internally with 16 bits, thus allowing a lot

What do you do when you need 24 bits?

> (although not all) of the Unicode code points to be representable. Input
> and output always use an encoding (either explicit or, implicitly, the
> platform's default encoding). There's built-in support for a number of
> well-known encodings, including UTF-8, UTF-16, ISO-8859-1 etc.

Only Western ones, as far as I can see :-))

Guy Decoux

Hi,

In message “Re: Strange behaviour of Strings in Range” on 04/05/03, “Robert Klemme” bob.news@gmx.net writes:

> But does that mean one shouldn’t try?

Did I say such a thing? Trying is a good thing.

						matz.

Hi,

In message “Re: Strange behaviour of Strings in Range” on 04/05/03, Gavin Sinclair gsinclair@soyabean.com.au writes:

> Why is that? Is there not enough room for every character known to
> man, or is there some other problem?

Some other problems. I really wish things were that simple.

						matz.

“Yukihiro Matsumoto” matz@ruby-lang.org wrote in message
news:1083595238.619985.7633.nullmailer@picachu.netlab.jp…

> Hi,
>
> > But does that mean one shouldn’t try?
>
> Did I say such a thing? Trying is a good thing.

Your note “The definition of “character” should belong to the application
domain” sounded to me as if you weren’t considering enhancing Unicode
handling in Ruby. I’m sorry if I misread you.

Then what’s the approach planned at the moment?

Kind regards

robert

“ts” decoux@moulon.inra.fr wrote in message
news:200405031322.i43DMWa02004@moulon.inra.fr

> > * Characters are stored internally with 16 bits, thus allowing a lot
>
> What do you do when you need 24 bits?

As far as I can see, currently 20 bits are sufficient :-)
http://www.unicode.org/charts/

And anything after “Special” looks really quite special to me. At least
Western languages as well as Kanji, Hiragana and Katakana are supported.
IMHO pragmatically 16 bits are good enough.

> > (although not all) of the Unicode code points to be representable. Input
> > and output always use an encoding (either explicit or, implicitly, the
> > platform’s default encoding). There’s built-in support for a number of
> > well-known encodings, including UTF-8, UTF-16, ISO-8859-1 etc.
>
> Only Western ones, as far as I can see :-))

I didn’t send the complete list. Apart from that, UTF-8 and UTF-16 handle
*all* Unicode chars. See

Regards

robert
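
To make the 16-bit question concrete, the sketch below measures how wide a code point is and how many bytes it takes in UTF-8 and UTF-16. It assumes Ruby 2.1 or later for Integer#bit_length; encoded_sizes is a hypothetical helper and the two sample code points are chosen only for illustration. Note that the Unicode code space runs up to U+10FFFF, which takes 21 bits.

```ruby
# Report the bit width of a code point and its size in two encodings.
def encoded_sizes(code_point)
  char = [code_point].pack("U")   # build a UTF-8 string from a code point
  {
    bits:  code_point.bit_length,
    utf8:  char.bytesize,
    utf16: char.encode("UTF-16BE").bytesize,
  }
end

p encoded_sizes(0x3042)    # HIRAGANA LETTER A, fits in 16 bits
p encoded_sizes(0x1D11E)   # MUSICAL SYMBOL G CLEF, needs a surrogate pair in UTF-16
p 0x10FFFF.bit_length      # width of the largest Unicode code point
```

Code points above U+FFFF are still representable in UTF-16, but only as two 16-bit units, so a fixed 16-bit char no longer corresponds to one character.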

Hi,

In message “Re: Strange behaviour of Strings in Range” on 04/05/04, “Robert Klemme” bob.news@gmx.net writes:

> Your note “The definition of “character” should belong to the application
> domain” sounded to me like you didn’t consider enhancing unicode treatment
> in Ruby. I’m sorry if I misread you.
>
> Then what’s the approach planned at the moment?

The basic idea is your “alternative” in [ruby-talk:99089].
We are showing it’s not insane by making a prototype.

Could you search the ruby-talk archive with the keyword I18N for more
detail? Or you can check the ruby_m17n branch in the CVS.

						matz.
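
Ruby releases after this thread (1.9 onward) attach an encoding to every string, which resembles the "raw bytes plus encoding" alternative mentioned above. The sketch below assumes such an interpreter and its built-in EUC-JP transcoder; japanese_sample and describe are hypothetical helpers, and nothing here is taken from the ruby_m17n prototype itself.

```ruby
# Three kanji written with escapes so the example does not depend on the
# source file's encoding.
def japanese_sample
  "\u65E5\u672C\u8A9E"
end

# Summarise the encoding tag, byte count and character count of a string.
def describe(str)
  { encoding: str.encoding.name, bytes: str.bytesize, chars: str.size }
end

utf8  = japanese_sample
eucjp = utf8.encode("EUC-JP")     # same text, different bytes and tag

p describe(utf8)
p describe(eucjp)
p eucjp.encode("UTF-8") == utf8   # transcoding back recovers the text
```

Both strings report the same number of characters even though their byte sequences differ, because the character count is interpreted through each string's own encoding.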

> As far as I can see, currently 20 bits are sufficient :-)
> http://www.unicode.org/charts/

What do you do with documents with Japanese EUC encoding?

> I didn't send the complete list. Apart from that, UTF-8 and UTF-16 handle
> *all* Unicode chars. See

Like I've said previously: European-centric vision ...

Guy Decoux