Strange behaviour of Strings in Range

Hi everyone.

I’ve looked up the references Robert Klemme found by searching for I18N in
the Ruby-talk archives, done the same search myself, and read up some more
on Unicode. One of the requirements someone wrote up was that
multinationalizing Ruby must not make it any harder to work with the
Japanese character set than it already is, and that having to convert
existing Japanese files to Unicode was simply not an option.

Given all that, I’ve (to my surprise) reached the conclusion that Unicode
does not seem like the way to go (at least, not as part of the Ruby
language proper; a good Unicode library does seem like a must). The most
flexible option looks to be letting the creator of a string object
specify its character width, and giving the string object the capability
of reporting its own character width. That is basically what the C++
standardisation group did: IIRC, any sequence of ‘character-like’ objects
that can be copied without calls to non-default constructors or
destructors (and especially without calls to non-default copy
constructors) is, technically and in principle, a standards-compliant C++
string, regardless of the character width of those ‘character-like’
objects. Of course, the C++ standard also defines iostream and
stringstream libraries that can be used with both std::basic_string<char>
and std::basic_string<wchar_t> objects… (though unfortunately not all
library implementations have caught up yet). A more doable alternative to
requiring such character-width-independent libraries might be special
reader/writer classes that handle conversions to and from specific byte
widths (and perhaps also endianness conversions) where necessary, which I
gather is what Java does.
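[Editorially, for context: this per-string model is essentially what Ruby itself later adopted with 1.9’s M17N work: every String is tagged with an encoding and can report it. A minimal sketch using that later API:]

```ruby
# Each String carries and reports its own encoding (Ruby 1.9+).
s = "résumé"                     # literal in the source encoding, UTF-8
puts s.encoding                  # UTF-8
puts "#{s.length} characters"    # 6 characters...
puts "#{s.bytesize} bytes"       # ...stored in 8 bytes

# Explicit transcoding when a fixed width is wanted:
latin1 = s.encode("ISO-8859-1")  # one byte per character here
puts latin1.bytesize             # 6
```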

Sincerely,
Martin

···

MSN Search, for clear search results! http://search.msn.nl

As an outsider, and as an application programmer, if the language has to
resort to special add-on libraries or different method names to handle
different character data types, forget it.

A string is a string is a string. As an application programmer I want to be
able to use substring, split, character by positional index, and all the
other standard string methods and not have to worry about what kind of string
data I am handling. I’m prepared to pass a parameter to stream handlers
telling them what type of encoding to use between external streams and
internal representations if I have to - or even between different internal
representations if that’s the sort of thing that is a particular programmer’s
bag. I could even live with parameters like that on the standard string
methods if I had to.
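[Editorially: the stream-handler parameter described here is exactly the shape Ruby later shipped; an open mode of the form "r:EXTERNAL:INTERNAL" names the on-disk encoding and the encoding to transcode to on the way in. A sketch, with a made-up scratch file name:]

```ruby
# Write a few Shift_JIS bytes, then read them back, asking the stream
# to hand the program UTF-8 internally -- the "r:EXT:INT" open mode.
path = "sample_sjis.txt"         # hypothetical scratch file
File.write(path, "テスト\n".encode("Shift_JIS"), mode: "wb")

File.open(path, "r:Shift_JIS:UTF-8") do |f|
  line = f.gets
  puts line.encoding             # UTF-8, whatever was on disk
  puts line                      # テスト
end
File.delete(path)
```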

But the original code that I write and test to work with &mylanguageofchoice;
MUST work with ANY other language, without change. All that users of other
languages should have to do to get my program to work in their language is
translate the fixed strings (and perhaps adjust GUI layouts because of
sizing). This is absolutely basic and should be in the Ruby core. I thought
this kind of thing was why we have OO.

How you do it, I don’t care. I don’t care what the internal representation is;
that’s an implementation detail, the sort of thing that OO should be hiding
from me anyway. All I know is I’m handling a string, and I want to do string
manipulations. I don’t care if it’s Unicode or ASCII, Kanji or Kanuck.

Sorry, if I can’t do that in Ruby, it’s broken and, I’m afraid, unusable.

···

On Tuesday 04 May 2004 15:28, Martin Elzen wrote:

[snip]


Clear skies!
Mike Calder.

Mike Calder wrote:

[snip]


How you do it, I don’t care. I don’t care what the internal representation is;
that’s an implementation detail, the sort of thing that OO should be hiding
from me anyway. All I know is I’m handling a string, and I want to do string
manipulations. I don’t care if it’s Unicode or ASCII, Kanji or Kanuck.

Sorry, if I can’t do that in Ruby, it’s broken and, I’m afraid, unusable.

Can you do that with any language whatsoever? Don’t say Java, because
it’s limited to Unicode.

It’s not a flame. It’s an honest question. Is there any programming
language/environment anywhere where I18N issues are completely
abstracted away and everything is done “nicely”?

Hal

“Hal Fulton” hal9000@hypermetrics.com wrote in message
news:4097C475.1050000@hypermetrics.com

Mike Calder wrote:

[snip]


Can you do that with any language whatsoever? Don’t say Java, because
it’s limited to Unicode.

It’s not a flame. It’s an honest question. Is there any programming
language/environment anywhere where I18N issues are completely
abstracted away and everything is done “nicely”?

Dunno, but if I had to guess I’d point to Eiffel or maybe a newer version of
Ada…

robert

No flame understood, and none intended here either, but I didn’t ask for I18N
issues to be completely abstracted away. I said I wanted to be able to write
in one language, to not have to consider the string encoding when writing
code, and for users of any other language to be able to use the code just by
translating the strings. I probably misunderstand the situation (I often do,
until people beat the details into my head with hammers), but as far as I can
see, and in my practical experience, Unicode (and by extension Java) gives me
an “abstracted-enough” solution for my applications and the new data they
generate.

If I understood the problems with Unicode stated earlier in this thread (and
I must admit I skimmed a lot of it), they come down to there being no unique
encoding for every glyph/character representation in every language (e.g.,
one Unicode code point for a glyph that represents different characters in
different languages), plus legacy encodings and legacy-encoded data, and so
on. So you tell me Unicode doesn’t hack it and you want to use something
else. I have no problem with that; I’m not emotionally attached to Unicode
(or Java, come to that; that’s why I’m looking, after all).

These problems may cause operational difficulties, and a totally clean
solution probably has to handle all of them. Meantime, in the real world
(and on my plate), there is a need for I18N which ASCII just doesn’t hack.

All I know is that using Java and Unicode I can write code in English, have
people translate it into any European language (including ones written in
Cyrillic and Greek), into Hindi and Japanese, use it with those languages,
and it looks and behaves correctly in the target language. I know, because
that has been both required of and done with code I’ve written.
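[Editorially: the behaviour being asked for here, the same String methods giving the same answers whatever script the data is in, can be sketched in later (1.9+) Ruby:]

```ruby
# split, indexing, and length behave identically across scripts.
["hello world", "こんにちは 世界", "привет мир"].each do |s|
  first = s.split(" ").first
  puts "#{first} starts with #{first[0]} (#{s.length} chars)"
end
```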

The only points I’m making are these:

  1. When you get round to implementing whatever (let’s call it SuperCode) in
    Ruby, it should be part of the core and transparent to the user of String
    methods, and

  2. Until Ruby does with SuperCode or whatever what Java does today with
    Unicode, it’s no use to me for production code and remains an (albeit pretty)
    toy as far as my practical applications are concerned.

···

On Tuesday 04 May 2004 17:27, Hal Fulton wrote:

[snip]


Clear skies!
Mike Calder.

Hi,

  1. When you get round to implementing whatever (let’s call it SuperCode) in
    Ruby, it should be part of the core and transparent to the user of String
    methods, and

Don’t worry. It would be transparent except you have to declare your
favorite encoding once somewhere somehow, unless you want to treat
multiple encodings in one application.
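[Editorially: released Rubys did end up with this declare-once shape: Ruby 1.8 had the global $KCODE, and 1.9+ has process-wide encoding defaults. A sketch of the later form:]

```ruby
# One process-wide declaration; individual strings and streams can
# still override it where an application mixes encodings.
Encoding.default_external = Encoding::UTF_8  # assumed for all IO
Encoding.default_internal = Encoding::UTF_8  # transcode target on read
puts Encoding.default_external               # UTF-8
```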

  2. Until Ruby does with SuperCode or whatever what Java does today with
    Unicode, it’s no use to me for production code and remains an (albeit pretty)
    toy as far as my practical applications are concerned.

Right. Vaporware does not help you.

						matz.
···

In message “Re: Strange behaviour of Strings in Range” on 04/05/05, Mike Calder ceo@phosco.com writes: