Unicode roadmap?

Because it would be a disaster. You want real world examples? Take a
look at any of the pure Ruby code in the Win32Utils examples where I
have to take slices out of character buffers and pack or unpack them
into the appropriate value. I'm guessing this might apply to Ruby/DL as
well.
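
To make the buffer slicing described above concrete, here is a minimal sketch. The buffer layout (two little-endian 32-bit integers followed by a 4-byte tag) is invented for illustration and is not taken from Win32Utils.

```ruby
# Hypothetical 12-byte buffer: two little-endian 32-bit integers, then a 4-byte tag.
# Array#pack with these directives returns a binary (ASCII-8BIT) string.
buf = [1024, 7, "ABCD"].pack("VVa4")

# Slice fixed byte ranges out of the buffer and unpack each one.
width = buf[0, 4].unpack("V").first   # bytes 0..3
count = buf[4, 4].unpack("V").first   # bytes 4..7
tag   = buf[8, 4]                     # bytes 8..11

puts width, count, tag
```

The fixed offsets only work because every element of the buffer is exactly one byte wide, which is the assumption under discussion in this thread.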

I'm sure there are *many* people using character access in "real world"
code.

Regards,

Dan

···

-----Original Message-----
From: Garance A Drosehn [mailto:drosihn@gmail.com]
Sent: Tuesday, June 27, 2006 3:56 PM
To: ruby-talk ML
Subject: Re: Unicode roadmap?

On 6/26/06, Daniel DeLorme <dan-ml@dan42.com> wrote:
>
> It's funny, maybe I'm just dumb but I can't think of a single
> *real-world* example where you'd want to access particular
> characters of a string.

If that is the case, then why doesn't Ruby remove *all*
substring notation?

Raising my hand, but the question might be who does character access
on _Unicode_ strings. I play with byte arrays all the time (sometimes
with embedded strings), but very rarely (but I do) use string slicing
against a Unicode string.

I am in the (unfortunate) position of dealing with many legacy binary
files, encoded into a wide variety of pieces and parts -- I use string
slicing, but more exactly I use byte array slicing (don't get me wrong
-- I want to keep a single String class).

pth

···

On 6/27/06, Berger, Daniel <Daniel.Berger@qwest.com> wrote:

I'm sure there are *many* people using character access in "real world"
code.

Point granted, but I bet the Win32 stuff assumes 8-bit "characters" and thus fixed offsets. -Tim

···

On Jun 27, 2006, at 3:05 PM, Berger, Daniel wrote:

If that is the case, then why doesn't Ruby remove *all*
substring notation?

Because it would be a disaster. You want real world examples? Take a
look at any of the pure Ruby code in the Win32Utils examples where I
have to take slices out of character buffers and pack or unpack them
into the appropriate value. I'm guessing this might apply to Ruby/DL as
well.

I have an example, but I'm sure most would consider it "cheating". Let's say I need to write a Regexp engine... :wink:

···

On Jun 27, 2006, at 7:59 PM, Tim Bray wrote:

On Jun 27, 2006, at 3:05 PM, Berger, Daniel wrote:

If that is the case, then why doesn't Ruby remove *all*
substring notation?

Because it would be a disaster. You want real world examples? Take a
look at any of the pure Ruby code in the Win32Utils examples where I
have to take slices out of character buffers and pack or unpack them
into the appropriate value. I'm guessing this might apply to Ruby/DL as
well.

Point granted, but I bet the Win32 stuff assumes 8-bit "characters" and thus fixed offsets. -Tim

Maybe I should fork a version of Ruby tailored specifically to Windows. I'll replace all char pointer declarations with tchar pointers, set MBCS, and automatically convert all strings to wide strings using MultiByteToWideChar() behind the scenes, using whatever code page they want, defaulting to CP_UTF8.

Right after I get my VC funding. :wink:

Regards,

Dan

···

On Jun 27, 2006, at 3:05 PM, Berger, Daniel wrote:

>
> I'm sure there are *many* people using character access in "real world"
> code.

Raising my hand, but the question might be who does character access
on _Unicode_ strings. I play with byte arrays all the time (sometimes
with embedded strings), but very rarely (but I do) use string slicing
against a Unicode string.

I guess that would be anyone in Eastern Europe with Cyrillic-based
alphabets :slight_smile: Especially those dealing with web apps. *Sigh*

On the other hand, I wonder who in his right mind would want to work
with _strings_ as a sequence of bytes? :wink: 90% of developers out
there don't even know how encodings work. So all this manual
back-and-forth conversion, moving through bytes, etc. would only be
perceived as voodoo magic. *A deeper sigh*
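
The gap between characters and bytes for Cyrillic text can be sketched as follows. This assumes a later Ruby (1.9 or newer, with `String#byteslice` from 1.9.3) and a UTF-8 source file. It is not how Ruby 1.8 behaved at the time of this thread.

```ruby
# Assumes the source file is UTF-8 (the default since Ruby 2.0).
word = "привет"          # six Cyrillic letters, two bytes each in UTF-8

chars = word.length      # counts characters
bytes = word.bytesize    # counts bytes

first_two_chars = word[0, 2]            # character-based slice
first_two_bytes = word.byteslice(0, 2)  # byte-based slice: only the first letter

puts chars, bytes, first_two_chars, first_two_bytes
```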

I am in the (unfortunate) position of dealing with many legacy binary
files, encoded into a wide variety of pieces and parts -- I use string
slicing, but more exactly I use byte array slicing (don't get me wrong
-- I want to keep a single String class).

I wonder, what is wrong with making all strings Unicode by default
(this will ensure that most libraries don't automagically break once
Ruby is upgraded), _but_ let developers optionally decide whether they
want a different encoding:

s = new String #=> unicode string
sj = new String(:encoding => 'jis')
scp = new String(:encoding => 'CP1251')
sb = new String(:binary => true) #=> work as ByteArray
sbf = new String(:encoding => 'funny encoding', :binary => true) #=> work as ByteArray

There are numerous performance issues, I suppose. And other problems
like assignment operations. *Sigh*
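
For comparison, later Ruby releases (1.9 onward) attach an encoding to each string instance, which is close in spirit to the proposal above. The following is a minimal sketch in that later API, not in the hypothetical constructor syntax shown in this message. The `encoding:` keyword on `String.new` needs Ruby 2.3 or newer.

```ruby
s   = "text"                                     # UTF-8 when the source file is UTF-8
sj  = String.new("text", encoding: "Shift_JIS")  # tag the bytes as Shift_JIS (no transcoding)
scp = "text".encode("Windows-1251")              # transcode to CP1251
sb  = "text".b                                   # binary (ASCII-8BIT) copy, used like a byte array

puts sj.encoding.name, scp.encoding.name
puts sb.encoding == Encoding::BINARY
```

Note the difference between tagging and transcoding: `String.new(..., encoding:)` and `force_encoding` relabel existing bytes, while `encode` converts them.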

···

On 6/28/06, Patrick Hurley <phurley@gmail.com> wrote:

On 6/27/06, Berger, Daniel <Daniel.Berger@qwest.com> wrote:

Trust me. You don't want to do that. TCHAR with -DUNICODE is pure evil.

-austin

···

On 6/27/06, Daniel Berger <djberg96@gmail.com> wrote:

Maybe I should fork a version of Ruby tailored specifically to Windows.
I'll replace all char pointer declarations with tchar pointers, set
MBCS, and automatically convert all strings to wide strings using
MultiByteToWideChar() behind the scenes, using whatever code page they
want, defaulting to CP_UTF8.

Right after I get my VC funding. :wink:

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

I'm confused - I thought we were talking about Ruby! :wink:

Paul.

···

On 28/06/06, Dmitrii Dimandt <dmitriid@gmail.com> wrote:

s = new String #=> unicode string
sj = new String(:encoding => 'jis')
scp = new String(:encoding => 'CP1251')
sb = new String(:binary => true) #=> work as ByteArray
sbf = new String(:encoding => 'funny encoding', :binary => true) #=> work as ByteArray

IIRC, any "unknown" encoding will be treated as a binary string where
you're responsible for dealing with "characters".

I have suggested to Matz that we adopt the u"string" format so that we
have a literal constructor for Unicode strings (which is by far the
more common need).

-austin

···

On 6/28/06, Dmitrii Dimandt <dmitriid@gmail.com> wrote:

I wonder, what is wrong with making all strings Unicode by default
(this will ensure that most libraries don't automagically break once
Ruby is upgraded), _but_ let developers optionally decide whether they
want a different encoding:

s = new String #=> unicode string
sj = new String(:encoding => 'jis')
scp = new String(:encoding => 'CP1251')
sb = new String(:binary => true) #=> work as ByteArray
sbf = new String(:encoding => 'funny encoding', :binary => true) #=> work as ByteArray

Austin Ziegler wrote:

···

On 6/27/06, Daniel Berger <djberg96@gmail.com> wrote:

Maybe I should fork a version of Ruby tailored specifically to Windows.
I'll replace all char pointer declarations with tchar pointers, set
MBCS, and automatically convert all strings to wide strings using
MultiByteToWideChar() behind the scenes, using whatever code page they
want, defaulting to CP_UTF8.

Right after I get my VC funding. :wink:

Trust me. You don't want to do that. TCHAR with -DUNICODE is pure evil.

-austin

Well, it would be -DMBCS. :wink:

I cannot even begin to imagine what changes would be required for the regex engine.

- Dan

:slight_smile: Sorry. I currently have to work with C++, Ruby and PHP
simultaneously. It shows.

I've come across a similar discussion on Unicode for Erlang. This
post in particular sums up some of the problems with Unicode:
    http://article.gmane.org/gmane.comp.lang.erlang.general/16021

In particular, there is nice info on how SWI Prolog handles Unicode
and what problems they have.

···

On 28/06/06, Dmitrii Dimandt <dmitriid@gmail.com> wrote:
> s = new String #=> unicode string
> sj = new String(:encoding => 'jis')
> scp = new String(:encoding => 'CP1251')
> sb = new String(:binary => true) #=> work as ByteArray
> sbf = new String(:encoding => 'funny encoding', :binary => true) #=> work as ByteArray

I'm confused - I thought we were talking about Ruby! :wink:

Hi,

···

In message "Re: Unicode roadmap?" on Wed, 28 Jun 2006 23:11:39 +0900, "Austin Ziegler" <halostatue@gmail.com> writes:

I have suggested to Matz that we adopt the u"string" format so that we
have a literal constructor for Unicode strings (which is by far the
more common need).

I am not sure how much this is more useful than usual string literals
plus unicode encoding pragma.

              matz.

In this regard I would love to see a user definable string quote
operator (see http://redhanded.hobix.com/inspect/userDefinedLiteralsInSydney.html).
Then we could do it ourselves (and many other devious things as well).

pth

···

On 6/28/06, Austin Ziegler <halostatue@gmail.com> wrote:

I have suggested to Matz that we adopt the u"string" format so that we
have a literal constructor for Unicode strings (which is by far the
more common need).

I'm not sure I like the encoding pragma, personally, since it's at the
file level. Consider this:

  raise "Not PNG." unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"

If I understand the encoding pragma correctly, both the "Not PNG" and
the matching string will be treated as Unicode, and the test string is
not valid Unicode.

Better, from my perspective:

  raise u"Not PNG." unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"

That way, I *mark* the strings for which I want Unicode format. The
encoding pragma makes it hard to do mixed content files.

(This example, by the way, is *specifically* artificial, but the code
involved is real. It's image matching code with error messages if
there's a mismatch.)
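
As a point of reference, later Ruby (2.0 and newer) lets both sides of this comparison be marked as binary explicitly, whatever the file's source encoding. This is a sketch under that assumption. `png?` and `PNG_SIGNATURE` are names made up for the example, not part of the code being discussed.

```ruby
# The 8-byte PNG signature, tagged as binary (ASCII-8BIT) with String#b.
PNG_SIGNATURE = "\x89PNG\x0d\x0a\x1a\x0a".b.freeze

# Hypothetical helper: true if the data starts with the PNG signature.
# byteslice takes the first 8 bytes regardless of the string's encoding.
def png?(data)
  data.byteslice(0, 8).b == PNG_SIGNATURE
end

header = PNG_SIGNATURE + "\x00\x00\x00\x0dIHDR".b
puts png?(header)
puts png?("GIF89a".b)
```

Tagging both operands as binary matters because string equality in these later versions also takes encoding compatibility into account when the bytes are not plain ASCII.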

-austin

···

On 6/28/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:

I have suggested to Matz that we adopt the u"string" format so that
we have a literal constructor for Unicode strings (which is by far
the more common need).

I am not sure how much this is more useful than usual string literals
plus unicode encoding pragma.

Austin, I don't understand why my strings are more "special" than yours and thus need subclassing, special encoding
or a special literal before them. This is by far the worst thing I fear - that multibyte strings are handled as being "special".
They are not special, they are the default.

With the pragma there is one thing which makes me wonder - will that mean that the libraries will have to check for the pragma
to do their job correctly?

···

On 28-jun-2006, at 19:12, Yukihiro Matsumoto wrote:

Hi,

In message "Re: Unicode roadmap?" on Wed, 28 Jun 2006 23:11:39 +0900, "Austin Ziegler" <halostatue@gmail.com> writes:

>I have suggested to Matz that we adopt the u"string" format so that we
>have a literal constructor for Unicode strings (which is by far the
>more common need).

I am not sure how much this is more useful than usual string literals
plus unicode encoding pragma.

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

Please no. Please please no.

What about:

raise "Not PNG." unless @top.bytes[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"

···

On 28-jun-2006, at 19:33, Austin Ziegler wrote:

Better, from my perspective:

raise u"Not PNG." unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"

That way, I *mark* the strings for which I want Unicode format. The
encoding pragma makes it hard to do mixed content files.

(This example, by the way, is *specifically* artificial, but the code
involved is real. It's image matching code with error messages if
there's a mismatch.)

I have suggested to Matz that we adopt the u"string" format so that
we have a literal constructor for Unicode strings (which is by far
the more common need).

I am not sure how much this is more useful than usual string literals
plus unicode encoding pragma.

Austin, I don't understand why my strings are more "special" than
yours and thus need subclassing,

Excuse me? I'm not the one who has been advocating separate classes. I
have, however, suggested that Unicode strings are going to be common
enough that rather than saying String.new("unicode-string", encoding:
:utf8) we have u"unicode-string". I *already* know that binary string
literals are quite common in Ruby. I use them a lot.

And I mix them with strings that would logically be represented in
Unicode.

If, however, you would prefer doing it the other way, we could go:

  raise "UnicodeString" unless a[0, 8] == b"binarystring"

Either way, I don't care. But neither needs to be a separate class. But
they should be mixable, and the pragma wouldn't make things easily
mixable.

special encoding or a special literal before them. This is by far the
worst thing I fear - that multibyte strings are handled as being
"special". They are not special, they are the default.

No, actually, they're *possibly* the default. More to the point,
*UNICODE* strings are not even necessarily going to be the default with
multibyte strings. So *even so* I would suggest that this might be
useful:

  raise u"UnicodeString" unless a[0, 8] == b"binarystring"
  "other-multibyte-string-according-to-pragma"

With the pragma there is one thing which makes me wonder - will that
mean that the libraries will have to check for the pragma to do their
job correctly?

I think the pragma is going to be a problem for mixed-content strings.

-austin

···

On 6/28/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:

On 28-jun-2006, at 19:12, Yukihiro Matsumoto wrote:

In message "Re: Unicode roadmap?" on Wed, 28 Jun 2006 23:11:39 +0900, "Austin Ziegler" <halostatue@gmail.com> writes:

Hi,

···

In message "Re: Unicode roadmap?" on Thu, 29 Jun 2006 02:33:36 +0900, "Austin Ziegler" <halostatue@gmail.com> writes:

I'm not sure I like the encoding pragma, personally, since it's at the
file level. Consider this:

raise "Not PNG." unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"

If I understand the encoding pragma correctly, both the "Not PNG" and
the matching string will be treated as Unicode, and the test string is
not valid Unicode.

Better, from my perspective:

raise u"Not PNG." unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"

That way, I *mark* the strings for which I want Unicode format. The
encoding pragma makes it hard to do mixed content files.

I'd rather see r"\x89PNG\x0d\x0a\x1a\x0a" (or b"..."), since I expect
binary strings less often. It also removes unnecessary Unicode
expectation from users.

              matz.

Austin Ziegler wrote:

Better, from my perspective:

raise u"Not PNG." unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"

It would make more sense if it worked exactly like regexes:
   $KCODE = 'u'
   raise "Not PNG." unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"n
or
   $KCODE = 'n'
   raise "Not PNG."u unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"

Can I read the specs for ruby2 somewhere? It would be better than speculating about how the m17n strings might be implemented. I took a look at the 1.9 docs on ruby-doc.org but there is no 'encoding' accessor. It's all the same methods as the docs for ruby 1.8.4, although there are a bunch of methods which are *not* available in my install of 1.8.4: ["iseuc", "issjis", "isutf8", "kconv", "new", "scn", "toeuc", "tojis", "tosjis", "toutf16", "toutf8"]

Oh, here's another thought. How is *that* supposed to behave?
   str.encoding = :sjis
   str.split(//u)

Daniel

Except that @top is guaranteed to not have an encoding -- at least it
damned well better not -- and @top.bytes is redundant in this case. I
see no reason to access #bytes unless I know I'm dealing with a
multibyte String. Worse, why would "Not PNG." be treated as Unicode
under your scheme but "\x89PNG\x0d\x0a\x1a\x0a" not be? I don't think
you're thinking this through.

@top[0, 8] is sufficient when you can guarantee that sizeof(char) ==
sizeof(byte). On "raw" strings, this is always the case. On all
strings, @top[0, 8] would return the appropriate number of characters
-- not the number of bytes. It just so happens on binary strings that
the number of characters and bytes is exactly the same.

What I'm arguing is that while the pragma may work for the less-common
encodings, both binary (non-)encoding and Unicode (probably UTF-8) are
going to be common enough that specific literal constructors are
probably a very good idea.

-austin

···

On 6/28/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:

On 28-jun-2006, at 19:33, Austin Ziegler wrote:
> Better, from my perspective:
> raise u"Not PNG." unless @top[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"
>
> That way, I *mark* the strings for which I want Unicode format. The
> encoding pragma makes it hard to do mixed content files.
>
> (This example, by the way, is *specifically* artificial, but the code
> involved is real. It's image matching code with error messages if
> there's a mismatch.)

Please no. Please please no.

What about:

raise "Not PNG." unless @top.bytes[0, 8] == "\x89PNG\x0d\x0a\x1a\x0a"
