A plan for another unicode string hack

Hi everyone.

I'm implementing yet another Unicode string hack. I'm trying to rewire the String class so that it acts like the Ruby 2.0 String class. (See http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html)

String literals will act as byte buffers, just as they used to. However, when creating a string object with the constructor, you can optionally specify the encoding of the input string.

  String.new("\352\260\200", "utf-8")

The default encoding is nil if $KCODE is not set or is set to "none", and 'utf-8' if $KCODE == 'u'. If the encoding is nil, string objects will act just like the old Ruby strings we all know and love. If the encoding is set to a specific charset, the string's instance methods will act more reasonably according to that encoding. The following is a summary of what I'm thinking:

  String#encoding gives character encoding name (e.g. "utf-8")
  String#[index] returns character string if encoding is set. If the encoding is not set, it returns fixnum as it used to.
  String#[] is always encoding aware if encoding is set.
  String#slice is always byte buffer operation regardless of the encoding.
  String#size always returns the number of bytes in the string.
  String#length returns the number of characters in the string according to the encoding specified. If the encoding is not set, it's same as String#size.
  String#+ will return a utf-8 encoded string if the two strings' encodings do not match.

  *, <<, <=>, ==, =~, capitalize, casecmp, center, chomp, chop, count, delete, downcase, each, each_line, eql?, gsub, match, succ, scan, split, strip, sub, upcase, upto will be all encoding aware if encoding is set.
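To make the proposed semantics concrete, here is a minimal sketch written against present-day Ruby rather than the 1.8 interpreter under discussion. `EncodedString` is a hypothetical wrapper class invented for illustration; it is not the hack itself, it covers only a handful of the methods listed above, and it leans on the built-in `b`, `force_encoding`, `bytesize` and `getbyte` to do the work.

```ruby
# Hypothetical wrapper illustrating the proposed rules; not the author's code.
class EncodedString
  attr_reader :encoding

  def initialize(bytes, encoding = nil)
    @bytes = bytes.b          # keep a binary copy of the raw bytes
    @encoding = encoding      # nil means "plain byte buffer"
  end

  # size always counts bytes.
  def size
    @bytes.bytesize
  end

  # length counts characters when an encoding is set, bytes otherwise.
  def length
    @encoding ? decoded.length : size
  end

  # [] returns a character string when an encoding is set,
  # and the integer byte value otherwise.
  def [](index)
    @encoding ? decoded[index] : @bytes.getbyte(index)
  end

  # slice is always a byte-buffer operation.
  def slice(index)
    @bytes.getbyte(index)
  end

  private

  def decoded
    @bytes.dup.force_encoding(@encoding)
  end
end

raw  = EncodedString.new("\352\260\200")
utf8 = EncodedString.new("\352\260\200", "utf-8")

p raw.length    # 3
p utf8.length   # 1
p utf8.size     # 3
p raw[0]        # 234
```

The three octal escapes are the UTF-8 bytes of a single character (U+AC00), which is why the tagged string reports one character and three bytes.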

The reason I'm differentiating between 'size' and 'length' is that some libraries (like Rails) depend on them returning the byte size of the string. Maybe we can establish a convention that 'size' means byte size and 'length' means the number of characters. The same reasoning goes for '[]' and 'slice'.

For now, it will support only utf-8 encoding as ruby's regexp doesn't seem to support encodings other than ascii and utf-8. (I could use iconv to convert encoding internally to utf-8 for each method call, but at the moment, I think it's probably too costly and not worth it.)

I would love to get some feedback on this. Matz's feedback would be especially welcome, since I want to make this as forward compatible as possible with Ruby 2.0.

Thanks!

Daesan

Dae San Hwang
daesan@gmail.com

This is a bad change.

#size and #length are synonymous now and should remain so. Add a new
method, like #character_count or something like that.

-austin

···

On 6/14/06, Dae San Hwang <daesan@gmail.com> wrote:

  String#size always returns the number of bytes in the string.
  String#length returns the number of characters in the string
according to the encoding specified. If the encoding is not set, it's
same as String#size.

--
Austin Ziegler * halostatue@gmail.com * http://www.halostatue.ca/
               * austin@halostatue.ca * You are in a maze of twisty little passages, all alike. // halo • statue
               * austin@zieglers.ca

I like these very much. Although the choice between [] and slice seems arbitrary (i.e. you could have swapped their meanings and it would have made just as much sense), #size vs. #length is perfect, and #[] being a Fixnum when there was no encoding but a character when there is one is equally brilliant. I salute you, sir!

···

On Jun 14, 2006, at 10:47 AM, Dae San Hwang wrote:

The reason I'm differentiating between 'size' and 'length' is that some libraries (like Rails) depend on them returning the byte size of the string. Maybe we can establish a convention that 'size' means byte size and 'length' means the number of characters. The same reasoning goes for '[]' and 'slice'.

For now, it will support only utf-8 encoding as ruby's regexp doesn't
seem to support encodings other than ascii and utf-8. (I could use
iconv to convert encoding internally to utf-8 for each method call,
but at the moment, I think it's probably too costly and not worth it.)

Regexp also supports EUC (which seems to work for EUC-KR as well as
EUC-JP, incidentally) and Shift_JIS. Nevertheless, I think that
starting with UTF-8 is the way to go.

I would love to get some feedback on this. Matz's feedback will be
especially great since I want to make this as much forward compatible
as possible with Ruby 2.0.

I think it's a great idea. If you want any implementation assistance,
I'd be glad to help (I've done quite a bit of Unicode hacking in
Ruby).

Paul.

···

On 14/06/06, Dae San Hwang <daesan@gmail.com> wrote:

Dae San Hwang wrote:

String literals will act as byte buffers, just as they used to. However, when creating string object by using constructor, you can optionally specify the encoding of the input string.

String.new("\352\260\200", "utf-8")

I'd like to have a different interface, using named parameters.

   String.new("\352\260\200", encoding: "utf-8")

or

   String.new("\352\260\200", :encoding => "utf-8")

That way it's easier to extend String later on.
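For comparison, current Ruby releases do accept an `encoding:` keyword on `String.new`. The sketch below assumes such an interpreter (it will not run on the 1.8-era Ruby this thread is about) and only shows that the keyword form tags the bytes without changing them.

```ruby
# The three octal escapes are the UTF-8 bytes of U+AC00.
bytes = "\352\260\200".b    # binary copy: every byte counts as one "character"

tagged = String.new(bytes, encoding: "utf-8")
same   = String.new(bytes, :encoding => "utf-8")   # hash-rocket spelling of the same keyword

p bytes.length      # 3
p tagged.length     # 1
p tagged.bytesize   # 3
p tagged == same    # true
```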

Cheers,
Daniel

Dae San Hwang wrote:

The reason I'm differentiating between 'size' and 'length' is because
some libraries (like rails) depend on them returning the byte size of
the string. Maybe we can establish a convention that 'size' is for byte size
and 'length' for the number of characters. Same reasoning goes for '[]'
and 'slice'.

Good idea. This separation of 'length' and 'size' methods is quite
reasonable, in my opinion.

#size and #length are synonymous now and should remain so. Add a new
method, like #character_count or something like that.

Say this to matz :-)

svg% cat b.rb
#!./ruby -ku
a = String.new("Peut-être qu'on n'était pas encore là ..", "utf-8")
p a.length
p a.size
svg%

svg% ./b.rb
39
42
svg%

old ruby_m17n implementation

Guy Decoux

Thanks for the kind words.

The reason I picked [] for the encoding-aware method is that String#[index] will be used to extract the letter and not the byte in Ruby 2.0, as mentioned in http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html

   so that "abc"[0] returns "a" instead of fixnum 97

A way to get the Nth byte of a byte buffer is probably still necessary, and String#slice seemed to be the logical one, I thought.

Dae San Hwang
daesan@gmail.com

···

On Jun 14, 2006, at 11:56 PM, Logan Capaldo wrote:

On Jun 14, 2006, at 10:47 AM, Dae San Hwang wrote:

The reason I'm differentiating between 'size' and 'length' is that some libraries (like Rails) depend on them returning the byte size of the string. Maybe we can establish a convention that 'size' means byte size and 'length' means the number of characters. The same reasoning goes for '[]' and 'slice'.

I like these very much. Although the choice between [] and slice seems arbitrary (i.e. you could have swapped their meanings and it would have made just as much sense), #size vs. #length is perfect, and #[] being a Fixnum when there was no encoding but a character when there is one is equally brilliant. I salute you, sir!

To the original poster - frankly, I don't see the point of doing this all over again. If you want to have Unicode handling
that way, just grab it from my plugin. It's just that when you have to work with external libraries, they
will not cooperate. I was using this String class in the wild for a few months, so trust me. It's not simply
because I "felt" like removing this functionality - it simply Broke A Lot Of Stuff In A Variety Of Subtle Ways.

Separation of "size" and "length" is senseless because they are aliases in Ruby. It would be sensible
to have "byte_" prefixed methods for byte access, just as I had in my hacks plugin a while ago. It worked too.
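As a point of comparison, later Ruby versions expose byte-level accessors whose names carry a `byte` marker, alongside character-based `length` and `[]`. The sketch below uses only built-in methods (`bytesize`, `getbyte`, `byteslice`, `bytes`) on a present-day interpreter; it illustrates the naming idea and is not the API of the plugin mentioned above.

```ruby
word = "l\u00E0"          # two characters, three UTF-8 bytes

p word.length             # 2
p word.bytesize           # 3
p word[1]                 # the second character
p word.getbyte(1)         # 195
p word.byteslice(1, 2)    # the same character, addressed by byte offset
p word.bytes              # [108, 195, 160]
```

Keeping `size` and `length` as aliases and giving byte access its own names avoids having to remember which of two near-synonyms counts which unit.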

What is this? Curiosity, or do you just want to delve into the dirty swamp of character handling for pure entertainment?

···

On 15-jun-2006, at 6:13, Suraj N. Kurapati wrote:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dae San Hwang wrote:

The reason I'm differentiating between 'size' and 'length' is because
some libraries (like rails) depend on them returning the byte size of
the string. Maybe we can establish a convention that 'size' is for byte size
and 'length' for the number of characters. Same reasoning goes for '[]'
and 'slice'.

Good idea. This separation of 'length' and 'size' methods is quite
reasonable, in my opinion.

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

I will. Matz, please see above. ;-)

The problem I have with this change is that I know that in my code I
have used #length and #size interchangeably depending on which reads
better in context.

It's not a good, clear, and understandable change. It will *forever*
require looking in ri or other resources to remember which one counts
characters and which one counts bytes.

-austin

···

On 6/14/06, ts <decoux@moulon.inra.fr> wrote:

> #size and #length are synonymous now and should remain so. Add a new
> method, like #character_count or something like that.

Say this to matz :-)


This behaviour - of returning different values depending on the
argument has always made me a bit crazy. Does anyone know why it was
done that way?

···

On 6/14/06, Dae San Hwang <daesan@gmail.com> wrote:

On Jun 14, 2006, at 11:56 PM, Logan Capaldo wrote:

>
> On Jun 14, 2006, at 10:47 AM, Dae San Hwang wrote:
>
>> The reason I'm differentiating between 'size' and 'length' is
>> because some libraries (like rails) depend on them returning the
>> byte size of the string. Maybe we can establish a convention that
>> 'size' is for byte size and 'length' for the number of characters.
>> Same reasoning goes for '[]' and 'slice'.
>
> I like these very much. Although the choice between [] and slice
> seems arbitrary (i.e. you could have swapped their meanings and it
> would have made just as much sense), #size vs. #length is perfect,
> and #[] being a Fixnum when there was no encoding but a character
> when there is one is equally brilliant. I salute you, sir!
>

Thanks for the kind words.

The reason I picked [] for the encoding-aware method is that String#[index] will be used to extract the letter and not the byte in Ruby 2.0, as mentioned in http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html

   so that "abc"[0] returns "a" instead of fixnum 97

It needs to be fixed for ruby 2.0 anyway. IO and some networking stuff
would need to be fixed to use byte_size I guess.

For me IO is sufficient for now.

Thanks

Michal

···

On 6/15/06, Julian 'Julik' Tarkhanov <listbox@julik.nl> wrote:

On 15-jun-2006, at 6:13, Suraj N. Kurapati wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Dae San Hwang wrote:
>> The reason I'm differentiating between 'size' and 'length' is because
>> some libraries (like rails) depend on them returning the byte size of
>> the string. Maybe we can establish a convention that 'size' is for
>> byte size and 'length' for the number of characters. Same reasoning
>> goes for '[]' and 'slice'.
>
> Good idea. This separation of 'length' and 'size' methods is quite
> reasonable, in my opinion.

To the original poster - frankly, I don't see the point of doing this
all over again. If you want to have Unicode handling that way, just
grab it from my plugin. It's just that when you have to work with
external libraries, they will not cooperate. I was using this String
class in the wild for a few months, so trust me. It's not simply
because I "felt" like removing this functionality - it simply Broke
A Lot Of Stuff In A Variety Of Subtle Ways.

To the original poster - frankly, I don't see the point of doing this all over again. If you want to have Unicode handling
that way, just grab it from my plugin. It's just that when you have to work with external libraries, they
will not cooperate. I was using this String class in the wild for a few months, so trust me. It's not simply
because I "felt" like removing this functionality - it simply Broke A Lot Of Stuff In A Variety Of Subtle Ways.

Hi Julian. I have tried your plugin in the past and I appreciate your efforts toward better Unicode support in Ruby. The reason I'm proposing a different hack is that people have been advising against the use of your Unicode hack due to its incompatibilities with other libraries. So I figured that we need a way of differentiating between a plain old string and a new hacked string with an explicit encoding. (I got the hint and inspiration from http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html) That way it can be backward compatible with existing libraries and yet be forward compatible with Ruby 2.0.

Separation of "size" and "length" is senseless because they are aliases in Ruby. It would be sensible
to have "byte_" prefixed methods for byte access, just as I had in my hacks plugin a while ago. It worked too.

My proposal for differentiating between 'size' and 'length' arose from a personal itch. I have always appreciated Ruby's intuitiveness, and I think 'size' is an intuitively better name for the byte size of a string while 'length' is better suited to the number of characters. I might be being compulsive here, but I think this kind of attention to detail is what has earned Ruby its reputation as a programmer-friendly language. Guy Decoux has pointed out that Matz once considered this change himself, and obviously many people on the forum welcome it. (An equal number of people voted against it as well, 7:7 at the moment.)

Some people have pointed out that 'size' doesn't give the byte size in other classes like Array or Hash, but what matters here is the context. 'size' meaning byte size in the context of a string object is pretty damn intuitive in my opinion. Ruby has used 'size' and 'length' to mean the same thing in the past, but I believe that decision was consciously made by Matz, thinking that people would prefer 'size' when using the string as a byte buffer and 'length' when using it as a character string. (I wouldn't know what Matz was thinking when he designed the String API, but that's my guess.) Regardless of my feelings on this issue, I will just follow what Matz decides for the Ruby 2.0 String API, as one of my goals here is to provide as much forward compatibility as possible.

Thanks to everyone who replied. I appreciate all your comments and will post back when I get something working.

Best regards,

Daesan

Dae San Hwang
daesan@gmail.com

···

On Jun 15, 2006, at 9:03 PM, Julian 'Julik' Tarkhanov wrote:

I've never been a fan of the Ruby practice of having many names for
the same thing, but I'm willing to be convinced. Can you give me an
example of two string variables where getting the number of characters
reads better with "length" for one and "size" for the other?

···

On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:

... in my code I
have used #length and #size interchangeably depending on which reads
better in context.

--
R. Mark Volkmann
Object Computing, Inc.

..returning different *type* values I mean..

···

On 6/14/06, Leslie Viljoen <leslieviljoen@gmail.com> wrote:

On 6/14/06, Dae San Hwang <daesan@gmail.com> wrote:
> On Jun 14, 2006, at 11:56 PM, Logan Capaldo wrote:
>
> >
> > On Jun 14, 2006, at 10:47 AM, Dae San Hwang wrote:
> >
> >> The reason I'm differentiating between 'size' and 'length' is
> >> because some libraries (like rails) depend on them returning the
> >> byte size of the string. Maybe we can establish a convention that
> >> 'size' is for byte size and 'length' for the number of characters.
> >> Same reasoning goes for '[]' and 'slice'.
> >
> > I like these very much. Although the choice between [] and slice
> > seems arbitrary (i.e. you could have swapped their meanings and it
> > would have made just as much sense), #size vs. #length is perfect,
> > and #[] being a Fixnum when there was no encoding but a character
> > when there is one is equally brilliant. I salute you, sir!
> >
>
> Thanks for the kind words.
>
> The reason I picked [] for the encoding-aware method is that
> String#[index] will be used to extract the letter and not the byte in
> Ruby 2.0, as mentioned in
> http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html
>
> so that "abc"[0] returns "a" instead of fixnum 97

This behaviour - of returning different values depending on the
argument has always made me a bit crazy. Does anyone know why it was
done that way?

To the original poster - frankly, I don't see the point of doing this all over again. If you want to have Unicode handling
that way, just grab it from my plugin. It's just that when you have to work with external libraries, they
will not cooperate. I was using this String class in the wild for a few months, so trust me. It's not simply
because I "felt" like removing this functionality - it simply Broke A Lot Of Stuff In A Variety Of Subtle Ways.

Hi Julian. I have tried your plugin in the past and I appreciate your efforts toward better Unicode support in Ruby. The reason I'm proposing a different hack is that people have been advising against the use of your Unicode hack due to its incompatibilities with other libraries. So I figured that we need a way of differentiating between a plain old string and a new hacked string with an explicit encoding. (I got the hint and inspiration from http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html) That way it can be backward compatible with existing libraries and yet be forward compatible with Ruby 2.0.

It will be interesting to see what you come up with, especially when you pass a "flagged" string to routines such as CGI.escape, which cannot tolerate a codepoint-based String#size.

Separation of "size" and "length" is senseless because they are aliases in Ruby. It would be sensible
to have "byte_" prefixed methods for byte access, just as I had in my hacks plugin a while ago. It worked too.

My proposal for differentiating between 'size' and 'length' arose from a personal itch. I have always appreciated Ruby's intuitiveness, and I think 'size' is an intuitively better name for the byte size of a string while 'length' is better suited to the number of characters. I might be being compulsive here, but I think this kind of attention to detail is what has earned Ruby its reputation as a programmer-friendly language. Guy Decoux has pointed out that Matz once considered this change himself, and obviously many people on the forum welcome it. (An equal number of people voted against it as well, 7:7 at the moment.)

I'm really eager to see if it works out for you. Please keep us posted.

···

On 15-jun-2006, at 16:26, Dae San Hwang wrote:

On Jun 15, 2006, at 9:03 PM, Julian 'Julik' Tarkhanov wrote:

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

It's all code context. "name.length" reads better than "name.size" and
"box.size" reads better than "box.length". Remember, in Ruby you
*don't* know whether you're dealing with a String, Array, or Hash (or
something else) when you're dealing with simple method calls.
Similarly, I will use #map most of the time, but sometimes I'll use
#collect.
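The duck-typing point can be sketched in a few lines: a method that only asks for `size` or `length` works on any object that responds to it, so the two names read as interchangeable. `describe` is a hypothetical helper made up for this example.

```ruby
# Hypothetical helper: reports a count without caring about the receiver's class.
def describe(thing)
  "#{thing.class}: size=#{thing.size}, length=#{thing.length}"
end

name = "Matz"
box  = [1, 2, 3]
opts = { :verbose => true }

[name, box, opts].each { |thing| puts describe(thing) }

# map and collect are likewise two names for the same operation.
p box.map { |n| n * 2 } == box.collect { |n| n * 2 }   # true
```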

In any case, these are well-established names and having them differ
would be problematic. That *said*, I'll have to fix stuff in Ruby 2
for PDF::Writer because I'm currently doing byte counting, not
character counting.

-austin

···

On 6/14/06, Mark Volkmann <r.mark.volkmann@gmail.com> wrote:

On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:
> ... in my code I
> have used #length and #size interchangeably depending on which reads
> better in context.
I've never been a fan of the Ruby practice of having many names for
the same thing, but I'm willing to be convinced. Can you give me an
example of two string variables where getting the number of characters
reads better with "length" for one and "size" for the other?


Leslie Viljoen wrote:

> so that "abc"[0] returns "a" instead of fixnum 97

This behaviour - of returning different values depending on the
argument has always made me a bit crazy. Does anyone know why it was
done that way?

..returning different *type* values I mean..

I've heard it's due to be fixed by end of next year.

Now, to Ruby's strings, a character is a byte, represented by a Fixnum.

The new Ruby character will be a string:

?c #=> "c"
"c"[0] #=> "c"
"c"[0].ord #=> 99
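These three lines can be checked directly on Ruby 1.9 or later, where this behaviour is in place. The sketch adds `getbyte`, which returns the integer that `[]` used to return; nothing here is specific to the proposed hack.

```ruby
first = "abc"[0]

p first              # "a"
p first.ord          # 97
p ?c                 # "c"
p "abc".getbyte(0)   # 97
```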

Cheers,
Dave

···

On 6/14/06, Leslie Viljoen <leslieviljoen@gmail.com> wrote:

On 6/14/06, Dae San Hwang <daesan@gmail.com> wrote:

> ... in my code I
> have used #length and #size interchangeably depending on which reads
> better in context.
I've never been a fan of the Ruby practice of having many names for
the same thing, but I'm willing to be convinced. Can you give me an
example of two string variables where getting the number of characters
reads better with "length" for one and "size" for the other?

It's all code context. "name.length" reads better than "name.size" and
"box.size" reads better than "box.length". Remember, in Ruby you
*don't* know whether you're dealing with a String, Array, or Hash (or
something else) when you're dealing with simple method calls.
Similarly, I will use #map most of the time, but sometimes I'll use
#collect.

Are you sure that "box" happened to be a variable for a string object? ;-)

In any case, these are well-established names and having them differ
would be problematic. That *said*, I'll have to fix stuff in Ruby 2
for PDF::Writer because I'm currently doing byte counting, not
character counting.

My proposed change won't disturb anyone's existing code unless that code also sets $KCODE to 'u'. If you did set $KCODE to 'u' in your previous projects, you don't have to apply this hack (which hasn't been implemented yet) to those projects.

Matz has said several times that he will maximize the breakage moving to Ruby 2.0. If Matz is going to make these changes for Ruby 2.0 (as implied in Guy Decoux's posting), I think I will just follow along. My goal is to provide Ruby 2.0 forward compatible Unicode support until the move is complete.

Dae San Hwang
daesan@gmail.com

···

On Jun 15, 2006, at 12:30 AM, Austin Ziegler wrote:

On 6/14/06, Mark Volkmann <r.mark.volkmann@gmail.com> wrote:

On 6/14/06, Austin Ziegler <halostatue@gmail.com> wrote:

Yahoo!

···

On 6/15/06, Dave Burt <dave@burt.id.au> wrote:

Leslie Viljoen wrote:
> On 6/14/06, Leslie Viljoen <leslieviljoen@gmail.com> wrote:
>> On 6/14/06, Dae San Hwang <daesan@gmail.com> wrote:
>> > so that "abc"[0] returns "a" instead of fixnum 97
>>
>> This behaviour - of returning different values depending on the
>> argument has always made me a bit crazy. Does anyone know why it was
>> done that way?
>
> ..returning different *type* values I mean..

I've heard it's due to be fixed by end of next year.

Now, to Ruby's strings, a character is a byte, represented by a Fixnum.

The new Ruby character will be a string:

?c #=> "c"
"c"[0] #=> "c"
"c"[0].ord #=> 99