Hi everyone.
I'm implementing yet another Unicode string hack. I'm trying to rewire the String class so that it acts like the Ruby 2.0 String class. (see http://redhanded.hobix.com/inspect/futurismUnicodeInRuby.html)
String literals will act as byte buffers, just as they always have. However, when creating a string object with the constructor, you can optionally specify the encoding of the input string:
String.new("\352\260\200", "utf-8")
The default encoding is nil if $KCODE is not set or is set to "none". The default encoding is 'utf-8' if $KCODE == 'u'. If the encoding is nil, string objects will act just like the old Ruby strings we all know and love. If the encoding is set to a specific charset, the string's instance methods will act more reasonably according to that encoding. The following is a summary of what I'm thinking:
String#encoding gives the character encoding name (e.g. "utf-8").
String#[index] returns a character string if the encoding is set. If the encoding is not set, it returns a fixnum as it used to.
String#[] is always encoding aware if the encoding is set.
String#slice is always a byte buffer operation, regardless of the encoding.
String#size always returns the number of bytes in the string.
String#length returns the number of characters in the string according to the specified encoding. If the encoding is not set, it is the same as String#size.
String#+ will return a utf-8 encoded string if the two strings' encodings do not match.
*, <<, <=>, ==, =~, capitalize, casecmp, center, chomp, chop, count, delete, downcase, each, each_line, eql?, gsub, match, succ, scan, split, strip, sub, upcase, upto will all be encoding aware if the encoding is set.
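To make the summary above a bit more concrete, here is a rough sketch of the intended semantics for size, length, [] and slice. It uses a hypothetical wrapper class (EncodedString is a name made up for illustration) rather than the actual changes to String, handles only "utf-8", and counts characters with unpack("U*"), which is just one possible way to do it:

```ruby
# Hypothetical wrapper illustrating the proposed semantics.
# This is NOT the actual String patch, only a sketch of the behaviour.
class EncodedString
  attr_reader :encoding

  # bytes:    the raw byte string
  # encoding: nil (plain byte buffer) or "utf-8"
  def initialize(bytes, encoding = nil)
    @bytes = bytes
    @encoding = encoding
  end

  # Always the number of bytes, regardless of the encoding.
  def size
    @bytes.unpack("C*").size
  end

  # Number of characters if the encoding is set, otherwise same as size.
  def length
    @encoding == "utf-8" ? @bytes.unpack("U*").size : size
  end

  # Character string at index if the encoding is set,
  # byte value (an integer) otherwise.
  def [](index)
    if @encoding == "utf-8"
      codepoint = @bytes.unpack("U*")[index]
      codepoint && [codepoint].pack("U")
    else
      @bytes.unpack("C*")[index]
    end
  end

  # Always a byte buffer operation, regardless of the encoding.
  def slice(start, len)
    bytes = @bytes.unpack("C*")[start, len]
    bytes && bytes.pack("C*")
  end
end

s = EncodedString.new("\352\260\200a", "utf-8")
s.size     # 4 bytes
s.length   # 2 characters
s[0]       # the three-byte character "\352\260\200"

raw = EncodedString.new("\352\260\200a")
raw.length # 4, same as size because no encoding is set
raw[0]     # 234, the first byte
```

With this split, code that needs a byte count keeps calling size, and code that cares about characters calls length.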
The reason I'm differentiating between 'size' and 'length' is that some libraries (like Rails) depend on them returning the byte size of the string. Maybe we can establish a convention that 'size' is the byte size and 'length' is the number of characters. The same reasoning goes for '[]' and 'slice'.
For now, it will support only the utf-8 encoding, as Ruby's regexp doesn't seem to support encodings other than ascii and utf-8. (I could use iconv to convert the string internally to utf-8 on each method call, but at the moment I think that is probably too costly and not worth it.)
I would love to get some feedback on this. Matz's feedback would be especially welcome, since I want to make this as forward compatible as possible with Ruby 2.0.
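As a rough illustration of leaning on the regexp engine for utf-8, the sketch below splits a byte string into characters using the /u flag. utf8_chars is a hypothetical helper name, and the force_encoding guard is only there so the same snippet also runs on newer Ruby versions. Neither is part of the actual implementation.

```ruby
# Hypothetical helper: split a utf-8 byte string into characters
# by letting a /u regexp do the character boundary detection.
def utf8_chars(bytes)
  # Newer Rubies tag strings with an encoding; older ones skip this line.
  bytes = bytes.dup.force_encoding("UTF-8") if bytes.respond_to?(:force_encoding)
  bytes.scan(/./mu)
end

chars = utf8_chars("\352\260\200a")
chars.size # 2: one three-byte character followed by "a"
```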
Thanks!
Daesan
Dae San Hwang
daesan@gmail.com