Since there's been a lot of talk about Unicode lately, I thought I'd share a Ruby library I've been working on. It supports Unicode characters and strings, based on the Unicode 4.1.0 standard and key specifications from the Unicode Consortium.
ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2
The library adds an encoding property to native String objects, and allows conversion to and from Unicode::String and Unicode::Character. A default encoding is chosen based on $KCODE, or the default can be set/changed explicitly via String.default_encoding.
Unicode strings can be obtained by applying the + unary operator to native strings, e.g. +"Hello" (where the native string is encoded in the default encoding).
% irb -I. -runicode -Ku
irb(main):001:0> ustr = +"π is pi"
=> +"π is pi"
Native strings are obtained from Unicode strings by calling to_s, which accepts an optional argument to indicate the desired encoding.
irb(main):002:0> str = ustr.to_s
=> "π is pi"
irb(main):003:0> str.encoding
=> Unicode::Encoding::UTF8
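For comparison only, and without using this library at all: Ruby 1.9 and later have a built-in String#encode that does the same kind of re-encoding. A minimal sketch, assuming such a Ruby; the variable names are just for illustration, and none of this is the library's to_s(encoding) API.

```ruby
# Built-in Ruby (1.9+) only; NOT the Unicode library's to_s(encoding) API.
str = "\u03C0 is pi"               # UTF-8 string: GREEK SMALL LETTER PI plus ASCII text

utf16 = str.encode("UTF-16BE")     # same characters, re-encoded as big-endian UTF-16

utf8_pi_bytes  = str[0].bytes      # bytes of the first character in UTF-8
utf16_pi_bytes = utf16[0].bytes    # bytes of the first character in UTF-16BE

p utf8_pi_bytes
p utf16_pi_bytes
p utf16.encode("UTF-8") == str     # converting back gives the original string
```

The characters stay the same across the conversion; only the byte representation changes.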
Indexing a Unicode string returns an individual character as a Unicode::Character object.
irb(main):004:0> ustr[0]
=> U+03C0 GREEK SMALL LETTER PI
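To illustrate the difference between indexing bytes and indexing code points without the library, the built-in String#unpack with the "U*" template decodes a UTF-8 native string into integers. This is only a sketch: how Unicode::String stores its characters is not shown here, and the u_plus helper is a hypothetical name made up for this example.

```ruby
# Hypothetical helper, not part of the library: format a code point as U+XXXX.
def u_plus(code_point)
  format("U+%04X", code_point)
end

str = "\u03C0 is pi"               # UTF-8 native string
code_points = str.unpack("U*")     # decode the UTF-8 bytes into an array of integers

puts u_plus(code_points[0])        # the first code point, in U+XXXX notation
puts str.bytes.length              # length counted in bytes
puts code_points.length            # length counted in code points
```

The byte count and the code point count differ because π takes two bytes in UTF-8.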
Case conversion uses the same methods as native strings (e.g. upcase), and applies to non-ASCII characters as well.
irb(main):005:0> ustr.upcase
=> +"Π IS PI"
Normalization is done with the ~ unary operator. In the example below, í is decomposed into i followed by a combining acute accent.
irb(main):006:0> ustr = +"mí"
=> +"mí"
irb(main):007:0> ustr.to_a
=> [U+006D LATIN SMALL LETTER M, U+00ED LATIN SMALL LETTER I WITH ACUTE]
irb(main):008:0> (~ustr).each_char { |ch| p ch }
U+006D LATIN SMALL LETTER M
U+0069 LATIN SMALL LETTER I
U+0301 COMBINING ACUTE ACCENT
=> +"mí"
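The output above is consistent with canonical decomposition (NFD). As a rough sketch of how a unary ~ can be wired up in Ruby, the following uses a hypothetical DemoString class and the built-in String#unicode_normalize (Ruby 2.2 and later). Both the class and the choice of NFD are assumptions made for illustration; this is not the library's implementation.

```ruby
# Hypothetical wrapper class for illustration; NOT the library's Unicode::String.
class DemoString
  def initialize(text)
    @text = text
  end

  # Unary ~ is an ordinary method named "~" in Ruby.
  # Assumption: normalize to NFD via the built-in String#unicode_normalize.
  def ~
    DemoString.new(@text.unicode_normalize(:nfd))
  end

  # Code points of the wrapped text, formatted as U+XXXX strings.
  def code_points
    @text.codepoints.map { |cp| format("U+%04X", cp) }
  end
end

ustr = DemoString.new("m\u00ED")   # m followed by precomposed i-acute
p ustr.code_points                 # code points before normalization
p((~ustr).code_points)             # code points after decomposition
```

Returning a new object from ~ leaves the original string untouched, which matches how the irb session above still shows the precomposed form for ustr.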
There is much more -- character properties, text boundaries (grapheme clusters and words), Hangul decompositions, modular encodings (ASCII, Latin1, EUC, SJIS, UTF32, UTF16, UTF8) -- yet the project is unfinished. If anyone is interested in helping develop it further, let me know.
The library incorporates the entire Unicode 4.1.0 Character Database (demand-loaded!), which is why the archive is rather large.
Cheers,
--
Rob Leslie
rob@mars.org