Ruby/Unicode library

Since there's been a lot of talk about Unicode lately, I thought I'd throw out a Ruby library I've been working on to support Unicode characters and strings based on the 4.1.0 standard and key specifications from the Unicode Consortium.

   ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2

The library adds an encoding property to native String objects, and allows conversion to and from Unicode::String and Unicode::Character. A default encoding is chosen based on $KCODE, or the default can be set/changed explicitly via String.default_encoding.

Unicode strings can be obtained by applying the unary + operator to native strings, e.g. +"Hello" (the native string is assumed to be in the default encoding).

   % irb -I. -runicode -Ku
   irb(main):001:0> ustr = +"π is pi"
   => +"π is pi"

Native strings are obtained from Unicode strings by calling to_s, which accepts an optional argument to indicate the desired encoding.

   irb(main):002:0> str = ustr.to_s
   => "π is pi"
   irb(main):003:0> str.encoding
   => Unicode::Encoding::UTF8
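
For comparison, Ruby 1.9 and later attach an encoding to every String natively, so a similar round trip can be approximated with core methods alone. The following is only a sketch using String#encoding and String#encode from core Ruby; it does not use or reproduce the library's own to_s, and the variable names are made up for illustration.

```ruby
# Sketch with core Ruby (1.9+), not the Unicode library from this post.
str = "\u03C0 is pi"                          # "π is pi"; \u escapes yield UTF-8

utf8_bytes = str.bytesize                     # π takes 2 bytes in UTF-8
utf16      = str.encode(Encoding::UTF_16BE)   # transcode to big-endian UTF-16
round_trip = utf16.encode(Encoding::UTF_8)    # and back again

puts str.encoding        # UTF-8
puts utf8_bytes          # 8
puts utf16.bytesize      # 14
puts round_trip == str   # true
```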

Individual characters can be indexed from Unicode strings, returning a Unicode::Character object.

   irb(main):004:0> ustr[0]
   => U+03C0 GREEK SMALL LETTER PI
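
As a rough point of comparison, core Ruby can index a character and report its code point, though it has no character-name data built in. This sketch assumes a UTF-8 string and uses only String#[], String#ord and Kernel#format; it is not the library's Unicode::Character.

```ruby
# Sketch with core Ruby; character names would need extra data.
str = "\u03C0 is pi"

ch        = str[0]                        # first character, itself a String
codepoint = ch.ord                        # Integer code point
label     = format("U+%04X", codepoint)   # conventional U+XXXX notation

puts label   # U+03C0
```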

Case conversion is handled as with native strings.

   irb(main):005:0> ustr.upcase
   => +"Π IS PI"
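
For reference, String#upcase in Ruby 2.4 and later also applies Unicode case mapping, so the same result can be checked without the library. This is a sketch with core Ruby only; the German example is an added illustration, not something taken from the post.

```ruby
# Sketch with core Ruby 2.4+, where case mapping covers non-ASCII letters.
greek  = "\u03C0 is pi".upcase    # π maps to Π (U+03A0)
german = "stra\u00DFe".upcase     # ß expands to "SS" under full case mapping

puts greek    # Π IS PI
puts german   # STRASSE
```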

Normalization is accomplished with the unary ~ operator.

   irb(main):006:0> ustr = +"mí"
   => +"mí"
   irb(main):007:0> ustr.to_a
   => [U+006D LATIN SMALL LETTER M, U+00ED LATIN SMALL LETTER I WITH ACUTE]
   irb(main):008:0> (~ustr).each_char { |ch| p ch }
   U+006D LATIN SMALL LETTER M
   U+0069 LATIN SMALL LETTER I
   U+0301 COMBINING ACUTE ACCENT
   => +"mí"
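
Judging from the output, ~ appears to perform a canonical decomposition (NFD), although the post does not say which normalization form it uses. Under that assumption, the same result can be reproduced with String#unicode_normalize from core Ruby (2.2 and later). This is a sketch, not the library's implementation.

```ruby
# Sketch with core Ruby 2.2+; assumes the ~ operator corresponds to NFD.
composed   = "m\u00ED"                          # m + precomposed í
decomposed = composed.unicode_normalize(:nfd)   # canonical decomposition

decomposed.each_char { |ch| puts format("U+%04X", ch.ord) }
# U+006D
# U+0069
# U+0301

recomposed = decomposed.unicode_normalize(:nfc) # compose back to U+00ED
puts recomposed == composed                     # true
```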

There is much more -- character properties, text boundaries (grapheme clusters and words), Hangul decompositions, modular encodings (ASCII, Latin1, EUC, SJIS, UTF32, UTF16, UTF8) -- yet the project is unfinished. If anyone is interested in helping develop it further, let me know.
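
To make the Hangul item concrete: syllable decomposition needs no table, because the Unicode Standard defines it arithmetically. The sketch below follows the constants published in the standard; the method name decompose_hangul is made up here, and this is not the library's code.

```ruby
# Arithmetic Hangul syllable decomposition using the constants from the
# Unicode Standard. A sketch, not the library's implementation.
S_BASE  = 0xAC00              # first precomposed syllable
L_BASE  = 0x1100              # first leading consonant
V_BASE  = 0x1161              # first vowel
T_BASE  = 0x11A7              # one before the first trailing consonant
V_COUNT = 21
T_COUNT = 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 syllables in total

def decompose_hangul(codepoint)
  s_index = codepoint - S_BASE
  return [codepoint] unless (0...S_COUNT).cover?(s_index)  # not a syllable

  l = L_BASE + s_index / N_COUNT
  v = V_BASE + (s_index % N_COUNT) / T_COUNT
  t = T_BASE + s_index % T_COUNT
  t == T_BASE ? [l, v] : [l, v, t]   # omit the trailing jamo when absent
end

jamo = decompose_hangul(0xD55C)   # the syllable 한
puts jamo.map { |cp| format("U+%04X", cp) }.join(" ")
# U+1112 U+1161 U+11AB
```

The result can be cross-checked against "\uD55C".unicode_normalize(:nfd).codepoints in Ruby 2.2 or later.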

The library incorporates the entire Unicode 4.1.0 Character Database (demand-loaded!), which is why the archive is rather large.
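
The post does not describe how the demand loading works. One simple way to get that behavior, shown here purely as a sketch, is to split the data into blocks of 256 code points and load a block only when it is first touched. The LazyTable class and the fake loader below are hypothetical and are not taken from the library.

```ruby
# Demand-loading sketch: each 256-code-point block is loaded on first use.
# LazyTable and its loader are hypothetical, for illustration only.
class LazyTable
  attr_reader :loads   # number of blocks loaded so far

  def initialize(&loader)
    @loader = loader
    @loads  = 0
    @blocks = Hash.new do |hash, block_no|
      @loads += 1
      hash[block_no] = @loader.call(block_no)   # cache the loaded block
    end
  end

  def [](codepoint)
    @blocks[codepoint >> 8][codepoint & 0xFF]
  end
end

table = LazyTable.new do |block_no|
  # Fake data: a real loader would parse or unmarshal a per-block file here.
  Array.new(256) { |i| format("U+%04X", (block_no << 8) | i) }
end

table[0x03C0]
table[0x03C1]      # same block, so nothing new is loaded
puts table.loads   # 1
```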

Cheers,

--
Rob Leslie
rob@mars.org

Holy wow. But the tables are just _huge_.

On 18-jun-2006, at 20:11, Rob Leslie wrote:

> Since there's been a lot of talk about Unicode lately, I thought I'd throw out a Ruby library I've been working on to support Unicode characters and strings based on the 4.1.0 standard and key specifications from the Unicode Consortium.

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

I should point out that I'm not presently using most of these tables; Unihan.txt alone is 27M. They're included purely for completeness as I've been developing the library.

No doubt the actual data storage requirements can be reduced considerably.
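
One common way to shrink such tables, offered only as a sketch and not as what either poster actually did, is to collapse runs of consecutive code points that share a property value into ranges and look them up by binary search. The method names compress and lookup are made up for this example.

```ruby
# Collapse a per-code-point property list into sorted [first, last, value] runs.
def compress(pairs)
  pairs.sort_by(&:first).each_with_object([]) do |(cp, value), runs|
    last = runs.last
    if last && last[1] + 1 == cp && last[2] == value
      last[1] = cp               # extend the current run
    else
      runs << [cp, cp, value]    # start a new run
    end
  end
end

# Binary search for the run containing cp; returns nil when none does.
def lookup(runs, cp)
  run = runs.bsearch { |r| cp <= r[1] }   # first run ending at or after cp
  run[2] if run && run[0] <= cp
end

pairs = (0x41..0x5A).map { |cp| [cp, :Lu] } +   # A-Z
        (0x61..0x7A).map { |cp| [cp, :Ll] } +   # a-z
        (0x30..0x39).map { |cp| [cp, :Nd] }     # 0-9
runs = compress(pairs)

puts runs.length       # 3
p lookup(runs, 0x51)   # :Lu
p lookup(runs, 0x20)   # nil
```

Here 62 individual entries shrink to 3 runs; the saving on real data depends on how contiguous each property is.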

On Jun 18, 2006, at 11:51 AM, Julian 'Julik' Tarkhanov wrote:

> > Since there's been a lot of talk about Unicode lately, I thought I'd throw out a Ruby library I've been working on to support Unicode characters and strings based on the 4.1.0 standard and key specifications from the Unicode Consortium.
>
> Holy wow. But the tables are just _huge_.

--
Rob Leslie
rob@mars.org

That's an impressive achievement. It looks like a textbook
implementation. Thanks for sharing!

Coincidentally, I just dug up my own dormant UnicodeData.txt-based
effort - nowhere near as developed as yours - and hacked a bit on it
today, trying out some storage-reduction ideas. I'm looking forward to
trying things with your library.

Paul.

On 18/06/06, Rob Leslie <rob@mars.org> wrote:

> I should point out that I'm not presently using most of these tables;
> Unihan.txt alone is 27M. They're included purely for completeness as
> I've been developing the library.
>
> No doubt the actual data storage requirements can be reduced
> considerably.