Ruby/Unicode library

Rob_Leslie · 18 June 2006 18:11

Since there's been a lot of talk about Unicode lately, I thought I'd throw out a Ruby library I've been working on to support Unicode characters and strings based on the 4.1.0 standard and key specifications from the Unicode Consortium.

ftp://ftp.mars.org/pub/ruby/Unicode.tar.bz2

The library adds an encoding property to native String objects, and allows conversion to and from Unicode::String and Unicode::Character. A default encoding is chosen based on $KCODE, or the default can be set/changed explicitly via String.default_encoding.

Unicode strings can be obtained by applying the + unary operator to native strings, e.g. +"Hello" (where the native string is encoded in the default encoding).

   % irb -I. -runicode -Ku
   irb(main):001:0> ustr = +"π is pi"
   => +"π is pi"

Native strings are obtained from Unicode strings by calling to_s, which accepts an optional argument to indicate the desired encoding.

   irb(main):002:0> str = ustr.to_s
   => "π is pi"
   irb(main):003:0> str.encoding
   => Unicode::Encoding::UTF8
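
For readers on Ruby 1.9 or later, the interpreter itself now attaches an encoding to every String, so a similar round trip can be tried without this library. A minimal sketch using only core String#encoding and String#encode; it does not use the library's Unicode::Encoding classes, and the variable names are illustrative:

```ruby
# "\u03C0" is GREEK SMALL LETTER PI; the \u escape yields a UTF-8 string.
str = "\u03C0 is pi"
source_encoding = str.encoding            # Encoding::UTF_8

# Transcode to UTF-16 (big endian): 7 characters at 2 bytes each.
utf16 = str.encode(Encoding::UTF_16BE)
utf16_bytes = utf16.bytesize

# Transcoding back restores the original text.
round_trip = utf16.encode(Encoding::UTF_8)
```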

Individual characters can be indexed from Unicode strings, returning a Unicode::Character object.

   irb(main):004:0> ustr[0]
   => U+03C0 GREEK SMALL LETTER PI
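
As a rough point of comparison, core Ruby (1.9 or later) indexes strings by character as well, but returns a one-character String rather than a character object. A small sketch, assuming a UTF-8 string; the U+XXXX label is built by hand here, since core Ruby has no character-name lookup:

```ruby
ustr = "\u03C0 is pi"

first = ustr[0]                          # one-character String, not a Character object
code_point = first.ord                   # Integer code point
label = format("U+%04X", code_point)     # hand-built label in U+XXXX notation
utf8_length = first.bytesize             # pi takes two bytes in UTF-8
```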

Case conversion is handled as with native strings.

   irb(main):005:0> ustr.upcase
   => +"Π IS PI"
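
The same conversion can be checked against core Ruby 2.4 or later, where String#upcase applies Unicode case mapping by default and accepts :ascii to restrict it. This is a sketch of the built-in behaviour only, not of how the library implements case conversion:

```ruby
ustr = "\u03C0 is pi"

# Full Unicode case mapping: small pi (U+03C0) becomes capital pi (U+03A0).
upper = ustr.upcase

# ASCII-only mapping leaves the Greek letter untouched.
ascii_upper = ustr.upcase(:ascii)
```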

Normalization is accomplished with the ~ unary operator.

   irb(main):006:0> ustr = +"mí"
   => +"mí"
   irb(main):007:0> ustr.to_a
   => [U+006D LATIN SMALL LETTER M, U+00ED LATIN SMALL LETTER I WITH ACUTE]
   irb(main):008:0> (~ustr).each_char { |ch| p ch }
   U+006D LATIN SMALL LETTER M
   U+0069 LATIN SMALL LETTER I
   U+0301 COMBINING ACUTE ACCENT
   => +"mí"
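
The output above looks like canonical decomposition (NFD), though the post does not say which normalization form ~ applies, so treat that as an assumption. A minimal sketch of the same idea with String#unicode_normalize from Ruby 2.2 or later. The UText wrapper is hypothetical and only shows how a unary ~ can be defined; it is not the library's Unicode::String:

```ruby
# Hypothetical wrapper, used only to illustrate a unary ~ operator.
class UText
  attr_reader :text

  def initialize(text)
    @text = text.encode(Encoding::UTF_8)
  end

  # Unary ~ returns a new wrapper holding the NFD (decomposed) form.
  def ~@
    UText.new(@text.unicode_normalize(:nfd))
  end

  def codepoints
    @text.codepoints
  end
end

composed = UText.new("m\u00ED")     # m + LATIN SMALL LETTER I WITH ACUTE
decomposed = ~composed              # m + i + COMBINING ACUTE ACCENT
```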

There is much more -- character properties, text boundaries (grapheme clusters and words), Hangul decompositions, modular encodings (ASCII, Latin1, EUC, SJIS, UTF32, UTF16, UTF8) -- yet the project is unfinished. If anyone is interested in helping develop it further, let me know.
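
Two of the features listed above can be illustrated without the library. Hangul syllable decomposition is purely arithmetic in the Unicode Standard, and Ruby 2.5 or later exposes grapheme clusters through String#grapheme_clusters. A sketch under those assumptions; HangulSketch is a hypothetical name and says nothing about how the library organizes this:

```ruby
# Hypothetical module: arithmetic Hangul syllable decomposition,
# using the constants defined by the Unicode Standard.
module HangulSketch
  S_BASE  = 0xAC00   # first precomposed syllable
  L_BASE  = 0x1100   # first leading consonant
  V_BASE  = 0x1161   # first vowel
  T_BASE  = 0x11A7   # one before the first trailing consonant
  T_COUNT = 28
  N_COUNT = 21 * T_COUNT   # 588 syllables per leading consonant
  S_COUNT = 19 * N_COUNT   # 11172 precomposed syllables

  # Returns the jamo code points for a precomposed syllable,
  # or the code point unchanged if it is not one.
  def self.decompose(code_point)
    s_index = code_point - S_BASE
    return [code_point] unless (0...S_COUNT).cover?(s_index)

    l = L_BASE + s_index / N_COUNT
    v = V_BASE + (s_index % N_COUNT) / T_COUNT
    t = T_BASE + s_index % T_COUNT
    t == T_BASE ? [l, v] : [l, v, t]
  end
end

jamo = HangulSketch.decompose(0xD55C)            # syllable HAN
clusters = "mi\u0301".grapheme_clusters          # "m", then "i" + combining acute
```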

The library incorporates the entire Unicode 4.1.0 Character Database (demand-loaded!), which is why the archive is rather large.
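
The post does not describe how demand-loading is done, so the following is only one possible approach: parse a data file the first time a lookup needs it and cache the result. LazyCharacterTable and SAMPLE_DATA are hypothetical names, and the sample lines merely follow UnicodeData.txt's semicolon-separated layout:

```ruby
require "stringio"

# Hypothetical table that defers parsing until the first lookup.
class LazyCharacterTable
  attr_reader :loads   # how many times the data was parsed

  # The block must return an IO-like object with the raw data.
  def initialize(&opener)
    @opener = opener
    @names = nil
    @loads = 0
  end

  def name(code_point)
    names[code_point]
  end

  private

  # Parses the data on first use and memoizes the resulting Hash.
  def names
    @names ||= begin
      @loads += 1
      @opener.call.each_line.each_with_object({}) do |line, table|
        code, name = line.chomp.split(";", 3)
        table[code.to_i(16)] = name
      end
    end
  end
end

SAMPLE_DATA = <<~DATA
  006D;LATIN SMALL LETTER M;Ll;0;L;;;;;N;;;004D;;004D
  03C0;GREEK SMALL LETTER PI;Ll;0;L;;;;;N;;;03A0;;03A0
DATA

table = LazyCharacterTable.new { StringIO.new(SAMPLE_DATA) }
loads_before = table.loads          # nothing parsed yet
pi_name = table.name(0x03C0)        # triggers the one-time parse
```

A real file would be opened in the block instead of a StringIO, so a table that is never consulted is never read.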

Cheers,

--
Rob Leslie
rob@mars.org

Julian_Julik_Tarkhan · 18 June 2006 18:51

Holy wow. But the tables are just _huge_.

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl

Rob_Leslie · 18 June 2006 19:12

I should point out that I'm not presently using most of these tables; Unihan.txt alone is 27 MB. They're included purely for completeness while I develop the library.

No doubt the actual data storage requirements can be reduced considerably.

--
Rob Leslie
rob@mars.org

Paul_Battley · 18 June 2006 21:33

That's an impressive achievement. It looks like a textbook
implementation. Thanks for sharing!

Coincidentally, I just dug up my own dormant UnicodeData.txt-based
effort - nowhere near as developed as yours - and hacked a bit on it
today, trying out some storage-reduction ideas. I'm looking forward to
trying things with your library.

Paul.
