All this talk about Unicode support and HTML parsing got me wondering how to parse Japanese text. There are no spaces to separate words, and though there are some modifiers, or particles, in the Japanese language, they are sometimes used inconsistently. I could quote examples, but if you can't read Kanji, Hiragana, and Katakana they would most likely be meaningless.
So, knowing what little I do of Japanese (I've been studying for a while and have lived in Japan for close to four years), I was wondering how search engines like Google and Yahoo parse Japanese text, let alone Japanese web pages. There are numerous filters to extract text from web pages, but parsing Japanese text is another matter altogether.
So, I have found one Open Source project which seems to be addressing this, but I was wondering if there is a solution for Ruby?
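To make the problem concrete, here is a rough sketch of the most naive thing I can think of: splitting a string wherever the script changes (Kanji, Hiragana, Katakana, or anything else). The names SCRIPT_RUN and script_runs are my own inventions, the code point ranges are only the basic Unicode blocks, and this is definitely not real word segmentation. It is only meant to show why something smarter, such as a dictionary, seems to be needed.

```ruby
# Naive first pass: split text into runs of the same script.
# Assumption: only the basic Unicode blocks are covered.
#   Kanji    (CJK Unified Ideographs): U+4E00..U+9FFF
#   Hiragana:                          U+3040..U+309F
#   Katakana (includes the long-vowel mark): U+30A0..U+30FF
SCRIPT_RUN = /
  [\u4E00-\u9FFF]+ |              # a run of Kanji
  [\u3040-\u309F]+ |              # a run of Hiragana
  [\u30A0-\u30FF]+ |              # a run of Katakana
  [^\u3040-\u30FF\u4E00-\u9FFF]+  # a run of anything else
/x

# Returns the script runs of +text+ in order (hypothetical helper).
def script_runs(text)
  text.scan(SCRIPT_RUN)
end

puts script_runs("私はコーヒーを飲みます").inspect
```

On that sentence ("I drink coffee") the verb 飲みます comes out as two pieces, 飲 and みます, because it mixes Kanji and Hiragana. So script boundaries alone are not word boundaries, which is exactly my problem.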
Now for the trivia... I've been reading some Japanese text: "Hiragana Times", a magazine which prints its articles in Japanese and English as a learning tool; my newspaper, "The Japan Times", which has a weekly section devoted to bilingual education; and my class textbooks. I've read some Manga as well. They generally present the Kanji with tiny Hiragana characters either above or beside them, which are the phonetic equivalent of the Kanji.
Guess what these tiny Hiragana helpers are called... you guessed it: "Ruby Annotation". Check out what I found at the W3C: http://www.w3.org/TR/ruby/
Coincidence?
Mike
--
Mobile: +81-80-3202-2599
Office: +81-3-3395-6055
"Any sufficiently advanced technology is indistinguishable from magic..."
- A. C. Clarke