I've done my research, and it appears that the current Ruby YAML
implementation doesn't really grok Unicode. What I want to know is
whether anything has changed in this regard, or is likely to in the
future. It doesn't appear that Syck has been updated since May '05,
but if it's something I could fix, I'd be willing to do it.
In any case, here's what I need to do: read in YAML files containing
strings in various languages including Japanese and write the same
strings back out unmolested. UTF-8 seems like the natural choice, but
the encoding could be different, so long as I can do some processing
and keep the strings human-readable when I spit them back out. I don't
even need to modify the strings themselves, just modify the sets they
belong to and write them back out.
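For concreteness, the whole round trip I'm after is roughly this (the
file names are just placeholders):

require 'yaml'

# Read a hash of strings from a UTF-8 YAML file, shuffle the hash
# around without touching the strings, write it back out.
strings = YAML.load(File.read('strings_in.yml'))
# ... regroup / reorganize the strings here ...
File.open('strings_out.yml', 'w') { |f| f.write(strings.to_yaml) }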
Are ruby and YAML just not an option here? Any other suggestions?
If you're not familiar, here's the basic problem. I have the following
YAML:
jp: "はい"
(If that doesn't display right, it's just the Japanese characters for
the word "yes".)
I read it in via YAML.load and I get:
{"\357\273\277jp"=>"\343\201\257\343\201\204"}
OK, not so bad: that's the UTF-8 byte order mark stuck on the front of
the key, but I can deal with that, and the six octal-escaped bytes do
indeed correspond to the UTF-8 encoding of the two characters I expect.
The problem is when I try to put this back out, and YAML decides to
take my string and convert it to binary data:
---
"\xEF\xBB\xBFjp": !binary |
  44Gv44GE
Blah.
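For the record, that output is just the result of calling to_yaml on
the hash YAML.load handed me, roughly:

require 'yaml'

# Round-trip the hash straight back out; the file name is a placeholder.
data = YAML.load(File.read('jp.yml'))
puts data.to_yaml   # the string comes back out as !binary, as shown above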