Using Unicode in YAML

I've done my research, and it appears that the current Ruby YAML
implementation doesn't really grok Unicode. What I want to know is
whether anything has changed in this regard, or is likely to in the
future. It doesn't appear that syck has been updated since May '05,
but if it's something I could fix, I'd be willing to do it.

In any case, here's what I need to do: read in YAML files containing
strings in various languages including Japanese and write the same
strings back out unmolested. UTF-8 seems like the natural choice, but
the encoding could be different, so long as I can do some processing
and keep the strings human readable when I spit them back out. I don't
even need to modify the strings themselves, just modify sets of them
and output.

Are Ruby and YAML just not an option here? Any other suggestions?

If you're not familiar, here's the basic problem. I have the following
YAML:

jp: "はい"

(If that doesn't display right, it's just the Japanese characters for
the word "yes".)

I read it in via YAML.load and I get:

{"\357\273\277jp"=>"\343\201\257\343\201\204"}

OK, not so bad. The UTF-8 byte order mark is stuck on the front of the
key, but I can deal with that, and the six bytes in octal do indeed
correspond to the UTF-8 encoding of the two characters I expect. The
problem is when I try to write this back out, and YAML decides to take
my string and convert it to binary data:

---
"\xEF\xBB\xBFjp": !binary |
  44Gv44GE

Blah.
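
For what it's worth, the !binary payload does not seem to lose
anything: it looks like plain Base64 of the same six bytes, which is
exactly why it is no longer human readable. A small sketch to check
that, using only String#unpack (the variable names are just mine for
illustration):

```ruby
# The payload YAML wrote after "!binary |"
payload = "44Gv44GE"

# "m" is the Base64 directive of String#unpack
decoded = payload.unpack("m").first

# The six bytes YAML.load returned for the value
original = "\343\201\257\343\201\204"

# Compare byte by byte, so string encodings do not get in the way
puts decoded.unpack("C*") == original.unpack("C*")   # true

# Decode the bytes as UTF-8 and show the code points
code_points = decoded.unpack("U*")
puts code_points.map { |cp| "U+%04X" % cp }.join(" ")   # U+306F U+3044
```

U+306F and U+3044 are the two hiragana characters from the input.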

Hello baumanj,

\xEF\xBB\xBF from your example above is the UTF-8 byte order mark
(BOM). Some editors write it at the start of a file to mark the file
as UTF-8; it is optional, not required. \357\273\277 is just the same
three bytes written in octal instead of hex. Have you tried removing
these three bytes manually before loading the file as YAML? I
currently have a similar problem, and I just found out that these
three bytes are imported (wrongly, I think) via YAML.load_file into
one of my objects (in your case: the "jp" key).
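
If you want to convince yourself that the octal and hex spellings are
the same three bytes, a quick check along these lines should do (just
a sketch; the variable names are mine):

```ruby
octal = "\357\273\277"   # as printed by inspect in your example
hex   = "\xEF\xBB\xBF"   # as printed in the YAML output

puts octal == hex               # true
p octal.unpack("C*")            # [239, 187, 191]

# Decoded as UTF-8, the three bytes are the single code point U+FEFF
bom_code_points = octal.unpack("U*")
puts "U+%04X" % bom_code_points.first   # U+FEFF
```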

So try this:

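      require 'yaml'

      # read the whole file; the block form closes it for us
      raw = File.open("jp.txt", "r") { |f| f.read }
      # remove the BOM, but only if the file really starts with one
      raw_without_bom = raw.sub(/\A\xEF\xBB\xBF/, "")
      # now parse the YAML
      hash = YAML.load(raw_without_bom)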
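
A more self-contained variant of the same idea, as a rough sketch
only. It assumes a newer Ruby (one that has String#delete_prefix and
Tempfile.create) whose strings carry an encoding; strip_utf8_bom and
load_yaml_without_bom are hypothetical helper names, not part of the
YAML library, and the temporary file just stands in for jp.txt:

```ruby
require "yaml"
require "tempfile"

# Hypothetical helper: drop a leading BOM (U+FEFF) if there is one,
# otherwise return the text unchanged.
def strip_utf8_bom(text)
  text.delete_prefix("\uFEFF")
end

# Hypothetical helper: read a file as UTF-8, strip the BOM, parse the YAML.
def load_yaml_without_bom(path)
  raw = File.read(path, encoding: "UTF-8")
  YAML.load(strip_utf8_bom(raw))
end

# Stand-in for jp.txt: a BOM, then the key and the two Japanese characters
sample = "\uFEFFjp: \"\u306F\u3044\"\n"

hash = Tempfile.create("jp") do |f|
  File.binwrite(f.path, sample)   # write the bytes exactly as given
  load_yaml_without_bom(f.path)
end

p hash.keys            # ["jp"]
puts YAML.dump(hash)   # write it back out to see how the value is emitted
```

Whether the dumped value comes out readable or as !binary depends on
the YAML library behind your Ruby, so treat the last line as something
to try rather than a promise. Newer Rubies also accept the mode string
"r:BOM|UTF-8" when opening a file, which skips the BOM while reading.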

Hope that helps.

Happy new year!

Rainer