Using unicode in YAML

baumanj · 11 December 2007 21:10

I've done my research, and it appears that the current Ruby YAML
implementation doesn't really grok Unicode. What I want to know is
whether anything has changed in this regard, or is likely to in the
future. It doesn't appear that syck has been updated since May '05,
but if it's something I could fix, I'd be willing to do it.

In any case, here's what I need to do: read in YAML files containing
strings in various languages including Japanese and write the same
strings back out unmolested. UTF-8 seems like the natural choice, but
the encoding could be different, so long as I can do some processing
and keep the strings human readable when I spit them back out. I don't
even need to modify the strings themselves, just modify sets of them
and output.

Are Ruby and YAML just not an option here? Any other suggestions?

If you're not familiar, here's the basic problem. I have the following
YAML:

jp: "はい"

(If that doesn't display right, it's just the Japanese characters for
the word "yes".)

I read it in via YAML.load and I get:

{"\357\273\277jp"=>"\343\201\257\343\201\204"}

OK, not so bad: the UTF-8 byte order mark is stuck on the front of the
key, but I can deal with that, and the six bytes in octal do indeed
correspond to the UTF-8 encoding of the two characters I expect. The
problem is when I try to write this back out, and YAML decides to take
my string and convert it to binary data:

---
"\xEF\xBB\xBFjp": !binary |
  44Gv44GE

Blah.
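
To check that the !binary value is the same text and nothing was lost, the payload can be decoded by hand. This is only a sketch, assuming a current Ruby where strings carry an encoding (not the 1.8 interpreter discussed here); the variable names are made up for illustration.

```ruby
# Payload copied from the !binary output above.
payload = "44Gv44GE"

# "m" is the Base64 directive for unpack; the result is a raw byte string.
decoded = payload.unpack("m").first

# The same six bytes that YAML.load returned, written as octal escapes.
loaded = "\343\201\257\343\201\204".b

puts decoded == loaded                          # prints true: identical bytes
puts decoded.force_encoding("UTF-8") == "はい"  # prints true: same text once tagged as UTF-8
```

So the data survives the round trip; what is lost is only the human-readable form.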

Rainer · 30 December 2007 21:40

Hello baumanj,

\xEF\xBB\xBF from your example above is the byte order mark (BOM)
that some editors put at the start of a file to identify it as UTF-8.
\357\273\277 are just the octal escapes for the same bytes that
\xEF\xBB\xBF shows in hex. Have you tried removing these three bytes
manually before writing your string as YAML? I currently have a
similar problem, and I just found out that these three bytes are
imported (wrongly, I think) via YAML.load_file into one of my objects
(in your case: the "jp" key).
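
The octal and hex forms can be compared directly. A small sketch, assuming a current Ruby with a UTF-8 source file; the variable names are illustrative only.

```ruby
bom_octal = "\357\273\277"   # as shown by inspect on the loaded hash
bom_hex   = "\xEF\xBB\xBF"   # as shown in the dumped YAML

puts bom_octal == bom_hex    # prints true: the same three bytes
puts bom_hex.bytes.inspect   # prints [239, 187, 191]
puts bom_hex == "\uFEFF"     # prints true: the UTF-8 encoding of U+FEFF
```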

So try this:

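      require 'yaml'

      # read the file as raw bytes; the block form closes it for us
      raw = File.open("jp.txt", "rb") { |f| f.read }
      # remove the BOM, but only if the file really starts with one
      raw = raw[3..-1] if raw.unpack("C3") == [0xEF, 0xBB, 0xBF]
      # now parse the YAML
      hash = YAML.load(raw)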
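
For comparison, here is how the whole round trip might look on a current Ruby, where the bundled YAML library is Psych instead of Syck. This is a sketch under that assumption, not something tested against the 1.8 setup in this thread. The string `raw` stands in for the contents of jp.txt, and since it is a UTF-8 string the BOM counts as one character, so `raw[1..-1]` is used where the byte-oriented code has `raw[3..-1]`.

```ruby
require 'yaml'

# Stand-in for the contents of jp.txt: a BOM followed by the YAML from the question.
raw = "\uFEFFjp: \"はい\"\n"

# Strip the BOM only if it is present (one character in a UTF-8 string).
text = raw.start_with?("\uFEFF") ? raw[1..-1] : raw

hash = YAML.load(text)
puts hash.inspect

# Dump it back out; with UTF-8 strings the Japanese text should stay readable.
out = YAML.dump(hash)
puts out
```

If the dumped text still contains !binary, the string most likely is not tagged as UTF-8 at that point.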

Hope that helps.

Happy new year!

Rainer
