Thanks for the replies. Actually, as I was doing something else,
another option occurred to me which seems both to a) work properly and
b) be safe(-ish):
irb(main):001:0> $KCODE = 'u'
=> "u"
irb(main):002:0> s = "€"
=> "€"
irb(main):003:0> x = s.dump
=> "\"\\342\\202\\254\""
irb(main):004:0> t = ""
=> ""
irb(main):005:0> t.instance_eval x
=> "€"
Since all I ever want is to have the data back in the string, and String
doesn't have any methods that are likely to cause problems, this might
be a reasonable short-to-medium term solution. It still makes me a bit
uncomfortable though, because I don't really want anything other than
the encoded characters handled.
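
On the safety worry: instance_eval still evaluates its argument as Ruby code, so it is only as safe as the dumped text is trustworthy. As a sketch, and assuming Ruby 2.5 or later (the session above is 1.8, where this method does not exist), String#undump reverses String#dump without evaluating anything and rejects text that is not a dumped literal. The helper name `restore` is made up for illustration:

```ruby
# Hypothetical helper: turn String#dump output back into the original
# string without eval. Assumes Ruby >= 2.5, where String#undump exists.
def restore(dumped)
  dumped.undump
end

euro = "€"
dumped = euro.dump          # an ASCII-only quoted literal
restored = restore(dumped)  # the original string again

# Text that is not a dumped literal is rejected instead of being executed.
rejected = begin
  restore("system('echo oops')")
  false
rescue RuntimeError
  true
end
```

This only helps where every consumer of the data is Ruby, so it does not address the interoperability goal.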
I can't use Marshal, because I need the data available as plain text
(hence the quoted strings), and that text isn't guaranteed always to be
processed by Ruby. I chose String#dump because it seemed like it would
always generate a "safe" string that could be parsed using normal quoted
literal recognition. I hadn't tested it with lots of Unicode data until
recently, because I simply hadn't gotten there yet. I was just
lucky...
Even the Unicode handling is straightforward enough, and since I posed
the question, I found this blog:
http://dilettantes.code4lib.org/2009/04/parsing-escaped-unicode-in-ruby/
which talks about modifying the JSON parser approach. I might be able
to do that, or I might end up needing to write my own
serializer/deserializer, since at this stage (over a year in), I've a
lot of legacy data lying around that was created with this approach.
I guess I could write a one-off clean-up utility for the data that I
have now and then use the JSON library just to encode/decode the
strings, but that seems like overkill.
My goals here are interoperability, reuse, ease of adapting to my
existing code (in that order). Until I ran across the site, I hadn't
thought about the JSON approach, but it might make the most sense for
interoperable data. Mind you, I only care about safe string
serialization/deserialization, and I've no use in the application for
the rest of the JSON spec.
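
If the JSON route turns out to be the one to take, the string-only subset might look like the sketch below. It assumes the json library that ships with Ruby 1.9 and later and its ascii_only generator option; the helper names are invented here. The string is wrapped in a one-element array because older versions of the library only generate arrays and objects at the top level. Note that JSON writes characters outside the Basic Multilingual Plane as a pair of \uXXXX surrogates, not as \UXXXXXXXX.

```ruby
require "json"

# Hypothetical helper: serialize one string as an ASCII-only JSON string
# literal, including the surrounding double quotes.
def json_escape(str)
  # Generate ["..."] and strip the enclosing brackets.
  JSON.generate([str], :ascii_only => true)[1..-2]
end

# Hypothetical helper: parse a JSON string literal back into a Ruby string.
def json_unescape(text)
  JSON.parse("[#{text}]").first
end

original = "price: € \"quoted\" back\\slash\ttab"
encoded  = json_escape(original)   # quoted, escaped, ASCII-only text
decoded  = json_unescape(encoded)  # the original string again
```

Because the result is an ordinary JSON string literal, any JSON parser in another language should be able to read it back.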
Changing the question a little: does anyone know of the best way to
serialize and parse strings containing Unicode and other non-printing
characters? Ideally, I'd like to have something that works like
String#dump except that it uses escaped Unicode code point references,
e.g. \uxxxx and \Uxxxxxxxx, and handles all of the "usual suspects" like
\", \\, etc.
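
To make the question concrete, here is a rough sketch of the kind of thing being asked for. It is written for Ruby 1.9 or later and assumes valid UTF-8 input; the method names and the short list of backslash escapes are assumptions for illustration, not an existing library API. It does not validate code point ranges or reject unescaped quotes inside the literal.

```ruby
# Single-character escapes handled by this sketch (an assumed, short list).
ESCAPES = {
  "\\" => "\\\\", "\"" => "\\\"", "\n" => "\\n", "\r" => "\\r", "\t" => "\\t"
}.freeze
UNESCAPES = ESCAPES.invert.freeze

# Hypothetical String#dump look-alike: emits an ASCII-only quoted literal
# using \uXXXX for the BMP and \UXXXXXXXX for everything above it.
def escape_unicode(str)
  body = str.each_char.map do |ch|
    cp = ch.ord
    if ESCAPES.key?(ch)
      ESCAPES[ch]
    elsif cp >= 0x20 && cp < 0x7F
      ch                          # printable ASCII passes through
    elsif cp <= 0xFFFF
      format("\\u%04X", cp)
    else
      format("\\U%08X", cp)
    end
  end.join
  "\"#{body}\""
end

# Hypothetical inverse: decodes the literal with a regex, never with eval.
def unescape_unicode(text)
  quoted = /\A"(.*)"\z/m.match(text)
  raise ArgumentError, "not a quoted literal" unless quoted
  quoted[1].gsub(/\\(?:u(\h{4})|U(\h{8})|(.))/m) do
    m = Regexp.last_match
    if m[1] || m[2]
      [(m[1] || m[2]).hex].pack("U")   # code point to UTF-8
    else
      UNESCAPES.fetch("\\#{m[3]}") do
        raise ArgumentError, "unknown escape \\#{m[3]}"
      end
    end
  end
end
```

With this, escape_unicode("€") gives "\"\\u20AC\"" and unescape_unicode turns it back into "€".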
Doing some more googling, I also came across this, but I'm not sure what
the status of it is, and I'm not sure that it addresses my issue either.
It seems to be more about processing Unicode rather than serialization
of Unicode to ASCII. (http://snippets.dzone.com/posts/show/4527).
[much time passes...including lunch]
After arsing around for a long time with various stupid stuff, I finally
came up with this. I don't really like it, but it seems to do the job.
Comments welcome:
irb(main):026:0> euro = "€"
=> "€"
irb(main):027:0> x = euro.dump
=> "\"\\342\\202\\254\""
irb(main):028:0> x.gsub(/\\(\d\d\d)/) { [ $1.oct ].pack("c") }[1..-2]
=> "€"
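
One caveat with the gsub above: the pattern also fires on digits that follow an escaped backslash, and it leaves \" and \\ untouched. A slightly fuller sketch, written for Ruby 2.0 or later (where String#b exists and strings carry encodings, unlike the 1.8 session above), might look like this. The helper name and the list of single-character escapes are assumptions made for illustration, not a complete inverse of String#dump, and the result is assumed to be UTF-8:

```ruby
# Single-character escapes recognised by this sketch (an assumed list).
SIMPLE_ESCAPES = {
  "n" => "\n", "t" => "\t", "r" => "\r", "f" => "\f",
  "v" => "\v", "a" => "\a", "b" => "\b", "e" => "\e"
}.freeze

# Hypothetical helper: decode 1.8-style dump output (three-digit octal
# escapes) without evaluating it as Ruby code.
def decode_octal_dump(dumped)
  body = dumped[1..-2].b   # strip the quotes, work on raw bytes
  decoded = body.gsub(/\\([0-3][0-7]{2}|.)/m) do
    esc = Regexp.last_match(1)
    if esc.length == 3
      [esc.oct].pack("C")            # one octal escape is one byte
    else
      SIMPLE_ESCAPES.fetch(esc, esc) # \\ and \" fall through as-is
    end
  end
  decoded.force_encoding("UTF-8")    # assumption: the bytes are UTF-8
end

euro = decode_octal_dump("\"\\342\\202\\254\"")
```

Consuming the backslash and whatever follows it in a single match is what stops an escaped backslash from being misread as the start of an octal escape.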
However, this doesn't get me in/out of the "standard" Unicode escapes.
Thanks in advance for any ideas or suggestions.
Cheers,
ast
On Thu, 2009-04-23 at 20:41 +0900, Urabe Shyouhei wrote:
Andrew S. Townley wrote:
> Which, of course, works. However, I'm a bit leery of doing this from a
> safety perspective, because I really don't have any control over these
> strings, and I'd prefer not to allow the execution of arbitrary Ruby
> code every time I'm trying to restore strings (I need them serialized as
> appropriately escaped quoted literals).
IMHO it is a bad idea to use String#dump when you cannot control those strings.
My recommendation is to use Marshal.dump, which also generates a string.
Adding quotes to those marshal-generated strings should be easier than safely
evaluating a dumped string.
--
Andrew S. Townley <ast@atownley.org>
http://atownley.org