Parsing String#dump data?

Andrew_S_Townley1 · 23 April 2009 10:03

Hi,

I was wondering if there was a better/safer way to parse string data
that has been dumped using the String#dump method. At the moment, I've
been using regular expressions to do it, but that doesn't seem to work
with unicode characters, since they get dumped as follows:

irb(main):007:0> s = "€"
=> "€"
irb(main):008:0> s.dump
=> "\"\\342\\202\\254\""

On a lark, I figured that Ruby would be able to get them back, so I
tried this:

irb(main):009:0> x = eval s.dump
=> "€"

Which, of course, works. However, I'm a bit leery of doing this from a
safety perspective, because I really don't have any control over these
strings, and I'd prefer not to allow the execution of arbitrary Ruby
code every time I'm trying to restore strings (I need them serialized as
appropriately escaped quoted literals).

Has anyone else ever needed to do this, and, if so, how did you solve
the problem? I guess I could do another pass on the string looking for
'\\[\n]+' values and try and combine them some way, but I'm not really
sure how to do that either.

Any ideas?

Cheers,

ast


--
Andrew S. Townley <ast@atownley.org>
http://atownley.org

Forum · 23 April 2009 11:09

On Thu, Apr 23, 2009 at 12:03 PM, Andrew S. Townley <ast@atownley.org> wrote:
> Hi,
>
> I was wondering if there was a better/safer way to parse string data
> that has been dumped using the String#dump method. At the moment, I've
> been using regular expressions to do it, but that doesn't seem to work
> with unicode characters, since they get dumped as follows:
>
> irb(main):007:0> s = "€"
> => "€"
> irb(main):008:0> s.dump
> => "\"\\342\\202\\254\""
>
> On a lark, I figured that Ruby would be able to get them back, so I
> tried this:
>
> irb(main):009:0> x = eval s.dump
> => "€"

Maybe
eval s.dump if /\A".*"*\Z/ === s.dump && !(/#[{@]/ === s.dump) # not tested
is safe, but I am not 100% sure.
Cheers
Robert


--
Si tu veux construire un bateau ...
Ne rassemble pas des hommes pour aller chercher du bois, préparer des
outils, répartir les tâches, alléger le travail… mais enseigne aux
gens la nostalgie de l’infini de la mer.

If you want to build a ship, don’t herd people together to collect
wood and don’t assign them tasks and work, but rather teach them to
long for the endless immensity of the sea.

--
Antoine de Saint-Exupéry

Urabe_Shyouhei1 · 23 April 2009 11:41

Andrew S. Townley wrote:

> Which, of course, works. However, I'm a bit leery of doing this from a
> safety perspective, because I really don't have any control over these
> strings, and I'd prefer not to allow the execution of arbitrary Ruby
> code every time I'm trying to restore strings (I need them serialized as
> appropriately escaped quoted literals).

IMHO it is a bad idea to use String#dump when you cannot control those strings.

My recommendation is to use Marshal.dump, which also generates a string.
Adding quotes to those marshal-generated strings should be easier than safely
evaluating a dumped string.
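
A minimal sketch of the Marshal route, under stated assumptions: the helper names are hypothetical, and Base64 via Array#pack("m0") is swapped in here for the "adding quotes" step so that the binary Marshal output becomes plain ASCII text. Marshal.load should only be called on data you produced yourself, so this does not solve the untrusted-input problem on its own.

```ruby
# Hypothetical helpers; the names are illustrative only.
def marshal_to_text(str)
  # Marshal.dump returns a binary string; "m0" encodes it as Base64
  # without line breaks, which is safe to store as plain text.
  [Marshal.dump(str)].pack("m0")
end

def marshal_from_text(text)
  # Only call this on text you generated yourself.
  Marshal.load(text.unpack("m0").first)
end

text = marshal_to_text("\u20AC \"quoted\"\n")
puts text
puts marshal_from_text(text)
```

The result is readable only by Ruby, which is the trade-off of this approach.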

Andrew_S_Townley1 · 23 April 2009 15:12

Thanks for the replies. Actually, as I was doing something else,
another option occurred to me which seems both to a) work properly and
b) be safe(-ish):

irb(main):001:0> $KCODE = 'u'
=> "u"
irb(main):002:0> s = "€"
=> "€"
irb(main):003:0> x = s.dump
=> "\"\\342\\202\\254\""
irb(main):004:0> t = ""
=> ""
irb(main):005:0> t.instance_eval x
=> "€"

Since all I ever want is to have the data back in the string, and string
doesn't have any methods that are likely to cause problems, this might
be a reasonable short-to-medium term solution. It still makes me a bit
uncomfortable though, because I don't really want anything other than
the encoded characters handled.

I can't use Marshal, because I need to have the data available as plain
text (hence the quoted strings) which isn't necessarily guaranteed to be
always processed by Ruby. I chose String#dump because it seemed like it
would always generate a "safe" string that would be parsed using normal
quote literal recognition. I hadn't tested it until recently with lots
of Unicode data, because I simply hadn't gotten there yet. I was just
lucky...

Even the Unicode handling is straightforward enough, and since I posed
the question, I found this blog:
http://dilettantes.code4lib.org/2009/04/parsing-escaped-unicode-in-ruby/
which talks about modifying the JSON parser approach. I might be able
to do that, or, I might need to end up writing my own
serializer/deserializer, since at this stage (over a year), I've a lot
of legacy data lying around that was created with this approach.

I guess, I could write a one-off clean-up utility for the data that I
have now and then use the JSON library just to encode/decode the
strings, but that seems like overkill.

My goals here are interoperability, reuse, ease of adapting to my
existing code (in that order). Until I ran across the site, I hadn't
thought about the JSON approach, but it might make the most sense for
interoperable data. Mind you, I only care about safe string
serialization/deserialization, and I've no use in the application for
the rest of the JSON spec.

Changing the question a little: does anyone know of the best way to
serialize and parse strings containing Unicode and other non-printing
characters? Ideally, I'd like to have something that works like
String#dump except that it used escaped Unicode code point references,
e.g. \uxxxx and \Uxxxxxxxx, and handles all of the "usual suspects" like
\", \\, etc.
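
For readers on newer Rubies than the ones in this thread: as far as I know, String#dump on a UTF-8 string emits \u escapes from Ruby 1.9 onward, and Ruby 2.5 added String#undump as its inverse, which parses the escapes rather than evaluating the text. A small sketch, assuming Ruby 2.5 or later:

```ruby
original = "\u20AC \"quoted\" \\ tab\there"

dumped   = original.dump    # ASCII-only, double-quoted literal
restored = dumped.undump    # Ruby 2.5+; no eval involved

puts dumped
puts restored == original
```

This does not help with legacy data dumped by Ruby 1.8, which uses octal byte escapes such as \342\202\254.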

Doing some more googling, I also came across this, but I'm not sure what
the status of it is, and I'm not sure that it addresses my issue either.
It seems to be more about processing Unicode rather than serialization
of Unicode to ASCII. (http://snippets.dzone.com/posts/show/4527).

[much time passes...including lunch]

After arsing around for a long time with various stupid stuff, I finally
came up with this. I don't really like it, but it seems to do the job.
Comments welcome:

irb(main):026:0> euro = "€"
=> "€"
irb(main):027:0> x = euro.dump
=> "\"\\342\\202\\254\""
irb(main):028:0> x.gsub(/\\(\d\d\d)/) { [ $1.oct ].pack("c") }[1..-2]
=> "€"

However, this doesn't get me in/out of the "standard" Unicode escapes.
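
The gsub idea above can be extended into a small decoder that never calls eval. This is only a sketch: undump_without_eval and the constant names are hypothetical, it assumes the original data was UTF-8, and it handles octal, \xHH, \uXXXX, \u{...} and the common single-character escapes, raising on anything else.

```ruby
# Hypothetical helper, not part of Ruby: decodes a String#dump-style
# literal by pattern matching only, so nothing is ever evaluated.
SIMPLE_ESCAPES = {
  "n" => "\n", "t" => "\t", "r" => "\r", "f" => "\f", "v" => "\v",
  "b" => "\b", "a" => "\a", "e" => "\e",
  "\\" => "\\", '"' => '"', "#" => "#"
}.freeze

# A whole literal: opening quote, then plain bytes or backslash pairs,
# then the closing quote. An unescaped quote in the middle is rejected.
LITERAL_RE = /\A"((?:[^"\\]|\\.)*)"\z/m

# One escape: octal byte, hex byte, \uXXXX, \u{...}, or a single character.
ESCAPE_RE = /\\(?:([0-7]{1,3})|x(\h{1,2})|u(\h{4})|u\{(\h+)\}|(.))/m

def undump_without_eval(literal)
  match = LITERAL_RE.match(literal.b) # work on raw bytes
  raise ArgumentError, "not a well-formed quoted literal" unless match

  decoded = match[1].gsub(ESCAPE_RE) do
    octal, hex, u4, u_braced, simple = $~.captures
    if octal
      (octal.oct & 0xFF).chr                 # one byte, e.g. \342
    elsif hex
      hex.hex.chr                            # one byte, e.g. \xE2
    elsif u4 || u_braced
      [(u4 || u_braced).hex].pack("U").b     # code point as UTF-8 bytes
    else
      SIMPLE_ESCAPES.fetch(simple) do
        raise ArgumentError, "unsupported escape: \\#{simple}"
      end
    end
  end

  # Assumption: the original data was UTF-8.
  decoded.force_encoding(Encoding::UTF_8)
end

puts undump_without_eval("\"\\342\\202\\254\"")
```

Because the input is only matched against patterns, text such as backticks comes back as plain characters, and anything that is not a single well-formed literal raises ArgumentError.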

Thanks in advance for any ideas or suggestions.

Cheers,

ast


Urabe_Shyouhei1 · 23 April 2009 19:18

Andrew S. Townley wrote:

> Thanks for the replies. Actually, as I was doing something else,
> another option occurred to me which seems both to a) work properly and
> b) be safe(-ish):
>
> irb(main):001:0> $KCODE = 'u'
> => "u"
> irb(main):002:0> s = "€"
> => "€"
> irb(main):003:0> x = s.dump
> => "\"\\342\\202\\254\""
> irb(main):004:0> t = ""
> => ""
> irb(main):005:0> t.instance_eval x
> => "€"

irb(main):001:0> t = ""
=> ""
irb(main):002:0> t.instance_eval "`ls`"
=> "tmp.txt\ntmp.rb\n"

> Since all I ever want is to have the data back in the string, and string
> doesn't have any methods that are likely to cause problems, this might
> be a reasonable short-to-medium term solution. It still makes me a bit
> uncomfortable though, because I don't really want anything other than
> the encoded characters handled.

Be sure your strings do not include something like `rm -rf` ...

> I can't use Marshal, because I need to have the data available as plain
> text (hence the quoted strings) which isn't necessarily guaranteed to be
> always processed by Ruby. I chose String#dump because it seemed like it
> would always generate a "safe" string that would be parsed using normal
> quote literal recognition. I hadn't tested it until recently with lots
> of Unicode data, because I simply hadn't gotten there yet. I was just
> lucky...

How about Array#pack? It can escape strings as MIME
quoted-printable:

irb(main):001:0> s = "abcd€fghi"
=> "abcd€fghi"
irb(main):002:0> t = [s].pack("M")
=> "abcd=E2=82=ACfghi=\n"
irb(main):003:0> t.unpack("M")[0].force_encoding("UTF-8")
=> "abcd€fghi"

# that force_encoding thing is required for ruby 1.9.

> Even the Unicode handling is straightforward enough, and since I posed
> the question, I found this blog:
> http://dilettantes.code4lib.org/2009/04/parsing-escaped-unicode-in-ruby/
> which talks about modifying the JSON parser approach. I might be able
> to do that, or, I might need to end up writing my own
> serializer/deserializer, since at this stage (over a year), I've a lot
> of legacy data lying around that was created with this approach.

JSON is part of Ruby's stdlib these days (1.9 and above). Using it might be
easier than you might think at first.

irb(main):001:0> require 'json'
=> true
irb(main):002:0> "€".to_json
=> "\"\\u20ac\""
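
The transcript above only shows the encoding direction. A sketch of a full round trip follows, with some assumptions: the helper names are hypothetical, the string is wrapped in an array because older json versions reject a bare string at the top level, and ascii_only: true is passed because newer json versions may otherwise emit the character unescaped.

```ruby
require "json"

# Hypothetical helpers; the names are not part of the json library.
def to_ascii_literal(str)
  # Generate ["..."] and strip the brackets, leaving only the quoted literal.
  JSON.generate([str], ascii_only: true)[1..-2]
end

def from_ascii_literal(literal)
  parsed = JSON.parse("[#{literal}]")
  unless parsed.length == 1 && parsed.first.is_a?(String)
    raise ArgumentError, "expected exactly one JSON string"
  end
  parsed.first
end

literal = to_ascii_literal("\u20AC \"quoted\"\n")
puts literal
puts from_ascii_literal(literal)
```

The parser never evaluates its input, so malformed or hostile text ends in a JSON::ParserError or ArgumentError rather than in code execution.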

> I guess, I could write a one-off clean-up utility for the data that I
> have now and then use the JSON library just to encode/decode the
> strings, but that seems like overkill.
>
> My goals here are interoperability, reuse, ease of adapting to my
> existing code (in that order). Until I ran across the site, I hadn't
> thought about the JSON approach, but it might make the most sense for
> interoperable data. Mind you, I only care about safe string
> serialization/deserialization, and I've no use in the application for
> the rest of the JSON spec.

Generally speaking, you cannot be safe when eval or eval-type methods are used.
So you have to either (1) write your own deserializer without evals, or (2) use
an existing one like JSON. I guess using existing libraries is not a bad idea
for interoperability, so JSON might not be that much overkill. Quoted-printable
is defined in an RFC, so it might also be a good alternative.

> Changing the question a little: does anyone know of the best way to
> serialize and parse strings containing Unicode and other non-printing
> characters? Ideally, I'd like to have something that works like
> String#dump except that it used escaped Unicode code point references,
> e.g. \uxxxx and \Uxxxxxxxx, and handles all of the "usual suspects" like
> \", \\, etc.

If you want \uxxxx-style escapes, the JSON library is the best bet, I think.
Another choice is the YAML stdlib, but it generates backslashed escapes, so you
would need to convert them anyway.

