Unicode escaping fun & games

Hi folks,

After my last question, I finally sat down and figured out an easy way
to do the conversions I wanted (at least the Unicode UTF-8 part).
Here's what I came up with, in case it's useful to others who have to
exchange data across environments in a 7-bit-safe encoding.

excalibur$ cat utf8.rb
# Created: Thu Apr 23 17:03:23 IST 2009

···

#
# This is some quick code to deal with UTF-8 manipulation and
# serialization of 7-bit ASCII representations.

$KCODE='u'
require 'jcode'

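# Replace every non-ASCII character in a UTF-8 string with \uXXXX.
# Note: code points above U+FFFF come out with more than four hex
# digits, which utf8_unpack below will not read back correctly.
def utf8_escape(str)
  s = ""
  str.each_char do |c|
    # First byte < 128 means a plain 7-bit ASCII character.
    x = c.unpack("C")[0]
    if x < 128
      s << c
    else
      # Decode the multibyte character to its code point and escape it.
      s << "\\u%04x" % c.unpack("U")[0]
    end
  end
  s
end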

# Turn each \uXXXX escape (exactly four hex digits) back into UTF-8.
def utf8_unpack(str)
  str.gsub(/\\u([0-9a-fA-F]{4})/) do
    [ $1.hex ].pack("U*")
  end
end

Running it:

excalibur$ irb
irb(main):001:0> require 'utf8'
=> true
irb(main):002:0> s = "Hello €!"
=> "Hello €!"
irb(main):003:0> t = utf8_escape(s)
=> "Hello \\u20ac!"
irb(main):004:0> u = utf8_unpack(t)
=> "Hello €!"
irb(main):005:0> s == u
=> true

excalibur$ irb
irb(main):001:0> s = "àcA绋féà"
=> "\303\240cA\347\273\213f\303\251\303\240"
irb(main):002:0> require 'utf8'
=> true
irb(main):003:0> s = "àcA绋féà"
=> "àcA绋féà"
irb(main):004:0> t = utf8_escape(s)
=> "\\u00e0cA\\u7ecbf\\u00e9\\u00e0"
irb(main):005:0> u = utf8_unpack(t)
=> "àcA绋féà"
irb(main):006:0> s == u
=> true
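
For anyone on Ruby 1.9 or later, where strings carry their own encoding
and (as far as I know) jcode is no longer shipped, a rough equivalent
might look like the sketch below. The names escape_non_ascii and
unescape_non_ascii are made up for illustration, and writing characters
above U+FFFF as a surrogate pair is my assumption (borrowed from the
JSON/Java convention), not something the code above does.

```ruby
# Sketch only, for Ruby 1.9+. Method names are hypothetical.

# Replace every non-ASCII character with \uXXXX. Characters above U+FFFF
# are written as a UTF-16 surrogate pair (an assumption, see lead-in).
def escape_non_ascii(str)
  # Assumes the input is valid text in whatever encoding it is tagged with.
  str.encode("UTF-8").each_char.map do |c|
    cp = c.ord
    if cp < 0x80
      c
    elsif cp <= 0xFFFF
      format("\\u%04x", cp)
    else
      v = cp - 0x10000
      format("\\u%04x\\u%04x", 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF))
    end
  end.join
end

# Reverse of the above. The surrogate-pair alternative is listed first so
# a pair is consumed as one character. A lone surrogate escape is not
# handled specially and would yield an invalid string.
def unescape_non_ascii(str)
  str.gsub(/\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})|\\u([0-9a-fA-F]{4})/) do
    if $3
      [$3.hex].pack("U")
    else
      hi = $1.hex - 0xD800
      lo = $2.hex - 0xDC00
      [0x10000 + (hi << 10) + lo].pack("U")
    end
  end
end
```

For characters at or below U+FFFF this should produce the same escaped
strings as the irb sessions above.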

It isn't 100% bullet-proof (for one thing, characters above U+FFFF
escape to more than four hex digits and won't round-trip), but it works
for the simple examples I tried, so this may be as far as I need to go
with that part. The next step is to roll this into a one-pass string
escaping routine so you don't need a bunch of gsub calls.
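
One way the one-pass idea could go is a single gsub whose pattern
matches anything that needs escaping, with a lookup table for the
simple cases. This is only a sketch for Ruby 1.9+: the set of escapes
in SIMPLE_ESCAPES and the name escape_one_pass are guesses for
illustration, since which characters actually need escaping depends on
the target format.

```ruby
# Hypothetical table of single-character escapes; adjust to taste.
SIMPLE_ESCAPES = {
  "\\" => "\\\\",
  "\"" => "\\\"",
  "\n" => "\\n",
  "\r" => "\\r",
  "\t" => "\\t"
}.freeze

# One gsub pass: either a character from the table or any non-ASCII
# character. Code points above U+FFFF come out with more than four hex
# digits here, so they would need extra handling to round-trip.
def escape_one_pass(str)
  str.gsub(/[\\"\n\r\t]|[^\x00-\x7F]/) do |c|
    SIMPLE_ESCAPES.fetch(c) { format("\\u%04x", c.ord) }
  end
end
```

The block runs once per character that needs attention, so the string
is scanned a single time no matter how many kinds of escape there are.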

Any suggestions, comments and improvements are welcome.

Cheers,

ast
--
Andrew S. Townley <ast@atownley.org>
http://atownley.org