REXML & Extended characters - newbie question

I am doing a quick and dirty automatic translation from English to
Spanish of some text in an XML document.

However, the translation returns characters outside the 7-bit range,
which seems to create an invalid XML document. I need those strings
UTF-8 encoded before I set the text of an element, but I can't see how
to do this.

Thanks for any help

Regards
Ralph

A test doc looks like

<?xml version='1.0' encoding='UTF-8'?>

Vehicle

Full code.

require 'net/http'
require 'cgi'
require 'rexml/document'

# Ask Google Translate for the Spanish version of text and scrape the
# result out of the returned HTML.
def translate(text)
  puts "translating #{text}"
  ret = ""
  Net::HTTP.start('translate.google.com') { |session|
    session.get("/translate_t?langpair=en|es&hl=en&text=#{CGI.escape(text)}") { |result|
      ret << result
    }
  }
  ret =~ /(name=q.*?>)(.*?)</
  $2
end

# Translate the text of node, then recurse into its child elements.
def process(node)
  puts node.name
  node.text = translate(node.text) if node.text && node.text.strip != ""
  node.elements.each { |x| process x }
end

doc = REXML::Document.new File.new("lang_eng.xml")
doc.elements.each { |x| process x }
doc.write(File.new("lang_spn.xml", "w"), 0)

Hi!

  • Ralph Mason:

I need those string utf8 encoded before I set the text of an
element.

IIRC the encoding used by Google defaults to ISO-8859-1 while adding
an explicit ‘en=utf-8’ to the argument part of the URL makes it use
utf-8.

Josef ‘Jupp’ SCHUGT

http://oss.erdfunkstelle.de/ruby/ - German comp.lang.ruby-FAQ
http://rubyforge.org/users/jupp/ - Ruby projects at Rubyforge
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Germany 2004: To boldly spy where no GESTAPO / STASI has spied before

Josef ‘Jupp’ SCHUGT wrote:

> IIRC the encoding used by Google defaults to ISO-8859-1 while adding
> an explicit ‘en=utf-8’ to the argument part of the URL makes it use
> utf-8.

Thanks for that, I’ll give it a go. I had a workaround with

node.text = str.unpack("C*").pack("U*")
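
A minimal sketch of that kind of conversion, assuming the bytes coming back really are ISO-8859-1: unpack("C*") is called on the String and reads each byte as an integer, and pack("U*") is called on the resulting Array and writes each integer back out as a UTF-8 encoded code point. The helper name latin1_to_utf8 is made up for illustration, and the String#encode comparison at the end assumes Ruby 1.9 or later.

```ruby
# Hypothetical helper: reinterpret ISO-8859-1 bytes as code points and
# re-emit them as UTF-8. This works because Latin-1 byte values are the
# same numbers as the Unicode code points U+0000..U+00FF.
def latin1_to_utf8(str)
  str.unpack("C*").pack("U*")
end

# "Vehiculo" with an i-acute, as raw ISO-8859-1 bytes (0xED).
latin1 = "Veh\xEDculo".b

utf8 = latin1_to_utf8(latin1)
puts utf8.bytes.length   # one byte longer than the Latin-1 input

# Ruby 1.9+ only (assumption): the built-in transcoder does the same job.
same = latin1.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts utf8 == same
```

This only holds for Latin-1 input; if the service already sends UTF-8, running it again would double-encode the text.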

It would be good if there was some documentation somewhere about text
conversions and REXML, or some kind of encoding-aware string class
that could act as an intermediary.
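
On the REXML side, a rough sketch, assuming Ruby 1.9 or later (where a String carries its own encoding) and that the rexml library is installed: once the string is valid UTF-8 it can be assigned as element text and the document survives a write and re-parse. The helper build_doc and the element name "item" are invented for illustration and are not from the original file.

```ruby
require 'rexml/document'

# Hypothetical helper: build a one-element document holding UTF-8 text.
def build_doc(text)
  doc = REXML::Document.new
  doc << REXML::XMLDecl.new("1.0", "UTF-8")
  root = doc.add_element("item")   # element name is made up
  root.text = text
  doc
end

doc = build_doc("Veh\u00EDculo")

out = ""
doc.write(out)   # serialize into a String instead of a File
puts out

# Parse the serialized document back and read the text again.
reparsed = REXML::Document.new(out)
puts reparsed.root.text
```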

Ralph