Problems making UTF-8 text XML/XHTML friendly (no entity conversion?)

I’m downloading information from a website in UTF-8, but I want to
place the text I receive into XML and XHTML output files. In a
nutshell, if I run into a u with an umlaut on top, like ü, I would
like Ruby to replace it with &amp;uuml; or &amp;#252; or &amp;#xFC; (that is, the
entity written out as &uuml; or &#252; or &#xFC;).

I made some headway trying to use unpack("U") when I ran into non-ASCII
characters, but it did not seem to handle my test case very well.
REXML’s normalize function did not help either. Here is the test-case I
am trying to handle, from http://toadstool.se/temp/utf (you may want to
fetch the file instead of relying on what’s pasted below)

Es befinden sich 3 Streichholzschachteln im Cache, diese können gegen
Zündhölzer aus aller Welt getauscht werden.

Does anyone here have any recipes for replacing all non-ASCII characters with
entities? If so, I would really appreciate the help. Thanks!

/ Thomas

[Thomas Strömberg thomasNOMORESPAM@stromberg.org, 2004-05-31 20.03 CEST]

I’m downloading information from a website in UTF-8, but I want to
place the text I receive into XML and XHTML output files. In a
nutshell, if I run into a u with an umlaut on top, like ü, I would
like Ruby to replace it with &amp;uuml; or &amp;#252; or &amp;#xFC; (that is, the
entity written out as &uuml; or &#252; or &#xFC;).

I made some headway trying to use unpack("U") when I ran into non-ASCII
characters, but it did not seem to handle my test case very well.
REXML’s normalize function did not help either. Here is the test-case I
am trying to handle, from http://toadstool.se/temp/utf (you may want to
fetch the file instead of relying on what’s pasted below)

Es befinden sich 3 Streichholzschachteln im Cache, diese können gegen
Zündhölzer aus aller Welt getauscht werden.

Does anyone here have any recipes for replacing all non-ASCII characters with
entities? If so, I would really appreciate the help. Thanks!

I thought of this:

str.gsub(/[^ -~]/u) { |match| "&##{match.unpack('U')[0]};" }

but I fear it will be very slow…

(But with your test string it was very quick ;))
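For anyone who wants to try it, here is the same idea as a small self-contained sketch. The method name numeric_entities is made up for this example, and it assumes the input string is already valid UTF-8. It uses a format string instead of interpolation so that the literal "#" in "&#" cannot be mistaken for the start of "#{...}".

```ruby
# Hypothetical helper name, not from the original thread.
# Assumes +str+ is valid UTF-8.
def numeric_entities(str)
  # Match every character outside printable ASCII (space through tilde)
  # and replace it with a decimal character reference such as &#252;.
  # unpack("U") decodes the first UTF-8 character of the match into
  # its code point; each match is exactly one character.
  str.gsub(/[^ -~]/u) { |ch| '&#%d;' % ch.unpack("U")[0] }
end

text = "diese können gegen Zündhölzer getauscht werden."
puts numeric_entities(text)
```

Two things to keep in mind: the character class also matches tabs and newlines, so those get turned into references as well, and the ASCII characters &, < and > are left untouched, so they still need the usual XML escaping separately.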

HTH.