Problems making UTF-8 text XML/XHTML friendly (no entity conversion?)

Thomas_Stromberg · 31 May 2004 18:03

I’m downloading information from a website in UTF-8, but I want to
place the text I receive into XML and XHTML output files. In a
nutshell, if I run into a u with an umlaut on top, like ü, I would
like Ruby to replace it with &uuml; or &#252; or &#xFC;.

I made some headway trying to use unpack("U") when I ran into high-ASCII
characters, but it did not seem to handle my test case very well.
REXML’s normalize function did not help either. Here is the test case I
am trying to handle, from http://toadstool.se/temp/utf (you may want to
fetch the file instead of relying on what’s pasted below):

Es befinden sich 3 Streichholzschachteln im Cache, diese können gegen
Zündhölzer aus aller Welt getauscht werden.

Does anyone here have any recipes for replacing all non-ASCII UTF-8
characters with entities? If so, I would really appreciate the help. Thanks!

/ Thomas

Carlos · 31 May 2004 18:42

I thought of this:

str.gsub(/[^ -~]/u) { |match| "&##{match.unpack("U")[0]};" }

but I fear it will be very slow…

(But with your test string it was very quick ;))

HTH.
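
For anyone who wants to try the idea, here is a minimal sketch of the gsub approach wrapped in a helper. The name to_numeric_entities is made up for illustration, and the character class is narrowed to non-ASCII characters, which is an assumption about what is wanted: it lets tabs and newlines pass through untouched instead of turning them into references. It also assumes the input string is tagged as UTF-8.

```ruby
# Hypothetical helper (not part of Ruby or REXML): replaces every
# non-ASCII character in a UTF-8 string with a decimal character
# reference such as &#252;.
def to_numeric_entities(str)
  # [^\x00-\x7F] matches any character outside the ASCII range.
  # unpack("U") returns the Unicode code point of the matched character.
  # In the replacement, "&##{...}" is a literal "&#" followed by the
  # interpolated code point.
  str.gsub(/[^\x00-\x7F]/) { |ch| "&##{ch.unpack("U")[0]};" }
end

# "Z\u00FCndh\u00F6lzer" is "Zündhölzer" from the test string.
puts to_numeric_entities("Z\u00FCndh\u00F6lzer")  # prints Z&#252;ndh&#246;lzer
```

Note that this only handles non-ASCII characters. Characters such as & and < still need the usual XML escaping, which is a separate step.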
