Hi,
In <fec82c6b596b842bf6731e87991ed3cc@ruby-forum.com>
"Re: REXML & HTMLentities incorrectly map to UTF-8" on Mon, 5 Nov 2012 11:18:33 +0900,
Kouhei Sutou wrote in post #1082578:
Could you show me a sample Ruby code?
If I can reproduce your problem with the code on my machine, I
will fix the problem and the fix will be shipped in Ruby 2.0.0.
Here is some code to produce the problem, plus the input xml and the
output xml that I got when running the code. If you view the output in
an editor that shows hex code, you'll see that the apostrophe in
"fund's" becomes transliterated to the byte sequence C2 92 -- which is just an
unused control code.
The entity code used for the apostrophe is &#146; which my O'Reilly HTML
book indicates should indeed be rendered as an apostrophe.
Thanks for providing sample code.
First, "&#146;" should be handled as U+0092 in XML.
See also:
Extensible Markup Language (XML) 1.0 (Fifth Edition)
If the character reference begins with " &#x ", the digits
and letters up to the terminating ; provide a hexadecimal
representation of the character's code point in ISO/IEC
10646. If it begins just with " &# ", the digits up to the
terminating ; provide a decimal representation of the
character's code point.
Your reference is the "&#" case, so 146 is read as decimal,
which is 0x92 in hexadecimal. So &#146; is U+0092
in XML.
(Note that XML is not HTML.)
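The rule quoted above can be sketched in plain Ruby. This is only an
illustration, not how REXML implements it: decode_char_ref is a
hypothetical helper name, and it handles a single numeric character
reference and nothing else. It shows that "&#146;" and "&#x92;" name the
same code point, and that U+0092 is written as the bytes C2 92 in UTF-8,
which matches what a hex editor shows.

```ruby
# Hypothetical helper (not part of REXML): decode one XML numeric
# character reference into a one-character UTF-8 string.
def decode_char_ref(ref)
  case ref
  when /\A&#x([0-9A-Fa-f]+);\z/
    # "&#x...;" form: the digits are hexadecimal.
    [$1.to_i(16)].pack("U")
  when /\A&#([0-9]+);\z/
    # "&#...;" form: the digits are decimal.
    [$1.to_i(10)].pack("U")
  else
    raise ArgumentError, "not a numeric character reference: #{ref.inspect}"
  end
end

char = decode_char_ref("&#146;")

code_point     = char.ord                          # decimal code point
hex_code_point = format("U+%04X", code_point)      # same value, hex notation
utf8_bytes     = char.bytes.map { |b| format("%02X", b) }.join(" ")
same_as_hex    = decode_char_ref("&#x92;") == char # decimal 146 vs. hex 92

puts hex_code_point
puts utf8_bytes
puts same_as_hex
```

The C2 92 bytes seen in the output file are therefore the UTF-8 form of
U+0092, not a separate code point.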
But the problem is even worse. It turns out that if there is any HTML
tagging inside of the CDATA ... REXML deletes the data! Sometimes it
even hangs up with a "tree parsing error" (not exact text) with no
indication of which source tag is causing the problem. (Sorry, I can't provide
that sample input since it's 15 MB of semi-private data.)
I can't reproduce your problem with the following script:
require "rexml/document"
document = REXML::Document.new(<<-EOX)
<notebook>
<note><![CDATA[<html>tag</html>]]></note>
</notebook>
EOX
note = document.elements["/notebook/note"]
cdata = note[0]
p cdata
# => "<html>tag</html>"
The output still includes the HTML tags inside the CDATA section.
Thanks,
"Mark S." <lists@ruby-forum.com> wrote:
--
kou