As you can see, the regular expression incorrectly captures not only
the text part but also the closing tag, whereas what is supposed to be
captured is just "\302".
Any ideas why the pattern matching doesn't work? I don't see anything
wrong with the regular expression, although I'm not sure what the \A
is for.
The part of the XML that is causing the problem is the <username>
tag, which Ruby represents with octal escapes as:
"<username>(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
\342\202\252 Emirates Wizard \342\202\252\342\231
\342\204\242\302\267\302</username>"
Note that the XML says that the contents are in UTF-8. So when I use
REXML to process this XML, after it processes the opening tag
"<username>", it is left with
string = "(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
\342\202\252 Emirates Wizard \342\202\252\342\231
\342\204\242\302\267\302</username>"
I just checked, and if I match this string against
TEXT_PATTERN = /\A([^<]*)/um, I get the text and also the closing tag:
TEXT_PATTERN = /\A([^<]*)/um
text_data = string.match(TEXT_PATTERN).to_s
=> "(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
\342\202\252 Emirates Wizard \342\202\252\342\231
\342\204\242\302\267\302</username>"
Assuming that the XML has some malformed data (some bytes are not
actually valid UTF-8), is there any way that I can process the XML as
it is and only treat the malformed data differently? (E.g. you
mentioned that the \302 character is not a UTF-8 character.)
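
One possible way to treat only the malformed bytes differently is sketched below. It assumes a newer Ruby than the one used above (String#valid_encoding? needs 1.9 or later, String#scrub needs 2.1 or later), and the sample string is a shortened, hypothetical stand-in for the real <username> content, written with hex escapes (\xC2 is the same byte as octal \302). This is not something REXML does for you; the cleanup would have to happen before the data is handed to the parser.

```ruby
# Hypothetical, shortened sample: valid UTF-8 text followed by a stray
# lead byte (\xC2, i.e. octal \302) right before the closing tag.
raw = "Emirates Wizard \xC2\xB7\xC2</username>".dup.force_encoding("UTF-8")

# The trailing \xC2 has no continuation byte, so the string is not valid UTF-8.
needs_cleanup = !raw.valid_encoding?

# Replace each invalid byte sequence with "?" and keep everything else as is.
cleaned = needs_cleanup ? raw.scrub("?") : raw

# With the stray byte replaced, the pattern stops at "<" as intended.
text = cleaned[/\A[^<]*/]
puts text
```

The replacement character "?" is an arbitrary choice; scrub also accepts a block if the bad bytes should be logged or mapped to something else.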
Best regards,
Jesse
On Jan 5, 10:45 pm, Tiziano Merzi <giua...@gmail.com> wrote: