Regexp for matching UTF-8 characters without close tag

Jesse_P · 5 January 2008 09:24

Hi all,

I'm trying to solve this problem:

string = "\302"
TEXT_PATTERN = /\A([^<]*)/um
text_data = string.match(TEXT_PATTERN).to_s
=> "\302"

As you can see, the regular expression incorrectly captures not only
the text part but also the closing tag, whereas what is supposed to be
captured is just "\302".

This problem is actually part of the REXML::Source#match method
(http://www.germane-software.com/projects/rexml/browser/trunk/src/
rexml/source.rb?rev=1266#L104) and causes REXML to parse UTF-8
documents incorrectly sometimes.

Any ideas why the pattern matching doesn't work? I don't see anything
wrong with the regular expression, although I'm not sure what the \A
character class is for.

Best regards,

Jesse

Tiziano_Merzi · 5 January 2008 14:45

Jesse P. wrote:

Hi all,

I'm trying to solve this problem:

string = "\302"
TEXT_PATTERN = /\A([^<]*)/um
text_data = string.match(TEXT_PATTERN).to_s
=> "\302"

A solution may be:

require 'iconv'
string = "\302" # a lone \302 is not valid UTF-8; read as ISO-8859-1 it is "Â", which in UTF-8 is \303\202
string = Iconv.conv("UTF-8","ISO-8859-1",string)
TEXT_PATTERN = /\A([^<]*)/um
text_data = string.match(TEXT_PATTERN).to_s
text_data = Iconv.conv("ISO-8859-1","UTF-8",text_data)

puts text_data

\A -> beginning of the string (an anchor, not a character class)
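
To make the note on \A concrete, here is a minimal sketch of how it differs from ^ in Ruby. The sample text and variable names are made up for illustration and are not from the thread.

```ruby
# Illustrative two-line string (not from the original post).
text = "first line\nsecond line"

# ^ matches at the start of any line, so it finds "second" on line two.
caret_match = text[/^second/]

# \A matches only at the very start of the string, so this finds nothing.
anchor_match = text[/\Asecond/]

# Anchored at the start of the string, this captures the first line only.
start_match = text[/\A[^\n]*/]

puts caret_match.inspect
puts anchor_match.inspect
puts start_match.inspect
```

So in TEXT_PATTERN the \A only pins the match to the start of the remaining input; it does not affect which characters [^<]* accepts.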


--
Posted via http://www.ruby-forum.com/.

Jesse_P · 5 January 2008 15:39

Hi Tiziano,

My apologies. It seems that I have oversimplified the problem due to my
lack of understanding of UTF-8.

The actual string is an xml file I obtained from flickr at
http://api.flickr.com/services/rest/?method=flickr.people.getInfo&api_key=44dfd94b104d544f8f80b521a70429e3&user_id=55669962%40N00&api_sig=6a39aab2fb665e24d2b6e1cef9d0be27:
An excerpt is as follows:
<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
<person id="55669962@N00" nsid="55669962@N00" isadmin="0" ispro="0"
iconserver="136" iconfarm="1">
<username>(_.Â·Â´Â¯`Â·â"¢â(tm) â'ª Emirates Wizard â'ªâ(tm) â"¢Â·Â</username>
<realname />
 <mbox_sha1sum>0b88a178b28c40ff81d44c5ae475438abec2009c</mbox_sha1sum>
 <location />
 <photosurl>http://www.flickr.com/photos/emirates_wizard/</photosurl>
 <profileurl>http://www.flickr.com/people/emirates_wizard/</profileurl>
<mobileurl>http://m.flickr.com/photostream.gne?id=5467956</mobileurl>
 <photos>
 <firstdatetaken>2006-07-16 15:22:42</firstdatetaken>
 <firstdate>1162548449</firstdate>
 <count>36</count>
 </photos>
</person>
</rsp>

The part of the XML that is causing the problem is the <username>
tag, which as a Ruby string literal with octal escapes is:
"<username>(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
\342\202\252 Emirates Wizard \342\202\252\342\231
\342\204\242\302\267\302</username>"

Note that the XML declares its contents to be UTF-8. So when I use
REXML to process this XML, after it processes the tag
"<username>", it is left with
string = "(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
\342\202\252 Emirates Wizard \342\202\252\342\231
\342\204\242\302\267\302</username>"

I just checked, and if I match this string with
TEXT_PATTERN = /\A([^<]*)/um, I get the text and also the close tag.
TEXT_PATTERN = /\A([^<]*)/um
text_data = string.match(TEXT_PATTERN).to_s
=> "(_.\302\267\302\264\302\257`\302\267\342\204\242\342\231
\342\202\252 Emirates Wizard \342\202\252\342\231
\342\204\242\302\267\302</username>"

Assuming that the XML has some malformed data (some bytes are not
actually valid UTF-8), is there any way that I can process the XML as it
is and only treat the malformed data differently? (E.g. you mentioned
that the \302 character is not a UTF-8 character.)

Best regards,

Jesse


Tiziano_Merzi · 5 January 2008 19:56

Jesse P. wrote:

Hi Tiziano,

My apologies. It seems that I have oversimplified the problem due to my
lack of understanding of UTF-8.

The actual string is an xml file I obtained from flickr at
http://api.flickr.com/services/rest/?method=flickr.people.getInfo&api_key=44dfd94b104d544f8f80b521a70429e3&user_id=55669962%40N00&api_sig=6a39aab2fb665e24d2b6e1cef9d0be27:

You have a broken UTF-8 document (I read Matz's reply).

I see two solutions:

a)

require 'rexml/document'
require 'iconv'

data = your_xml_document # placeholder: the XML text as a String
data = Iconv.conv("UTF-8","ISO-8859-1",data)
doc = REXML::Document.new(data)

You must convert the data taken from doc from UTF-8 back to ISO-8859-1 before using it:

username = Iconv.conv("ISO-8859-1","UTF-8",doc. )

b) change the encoding of the xml

require 'rexml/document'

data = your_xml_document # placeholder: the XML text as a String
data = data.gsub(/encoding="utf-8"/i, 'encoding="iso-8859-1"')
doc = REXML::Document.new(data)

Note that neither a) nor b) works if the document also contains valid
UTF-8 characters outside 7-bit ASCII (for example accented Latin letters
such as è, ò, etc.).
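
The caveat can be illustrated with a small sketch. It uses String#encode from Ruby 1.9 and later rather than Iconv (which newer Rubies no longer ship), and the helper names `latin1_to_utf8` and `utf8_to_latin1` are hypothetical stand-ins for the two Iconv.conv calls in a).

```ruby
# Hypothetical stand-in for Iconv.conv("UTF-8", "ISO-8859-1", bytes):
# treat every byte as one ISO-8859-1 character and re-encode as UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Hypothetical stand-in for Iconv.conv("ISO-8859-1", "UTF-8", text):
# convert back and return the raw bytes.
def utf8_to_latin1(text)
  text.encode(Encoding::ISO_8859_1).b
end

lone_byte = "\302".b      # the malformed byte from the thread
e_grave   = "\u00E8".b    # valid UTF-8 bytes for "è" (two bytes)

# The lone byte becomes "Â", whose UTF-8 form is \303\202.
repaired = latin1_to_utf8(lone_byte)

# The valid "è" is split into two unrelated characters inside the document.
mangled = latin1_to_utf8(e_grave)

# Converting back restores the original bytes, but not a decoded "è".
restored = utf8_to_latin1(mangled)

puts repaired.bytes.inspect
puts mangled.length
puts restored == e_grave
```

In other words, the parser sees each byte of a valid multi-byte character as a separate Latin-1 character, which is why the workaround is only safe when the non-ASCII content is the broken part.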


Jesse_P · 6 January 2008 13:30

Thanks Tiziano

