Ruby, Unicode, and HTML Entities Problem

Mr_Peepers · 26 September 2010 15:54

Hi all,

I'm using Ruby (and REXML) to parse a directory full of HTML files that
are mostly English with some Spanish mixed in.

For the most part, Ruby can correctly parse the HTML. That's all fine
and dandy, BUT, when there's a unicode character NEAR an HTML entity,
the parser bombs with:

Missing end tag for 'em' (got "p")
Or
Missing end tag for 'em' (got "em")
Or
Missing end tag for 'em' (got "body")
etc, etc., etc...

The errors all seem to depend on how/where the unicode characters are
placed throughout the documents.

Here's an example of what the parser does NOT like and throws errors
like what I posted above:
Ministerio Público de la Federación

But just for fun, I changed the string to be this:
Ministerio Público de la Federaciónnn

...and it loads/parses, no issue.

What I have noticed is that when there's a unicode character SO CLOSE to
a start or end tag (<em>, </em>, etc.), the parser bombs. But if I move
the unicode character FURTHER inside the tags, the parser loads the data
without error.

What's going on?

I can't really post all the code due to company privacy. But here's a
list of the requires, if it helps at all:
require 'rubygems'
require 'sqlite3'
require 'rexml/document'
require 'rexml/streamlistener'
include REXML
require 'zlib'
require 'cgi'
require 'osx/cocoa'
include OSX

And here's some of the parsing, if it helps at all. I know, I know,
probably not...:
@title << CGI::unescapeHTML(content.strip)
title.gsub('&amp;','&').gsub('&lt;','<').gsub('&gt;','>').gsub('&apos;','\'').gsub('&quot;','"').gsub('&sect;','§')

@body << " #{attr_name}=\"#{attr_value}\""
@body << "</#{name}>"

--
Posted via http://www.ruby-forum.com/.

Markus_Fischer · 26 September 2010 18:51

No one wants your company data, but above you pasted a problematic
snippet. Based on this, can't you create a minimal test case from it?

I was having similar problems with Nokogiri until I figured I need to
set encoding manually because the automatic detection failed for some
reason. But it's premature to suggest anything without a smaller/real
test case.
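
As a rough sketch of what "setting the encoding manually" can look like in
plain Ruby (1.9 or later): read the raw bytes, label them with the encoding
the file is assumed to be in, and transcode to UTF-8 before any parser sees
them. The ISO-8859-1 guess and the read_as_utf8 helper name are assumptions
for illustration only; nothing in this thread confirms the real encoding of
the files.

```ruby
require 'tempfile'

# Hypothetical helper: read a file whose bytes are ASSUMED to be in
# source_encoding and return the same text as a UTF-8 string.
def read_as_utf8(path, source_encoding = Encoding::ISO_8859_1)
  raw = File.binread(path)             # raw bytes, tagged ASCII-8BIT
  raw.force_encoding(source_encoding)  # relabel only, no bytes change
  raw.encode(Encoding::UTF_8)          # transcode to UTF-8
end

# Demo with a throwaway file holding Latin-1 bytes
# (0xFA = u-acute, 0xF3 = o-acute).
Tempfile.create('latin1') do |f|
  f.binmode
  f.write("<em>Ministerio P\xFAblico de la Federaci\xF3n</em>".b)
  f.flush
  text = read_as_utf8(f.path)
  puts text.valid_encoding?
  puts text
end
```

If the guess about the source encoding is wrong, the result is still valid
UTF-8 but shows the wrong characters, so it is worth eyeballing the output.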

- Markus

On 26.09.2010 17:54, Mr Peepers wrote:

Here's an example of what the parser does NOT like and throws errors
like what I posted above:
Ministerio Público de la Federación

But just for fun, I changed the string to be this:
Ministerio Público de la Federaciónnn

...and it loads/parses, no issue.

I can't really post all the code due to company privacy.

Mr_Peepers · 26 September 2010 19:08

Ok, get the Ruby and a sample of the data attempting to be loaded here,
www.khourys.com/Archive.rar.

Regarding 150340.html, as just an example.

I'll walk you through my troubleshooting quickly.

1) If I replace this text (where it fails):
Ministerio Público de la Federación

with this:
Ministerio Público de la Federaciónnn

FILE LOADS!

2) So I'm thinking, did this one o-acute get corrupted? Nope. But, if
you remove this sentence:
Whenever the Federal Public Ministry (Ministerio Público de la
Federación) investigates the activities of organized crime members
that deal with goods of illicit origin, the investigation is carried out
with the assistance of the Secretariat of Finance and Public Credit
(Secretaría de la Hacienda y Crédito Público).

FILE LOADS!

3) I replaced this:
<em>Ministerio Público de la Federación</em>
with this:
Ministerio Público de la Federación
(removed the <em>'s)

FILE LOADS!

Now I'm thinking, hhhmmmm... (actually, WTF) Could the unicode
character so close to the end HTML tag be causing the issue,
potentially?

4) To test my theory, I'll modify these lines together in the file:

Before:
Ministerio Público de la Federación
Secretaría de la Hacienda y Crédito Público

After
Ministerio Público de la Federaciónnnn <-- this to prove it's
not JUST this line
Secretaría de la Hacienda y Crédito Pú <-- this to prove that
when a unicode character is so close to an HTML entity - we're screwed.

FILE BOMBED!

Think we have the culprit?

Markus_Fischer · 26 September 2010 20:10

Hi,

On 26.09.2010 21:08, Mr Peepers wrote:
> Ok, get the Ruby and a sample of the data attempting to be loaded here,
> www.khourys.com/Archive.rar.
>
> Regarding 150340.html, as just an example.

Not exactly a small test case, the 12 KB load.rb. I was thinking more
along the lines of you providing a simple script that reproduces the
error.

Anyway, I simply fired up REXML to do a basic thing, but it bombs
immediately. Disclaimer: I have never used REXML before:

$ ruby -rrexml/document -e 'REXML::Document.new File.new("150340.html")'
/home/mfischer/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/rexml/parseexception.rb:31:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
 from /home/mfischer/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/rexml/parseexception.rb:31:in `to_s'
 from /home/mfischer/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:95:in `message'
 from /home/mfischer/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:95:in `rescue in parse'
 from /home/mfischer/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/rexml/parsers/treeparser.rb:20:in `parse'
 from /home/mfischer/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/rexml/document.rb:230:in `build'
 from /home/mfischer/.rvm/rubies/ruby-1.9.2-p0/lib/ruby/1.9.1/rexml/document.rb:43:in `initialize'
 from -e:1:in `new'
 from -e:1:in `<main>'
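
The "invalid byte sequence in UTF-8" message above suggests looking at the
raw bytes before blaming the parser. A minimal sketch of how the offending
bytes could be located with core String methods; invalid_utf8_offsets is a
hypothetical helper name, and the sample string only imitates the text from
this thread under the assumption that it was saved as Latin-1.

```ruby
# Hypothetical helper: byte offsets at which a byte string is not valid UTF-8.
def invalid_utf8_offsets(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  offsets = []
  pos = 0
  text.each_char do |ch|
    # Broken bytes come through each_char as "characters" that fail validation.
    offsets << pos unless ch.valid_encoding?
    pos += ch.bytesize
  end
  offsets
end

# Sample with Latin-1 bytes for u-acute (0xFA) and o-acute (0xF3). For a real
# file, File.binread(path) would supply the bytes instead.
sample = "<em>Ministerio P\xFAblico de la Federaci\xF3n</em>".b

invalid_utf8_offsets(sample).each do |offset|
  puts format('offset %d: byte 0x%02X', offset, sample.getbyte(offset))
end
```

An empty result means the bytes are already valid UTF-8 and the problem lies
elsewhere.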

Anyway, the same using Nokogiri works for me:

$ ruby -rnokogiri -e 'puts Nokogiri::HTML( File.open("150340.html") )/"//p/em[3]"'
Ministerio Público de la Federación

Does that help you? If not, you should provide your *small* rexml test
case which bombs.

HTH,
- Markus

Mr_Peepers · 26 September 2010 20:14

Not really... Think we figured out the problem. UTF-8 is a
variable-length encoding. We seem to have a byte in there that announces
a multi-byte sequence, but the continuation bytes that should follow
aren't there, so the parser pukes. Somewhere along the line the encoding
of the actual file got messed up.

Need to close this topic.
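
One way to read that diagnosis, sketched under the assumption that the files
are really ISO-8859-1 (Latin-1) but were being decoded as UTF-8: in Latin-1,
o-acute is the single byte 0xF3 and u-acute is 0xFA. In the original UTF-8
bit layout, a lead byte of the form 11110xxx announces a 4-byte sequence and
111110xx a 5-byte one (the latter is no longer legal UTF-8). A lenient
decoder that trusts the lead byte would swallow the next three or four bytes,
which would be consistent with "Federaciónnn" loading while an accented
letter right before a closing tag does not. Whether REXML's decoder behaved
this way is not confirmed anywhere in this thread, and announced_length is a
hypothetical helper.

```ruby
# Hypothetical helper: how many bytes a UTF-8 lead byte claims its sequence
# has (original 1-to-6-byte layout; modern UTF-8 allows at most 4).
def announced_length(byte)
  case byte
  when 0x00..0x7F then 1   # ASCII
  when 0x80..0xBF then 0   # continuation byte, not a lead byte
  when 0xC0..0xDF then 2
  when 0xE0..0xEF then 3
  when 0xF0..0xF7 then 4
  when 0xF8..0xFB then 5
  when 0xFC..0xFD then 6
  else 0                   # 0xFE and 0xFF never appear in UTF-8
  end
end

# Latin-1 byte values for the accented letters in the sample text.
latin1 = { 'i-acute' => 0xED, 'e-acute' => 0xE9, 'o-acute' => 0xF3, 'u-acute' => 0xFA }

latin1.each do |name, byte|
  n = announced_length(byte)
  puts format('%-8s 0x%02X claims %d bytes, so %d following byte(s) are at risk',
              name, byte, n, n - 1)
end

# What a decoder that trusts the lead byte would swallow after 0xF3:
tail = "\xF3</em>".b
puts tail.byteslice(1, announced_length(tail.getbyte(0)) - 1)
```

Under this reading, the fix is to decode the files with their real encoding
(or re-save them as UTF-8) rather than to pad the text.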
