Hi all,
I'm using Ruby (with REXML) to parse a directory full of HTML files that
are mostly English with some Spanish mixed in.
For the most part, Ruby parses the HTML correctly. That's all fine
and dandy, BUT when there's a Unicode character NEAR an HTML entity,
the parser bombs with:
Missing end tag for 'em' (got "p")
Or
Missing end tag for 'em' (got "em")
Or
Missing end tag for 'em' (got "body")
etc.
Which error I get seems to depend on how and where the Unicode characters
are placed throughout the documents.
Here's an example of markup the parser does NOT like; it throws errors
like the ones I posted above:
<em>Ministerio Público de la Federación</em>
But just for fun, I changed the string to be this:
<em>Ministerio Público de la Federaciónnn</em>
...and it loads/parses, no issue.
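For anyone who wants to try this, here is a stripped-down check that is separate from my real code. The try_parse helper is made up for this post, and I'm assuming the script is saved as UTF-8. It only reports whether REXML accepts each string, so results may differ depending on your Ruby/REXML version:

```ruby
require 'rexml/document'

# Hypothetical helper (not from my real code): returns the root element's
# text if REXML accepts the markup, or the first line of the parser's
# error message if it does not.
def try_parse(markup)
  REXML::Document.new(markup).root.text
rescue REXML::ParseException => e
  "parse error: #{e.message.lines.first.to_s.strip}"
end

samples = [
  "<em>Ministerio Público de la Federación</em>",
  "<em>Ministerio Público de la Federaciónnn</em>"
]

# Print each sample next to what the parser made of it.
samples.each do |markup|
  puts "#{markup.inspect} => #{try_parse(markup).inspect}"
end
```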
What I have noticed is that when there's a Unicode character SO CLOSE to
an opening or closing tag (<em>, </em>, etc.), the parser bombs. But if
I move the Unicode character FURTHER inside the tags, the parser loads
the data without error.
What's going on?
I can't really post all the code due to company privacy, but here's a
list of the requires, if that helps at all:
require 'rubygems'
require 'sqlite3'
require 'rexml/document'
require 'rexml/streamlistener'
include REXML
require 'zlib'
require 'cgi'
require 'osx/cocoa'
include OSX
And here's some of the parsing code, in case it helps (I know, I know,
probably not...):
@title << CGI::unescapeHTML(content.strip)
title.gsub('&amp;','&').gsub('&lt;','<').gsub('&gt;','>').gsub('&apos;','\'').gsub('&quot;','"').gsub('&sect;','§')
@body << " #{attr_name}=\"#{attr_value}\""
@body << "</#{name}>"
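As a side note on the entity replacement: a chain of gsub calls that handles &amp; first can decode text twice (for example, &amp;lt; becomes &lt; and then <). A single-pass version might look like the sketch below. ENTITY_MAP and decode_entities are names invented for this post, and the list of entities is an assumption, not my full table:

```ruby
# Assumed entity table, for illustration only; extend as needed.
ENTITY_MAP = {
  '&amp;'  => '&',
  '&lt;'   => '<',
  '&gt;'   => '>',
  '&apos;' => "'",
  '&quot;' => '"',
  '&sect;' => '§'
}.freeze

# One regular expression that matches any of the known entities.
ENTITY_PATTERN = Regexp.union(ENTITY_MAP.keys)

# Hypothetical helper: replaces every known entity in one pass, so the
# output of one replacement is never scanned again by a later one.
def decode_entities(text)
  text.gsub(ENTITY_PATTERN, ENTITY_MAP)
end

puts decode_entities('Art. 5 &sect; 2 &amp; &lt;em&gt;')
```

Because the hash form of gsub looks up each match directly, the order of the entries in the table does not matter.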
···
--
Posted via http://www.ruby-forum.com/.