Anybody have experience with a decent HTML parser for a Ruby application? I've looked around, and so far everything I've found is either unfinished, unstable, [relatively] undocumented, or just plain ugly in terms of API.
I'd like a parser that can take a partial HTML file and return an easily-traversable data structure, in the same order that the elements appear in the file. I don't want or need a callback mechanism, only something I can iterate and tree-search. Though I don't hold much hope it will work, I will try using REXML on my text and see what it produces...results to be posted here. Thanks in advance!
To help us narrow things down, can you tell us which Ruby HTML mods
you've tried out?
Ken
I'd also mention Facets' tagiterator.rb (from Ⴗnyasu's tagiter.rb). For
example:
a = TagIterator.new(stext)
a.first("body") do |y|
  y.nth("dl",2) do |dl|
    dl.enumtag("dt") do |t|
      puts t.text.strip
    end
  end
end
It's a bit slow, but it is quite robust to bad HTML.
···
On Tue, 11 Jul 2006 13:52:13 -0600, Kevin Weller wrote:
Anybody have experience with a decent HTML parser for a Ruby
application? I've looked around, and so far everything I've found is
either unfinished, unstable, [relatively] undocumented, or just plain
ugly in terms of API.
I'd like a parser that can take a partial HTML file and return an
easily-traversable data structure, in the same order that the elements
appear in the file. I don't want or need a callback mechanism, only
something I can iterate and tree-search. Though I don't hold much hope
it will work, I will try using REXML on my text and see what it
produces...results to be posted here. Thanks in advance!
I settled on using Tidy to clean up the HTML, then parsing it into a
tree using the HTML scanner that comes with Rails.
Tidy does all the hard stuff of dealing with bad HTML and straightening
it up. The HTML scanner is very lightweight and has a simple, clean
API. You don't need to run Rails, just require the scanner library
(look for html/document.rb).
It's two passes, but with Tidy being native C code and the HTML scanner
doing no cleanup, it's amazingly fast. I'm processing around 500Kb/s
(mobile Core Duo, 1.8GHz).
You can walk the DOM, or use XPath-like finders, or my preferred method
of looking up content: using CSS selectors.
If you find the Ruby options don't cut it, then I recommend TagSoup,
which is implemented in Java but "keeps on trucking" to convert HTML
found in the wild into well-formed XML files: http://home.ccil.org/~cowan/XML/tagsoup/
I use it when screenscraping, then load the resulting XML files with REXML.
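For anyone following along, the REXML half of that workflow might look roughly like the sketch below. It assumes TagSoup has already been run and produced well-formed XML. The cleaned_xml string is a made-up stand-in for that output (real TagSoup output may also carry an XHTML namespace declaration, which this sample leaves out), dt_terms is a hypothetical helper name, and nothing here calls TagSoup itself.

```ruby
require 'rexml/document'

# Return the text of every <dt> in the XML, in document order.
def dt_terms(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//dt').map { |dt| dt.text.strip }
end

# Hypothetical stand-in for a file TagSoup has already cleaned up.
cleaned_xml = <<~XML
  <html>
    <body>
      <dl>
        <dt>First term</dt><dd>First definition</dd>
        <dt>Second term</dt><dd>Second definition</dd>
      </dl>
    </body>
  </html>
XML

puts dt_terms(cleaned_xml)
```

Since the cleanup has already happened, REXML only ever sees well-formed input here, which is the one situation it handles comfortably.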
I've since tried out a couple, and the option that seems to work best so far is Ned Konz' ruby-htmltools. Unfortunately, it does not seem to parse partial HTML documents well, so I've had to resort to parsing the whole thing, extracting a REXML document object, then using XPath to get to the content I care about. Seems like a waste of processing power when I can get the necessary markup in text with a simple file.grep operation and (theoretically) parse only the text that I want, but at least I have something that works until/unless something better comes along. Any recommendations?
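A minimal sketch of that grep-then-parse idea, using only REXML. The page text, the id="prices" marker, the parse_matching_line helper, and the <fragment> wrapper element are all invented for illustration. It only works when the grepped line is itself well-formed; REXML raises a ParseException on unbalanced markup, as the last few lines show.

```ruby
require 'rexml/document'

# Grep the first line matching `pattern`, then parse only that line.
# The <fragment> wrapper is a made-up root so REXML accepts sibling elements.
def parse_matching_line(html, pattern)
  line = html.lines.grep(pattern).first
  REXML::Document.new("<fragment>#{line}</fragment>")
end

# Hypothetical page text; only the line holding the list is of interest.
html = <<~HTML
  <html><body>
  <p>Unclosed paragraph, <b>stray bold
  <ul id="prices"><li>Tea: 2</li><li>Coffee: 3</li></ul>
  </body>
HTML

doc = parse_matching_line(html, /id="prices"/)
puts doc.get_elements('//li').map(&:text)

# A line with unbalanced tags makes REXML raise instead of guessing.
begin
  parse_matching_line(html, /Unclosed/)
rescue REXML::ParseException => e
  puts "REXML gave up: #{e.class}"
end
```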
Kenosis wrote:
···
To help us narrow things down, can you tell us which Ruby HTML mods
you've tried out?
Ken
On Tue, 11 Jul 2006 13:52:13 -0600, Kevin Weller wrote:
Anybody have experience with a decent HTML parser for a Ruby application? I've looked around, and so far everything I've found is either unfinished, unstable, [relatively] undocumented, or just plain ugly in terms of API.
--- snip ---
It's a bit slow, but it is quite robust to bad HTML.
Ooooh, thanks, that might be just what the doctor ordered...especially if it handles a single text line of an HTML document. Right now I have a temporary solution that involves using ruby-htmltools to parse the entire document, then finding the part that I want with an XPath query. However, Rubyful Soup might turn out to be a better performer if it does what I want. Thanks so much!
That sounds like a very clean solution for future needs. I dig into Rails NEXT week for an upcoming Rails-based project (this one isn't)...nice to know what my options will be then. Thanks for the info!
···
assaf.arkin@gmail.com wrote:
Kevin,
I settled on using Tidy to clean up the HTML, then parsing it into a
tree using the HTML scanner that comes with Rails.
Thanks Seth! Hpricot looks a little young, but promising. It apparently parses incomplete HTML documents, don't know how it handles imbalanced tags. I'll check it out...thanks again...
I'll second Rubyful Soup. It's not the fastest, but it tolerates bad
HTML. I used it for an intranet spider, it worked with anything I
could find.
-- Phillip Hutchings http://www.sitharus.com/
>> pp Hpricot("<p><b>A test of <i>unbalanced</b> tags.</i>")
=> #<Hpricot::Doc
{elem
<p>
{elem <b> {text "A test of "} {elem <i> {text "unbalanced"}} </b>}
{text " tags."}
{bogusetag </i>}}>
Hpricot will attempt to match tags, keeping around the unmatched end tags as
BogusEtag objects (as HTree does.) When you output XHTML, these bogus tags are
ignored and you'll get back XHTML (though some work still has to be done to
be sure it's truly valid XHTML.)
Thankfully, I've been getting some really mangled HTML from interested people
sent to my inbox. So today's Hpricot has become much better at fixing all
manner of nutty tags. If you encounter anything Hpricot can't do, send me an
e-mail with the URL you're trying to parse and the results you expect.
_why
···
On Fri, Jul 14, 2006 at 12:30:04AM +0900, Kevin Weller wrote:
Thanks Seth! Hpricot looks a little young, but promising. It
apparently parses incomplete HTML documents, don't know how it handles
imbalanced tags. I'll check it out...thanks again...
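To make the tag-matching behaviour described in this thread concrete (end tags are matched against the open elements, and an end tag with no open partner is kept as a bogus end tag rather than dropped), here is a toy sketch in plain Ruby. It is not Hpricot's implementation or API: Node, parse_tags, and the :bogus_etag marker are hypothetical names, and the regex simply skips over attributes and knows nothing about comments, entities, or anything else real HTML needs.

```ruby
# Toy tag matcher illustrating the general idea only.
Node = Struct.new(:name, :children)

def parse_tags(html)
  root  = Node.new(:root, [])
  stack = [root]                         # elements that are currently open
  html.scan(/<(\/?)(\w+)[^>]*>|([^<]+)/) do |slash, name, text|
    if text
      stack.last.children << text        # plain text goes into the open element
    elsif slash.empty?
      node = Node.new(name, [])          # start tag: open a new element
      stack.last.children << node
      stack.push(node)
    elsif (i = stack.rindex { |n| n.name == name })
      stack.slice!(i..-1)                # end tag: close it and anything left open inside it
    else
      stack.last.children << [:bogus_etag, name]  # no open partner: keep it, flagged
    end
  end
  root
end

tree   = parse_tags("<p><b>A test of <i>unbalanced</b> tags.</i>")
p_node = tree.children.first
p p_node.children.last   # prints [:bogus_etag, "i"]
```

On the unbalanced example from this thread, the </b> closes the still-open <i> along with the <b>, so the later </i> has nothing to match and ends up flagged inside the <p>, which mirrors the shape of the output shown above.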