Decent HTML Parser?

Kevin_Weller · 11 July 2006 19:55

Anybody have experience with a decent HTML parser for a Ruby application? I've looked around, and so far everything I've found is either unfinished, unstable, [relatively] undocumented, or just plain ugly in terms of API.

I'd like a parser that can take a partial HTML file and return an easily-traversable data structure, in the same order that the elements appear in the file. I don't want or need a callback mechanism, only something I can iterate and tree-search. Though I don't hold much hope it will work, I will try using REXML on my text and see what it produces...results to be posted here. Thanks in advance!

···

--
Kevin Weller
Information Technology Crucible
http://www.itcrucible.com

Bruno_Celeste1 · 11 July 2006 20:00

You can check rubyful soup library at

Kenosis · 11 July 2006 21:45

To help us narrow things down, can you tell us which Ruby HTML mods
you've tried out?

Ken

7rans · 11 July 2006 22:06

I'll also mention Facets' tagiterator.rb (from Ⴗnyasu's tagiter.rb). For
example:

  a = TagIterator.new(stext)
  a.first("body") do |y|
    y.nth("dl",2) do |dl|
      dl.enumtag("dt") do |t|
        puts t.text.strip
      end
    end
  end

http://facets.rubyforge.org/api/more/classes/TagIterator.html

T.

Geoff_Davis · 11 July 2006 22:15

You might find Rubyful Soup useful:

It's a bit slow, but it is quite robust to bad HTML.

Assaf · 12 July 2006 08:10

Kevin,

I settled on using Tidy to clean up the HTML, then parsing it into a
tree using the HTML scanner that comes with Rails.

Tidy does all the hard stuff of dealing with bad HTML and straightening
it up. The HTML scanner is very lightweight and has a simple, clean
API. You don't need to run Rails, just require the scanner library
(look for html/document.rb).

It's two passes, but with Tidy being written in C and the HTML scanner
doing no cleanup, it's amazingly fast. I'm processing around 500Kb/s
(mobile Core Duo 1.8GHz).

You can walk the DOM, or use XPath-like finders, or my preferred method
of looking up content: using CSS selectors.

If you're doing HTML scraping this library will do all the hard work
for you:
http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/

Assaf

Seth_Thomas_Rasmusse · 12 July 2006 13:10

It was just released recently, but Hpricot might be worth a look:

http://redhanded.hobix.com/inspect/okayGiveHpricot02AGo.html

Rob · 13 July 2006 16:39

If you find the Ruby options don't cut it, then I recommend TagSoup
which is implemented in Java but "keeps on trucking" to convert HTML
found in the wild to well-formed XML files:
http://home.ccil.org/~cowan/XML/tagsoup/

I use it when screenscraping, then load the resulting XML files with REXML.

Rob

Alex_Young · 11 July 2006 20:10

Bruno Celeste wrote:

You can check rubyful soup library at
Rubyful Soup: "The brush has got entangled in it!"

... and if that doesn't help, Tidy + REXML does fine for me.

···

--
Alex

Phillip_Hutchings · 11 July 2006 21:51

I'll second Rubyful Soup. It's not the fastest, but it tolerates bad
HTML. I used it for an intranet spider, and it worked with anything I
could find.

···

On 7/12/06, Bruno Celeste <bruno.celeste@gmail.com> wrote:

You can check rubyful soup library at
Rubyful Soup: "The brush has got entangled in it!"

--
Phillip Hutchings
http://www.sitharus.com/

Kevin_Weller · 12 July 2006 00:50

Thanks for the reply. I've basically reviewed every potential match generated by:

http://raa.ruby-lang.org/search.rhtml?search=html+parser

I've since tried out a couple, and the option that seems to work best so far is Ned Konz' ruby-htmltools. Unfortunately, it does not seem to parse partial HTML documents well, so I've had to resort to parsing the whole thing, extracting a REXML document object, then using XPath to get to the content I care about. Seems like a waste of processing power when I can get the necessary markup in text with a simple file.grep operation and (theoretically) parse only the text that I want, but at least I have something that works until/unless something better comes along. Any recommendations?
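
The "parse everything, then XPath to the part I care about" step can be sketched roughly as follows. This is only an illustration, assuming the markup has already been turned into well-formed XML by some earlier step; the sample document and the `content` id are made up for the example and are not from any real page.

```ruby
require 'rexml/document'

# Hypothetical, already well-formed markup standing in for a cleaned-up page.
xml = <<~XML
  <html>
    <body>
      <div id="content">
        <p>First paragraph</p>
        <p>Second paragraph</p>
      </div>
      <div id="footer"><p>Ignored</p></div>
    </body>
  </html>
XML

doc = REXML::Document.new(xml)

# XPath selects only the nodes of interest, returned in document order.
paragraphs = REXML::XPath.match(doc, "//div[@id='content']/p").map(&:text)
paragraphs.each { |text| puts text }
```

The whole document is still parsed up front, which is the cost being complained about here; the XPath query only narrows what gets looked at afterwards.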

Kenosis wrote:

···

To help us narrow things down, can you tell us which Ruby HTML mods
you've tried out?

Ken

Kevin_Weller · 12 July 2006 00:55

Geoff Davis wrote:

You might find Rubyful Soup useful:

Rubyful Soup: "The brush has got entangled in it!"

It's a bit slow, but it is quite robust to bad HTML.

Ooooh, thanks, that might be just what the doctor ordered...especially if it handles a single text line of an HTML document. Right now I have a temporary solution that involves using ruby-htmltools to parse the entire document, then finding the part that I want with an XPath query. However, Rubyful Soup might turn out to be a better performer if it does what I want. Thanks so much!

Pere_Noel1 · 12 July 2006 10:10

except it doesn't wipe out the charset even when it's wrong (afaik)

···

assaf.arkin@gmail.com <assaf.arkin@gmail.com> wrote:

Tidy does all the hard stuff of dealing with bad HTML and straightening
it up.

--
une bévue

Kevin_Weller · 13 July 2006 15:25

That sounds like a very clean solution for future needs. I dig into Rails NEXT week for an upcoming Rails-based project (this one isn't)...nice to know what my options will be then. Thanks for the info!

···

assaf.arkin@gmail.com wrote:

Kevin,

I settled on using Tidy to clean up the HTML, then parsing it into a
tree using the HTML scanner that comes with Rails.

--
Kevin Weller
Information Technology Crucible
http://www.itcrucible.com

Kevin_Weller · 13 July 2006 15:30

Seth Thomas Rasmussen wrote:

It was just released recently, but Hpricot might be worth a look:

http://redhanded.hobix.com/inspect/okayGiveHpricot02AGo.html

Thanks Seth! Hpricot looks a little young, but promising. It apparently parses incomplete HTML documents, though I don't know how it handles unbalanced tags. I'll check it out...thanks again...

···

--
Kevin Weller
Information Technology Crucible
http://www.itcrucible.com

Ezra_Zygmuntowicz · 11 July 2006 22:06

Have a look at Hpricot, _why's new Ruby/C HTML parser. It's fast and has nice features.

http://redhanded.hobix.com/inspect/okayGiveHpricot02AGo.html

-Ezra

···

On Jul 11, 2006, at 2:51 PM, Phillip Hutchings wrote:

On 7/12/06, Bruno Celeste <bruno.celeste@gmail.com> wrote:

You can check rubyful soup library at
Rubyful Soup: "The brush has got entangled in it!"

I'll second Rubyful Soup. It's not the fastest, but it tolerates bad
HTML. I used it for an intranet spider, it worked with anything I
could find.
-- Phillip Hutchings
http://www.sitharus.com/

Alex_Young · 12 July 2006 10:58

Une bévue wrote:

···

assaf.arkin@gmail.com <assaf.arkin@gmail.com> wrote:

Tidy does all the hard stuff of dealing with bad HTML and straightening
it up.

except it doesn't wipe out the charset even when it's wrong (afaik)

I've had it generate an incorrect xml processing instruction once before, and that's a simple gsub! to fix.

--
Alex
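
A gsub!-style fix of the kind mentioned above might look something like this. It is a guess at the shape of the fix, not the actual code: it assumes the problem is a wrong encoding name in the XML declaration, and the sample string and the target encoding (utf-8) are invented for the example.

```ruby
# Hypothetical Tidy output whose XML declaration names the wrong encoding.
xml = %(<?xml version="1.0" encoding="iso-8859-1"?>\n<html><body><p>hello</p></body></html>).dup

# Rewrite only the encoding value in the leading declaration.
# \1 is everything up to "encoding=", \2 is whichever quote character was used.
xml.gsub!(/\A(<\?xml[^>]*encoding=)(["'])[^"']*\2/, '\1\2utf-8\2')

puts xml.lines.first
```

Anchoring the pattern with \A keeps the substitution from touching anything other than the declaration at the very start of the string.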

why_the_lucky_stiff1 · 13 July 2006 17:58

>> pp Hpricot("A test of unbalanced tags.")
=> #<Hpricot::Doc
 {elem

 {elem {text "A test of "} {elem {text "unbalanced"}} }
 {text " tags."}
 {bogusetag }}>

Hpricot will attempt to match tags, keeping around the unmatched end tags as
BogusEtag objects (as HTree does). When you output XHTML, these bogus tags are
ignored and you'll get back XHTML (though some work still has to be done to
be sure it's truly valid XHTML).

Thankfully, I've been getting some really mangled HTML from interested people
sent to my inbox. So today's Hpricot has become much better at fixing all
manner of nutty tags. If you encounter anything Hpricot can't do, send me an
e-mail with the URL you're trying to parse and the results you expect.

_why

···

On Fri, Jul 14, 2006 at 12:30:04AM +0900, Kevin Weller wrote:

Thanks Seth! Hpricot looks a little young, but promising. It
apparently parses incomplete HTML documents, don't know how it handles
imbalanced tags. I'll check it out...thanks again...
