HTML DOM

Hi,

I'm trying to build an HTML page indexer in Ruby and I'd like to be able
to use DOM and/or XPath on a document. The application is currently
using REXML, but that seems to be a bit too strict and any deviation
from XML causes the engine to throw an error and quit.

Is there a way to make REXML more permissive or is there another library
that does HTML DOM and XPath?

···

--
Posted via http://www.ruby-forum.com/.

I'm trying to build an HTML page indexer in Ruby and I'd like to be able
to use DOM and/or XPath on a document. The application is currently
using REXML, but that seems to be a bit too strict and any deviation
from XML causes the engine to throw an error and quit.

Is there a way to make REXML more permissive

No.

or is there another library
that does HTML DOM and XPath?

Nokogiri and Hpricot seem to be the most popular.

Cheers

  robert

···

On 23.06.2009 17:55, Victor Tanvuia wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Hi,

I'm trying to build an HTML page indexer in Ruby and I'd like to be able
to use DOM and/or XPath on a document. The application is currently
using REXML

Yes, REXML can be awkward if you're used to using the DOM. IMHO.

Is there a way to make REXML more permissive or is there another library

There are libxml bindings for Ruby, but I recall that library missing
getElementsByTagName and getElementById. It does, however, have a method
to query the DOM via XPath.

Have you tried using REXML's SAX2 parser? I think it would be better
suited for your problem.

-Skye

···

On Jun 23, 8:55 am, Victor Tanvuia <victor.tanv...@tantanprod.com> wrote:

Thanks for the help. I've decided to go for Hpricot and it works rather
well now. Don't know why, but for some reason I was reluctant to go for
that. Anyway, it's great... I love it. It feels like jQuery :)

···

--
Posted via http://www.ruby-forum.com/.

Hi,

I'm trying to build an HTML page indexer in Ruby and I'd like to be able
to use DOM and/or XPath on a document. The application is currently
using REXML

Yes, REXML can be awkward if you're used to using the DOM. IMHO.

Why do you say that? REXML provides an XML DOM in a similar way to other XML libs. You can even use XPath queries.

Is there a way to make REXML more permissive or is there another library

There are libxml bindings for Ruby, but I recall that library missing
getElementsByTagName and getElementById. It does, however, have a method
to query the DOM via XPath.

libxml won't help as Victor is not processing XML.

Have you tried using REXML's SAX2 parser? I think it would be better
suited for your problem.

No, his problem is that he used an XML tool to process HTML. While many web pages are valid XML, not all are, owing to the history of browser development. It is therefore better to use a tool suited to the job, i.e. one capable of parsing HTML that is not valid XML.
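
To illustrate the point, here is a minimal sketch of how REXML reacts to ordinary tag-soup HTML. The sample markup and the `well_formed?` helper are made up for this example; the only REXML calls assumed are `REXML::Document.new` and the `REXML::ParseException` it raises on malformed input.

```ruby
require "rexml/document"

# Typical browser-friendly HTML: the <br> is never closed, so this is
# not well-formed XML. (Sample markup invented for illustration.)
html = "<html><body><p>first line<br>second line</p></body></html>"

# Hypothetical helper: true if REXML accepts the markup, false if it raises.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

puts well_formed?(html)                       # the unclosed <br> trips the parser
puts well_formed?(html.sub("<br>", "<br/>"))  # the XHTML-style form is accepted
```

Rescuing the exception only tells you the page is not XML; it does not give you a usable tree, which is why an HTML-aware parser is the better fit here.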

Kind regards

  robert

···

On 23.06.2009 19:24, Skye Shaw!@#$ wrote:

On Jun 23, 8:55 am, Victor Tanvuia <victor.tanv...@tantanprod.com> wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

The libxml2 C library has contained a correcting HTML processor since its
first release in April 2000. libxml2 is quite capable of processing
broken HTML.

libxml-ruby and Nokogiri both provide a Ruby API for libxml2.

···

On Wed, Jun 24, 2009 at 05:50:47AM +0900, Robert Klemme wrote:

--
Aaron Patterson
http://tenderlovemaking.com/

Whoa... and right after you recommended the libxml-based Nokogiri.

I have been using libxml2 (in various forms) for years to parse HTML.
I find it to be the best HTML parser out there. It's also completely
XPath 1.0 compliant--my XPaths tend to break in Hpricot.

Both libxml-ruby and Nokogiri have similar functionality. I like the
Nokogiri API a little better.

-- Mark.

···

On Jun 23, 4:48 pm, Robert Klemme <shortcut...@googlemail.com> wrote:

libxml won't help as Victor is not processing XML.

>> Hi,

>> I'm trying to build an HTML page indexer in Ruby and I'd like to be able
>> to use DOM and/or XPath on a document. The application is currently
>> using REXML

> Yes, REXML can be awkward if you're used to using the DOM. IMHO.

Why do you say that? REXML provides an XML DOM in a similar way to other
XML libs. You can even use XPath queries.

Not sure what you mean by similar. Similar in that there is a tree of
elements that can be manipulated, but not similar to anything called
DOM.

In REXML, an Element is an REXML::Element; which is a REXML::Parent
which is a REXML::Child (huh?) which includes REXML::Node.
There is no NodeList, createTextNode(), getElementById(), etc...

To get an element by its ID, I'd have to say something like:

my_document.root.elements.each("//*[@id='crap']") do |element|
  # do something with crap
end

I would have liked to be able to use the DOM when using REXML;
unfortunately, REXML doesn't really support it.
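
For what it's worth, the missing DOM calls can be approximated with thin wrappers around REXML's XPath support. This is only a sketch: `get_element_by_id` and `get_elements_by_tag_name` are hypothetical helpers, not part of REXML, and the sample document is made up. The REXML calls relied on are `REXML::XPath.first` and `REXML::XPath.match`.

```ruby
require "rexml/document"

# Well-formed sample document, invented for illustration.
xml = <<~XML
  <html>
    <body>
      <div id="header">Header</div>
      <div id="content">Content</div>
    </body>
  </html>
XML

doc = REXML::Document.new(xml)

# Hypothetical helper emulating DOM getElementById.
# Assumes the id contains no single quote, since it is interpolated
# straight into the XPath expression.
def get_element_by_id(doc, id)
  REXML::XPath.first(doc, "//*[@id='#{id}']")
end

# Hypothetical helper emulating DOM getElementsByTagName; returns an Array.
def get_elements_by_tag_name(doc, name)
  REXML::XPath.match(doc, "//#{name}")
end

puts get_element_by_id(doc, "content").text
puts get_elements_by_tag_name(doc, "div").size
```

This only helps with well-formed XML, of course; it does nothing about REXML rejecting broken HTML.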

>> Is there a way to make REXML more permissive or is there another library

> There are libxml bindings for Ruby, but I recall that library missing
> getElementsByTagName and getElementById. It does, however, have a method
> to query the DOM via XPath.

libxml won't help as Victor is not processing XML.

That should be fine.

> Have you tried using REXML's SAX2 parser? I think it would be better
> suited for your problem.

No, his problem is that he used an XML tool to process HTML.

You're right. He should never have been using REXML.

-Skye

···

On Jun 23, 1:48 pm, Robert Klemme <shortcut...@googlemail.com> wrote:

On 23.06.2009 19:24, Skye Shaw!@#$ wrote:
> On Jun 23, 8:55 am, Victor Tanvuia <victor.tanv...@tantanprod.com> wrote:

:-} Sorry, I did not know that Nokogiri was based on libxml. Thanks to you and Aaron for the update! Skye seemed to suggest only XML tools, which are clearly not suited for the job. I'll shut up now.

Kind regards

  robert

···

On 24.06.2009 00:25, Mark Thomas wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/