XPath and HTML

David_Corbin1 · 12 October 2003 20:07

Is there a library out there that lets me parse HTML and use XPath
expressions against it? What is it?

Thanks

···

–
David Corbin dcorbin@machturtle.com

Chad_Fowler · 12 October 2003 21:36

On Mon, 13 Oct 2003, David Corbin wrote:

> Is there a library out there that lets me parse HTML and use XPath
> expressions against it? What is it?
>
> Thanks

REXML (http://www.germane-software.com/software/rexml/) and
HTML Parser2 (http://www.bike-nomad.com/ruby/)

…work well for me.

Chad

David_Corbin1 · 12 October 2003 23:45

Are you saying you parse it with HTML Parser2, and then use the XPath support
out of the REXML?

–
David Corbin dcorbin@machturtle.com

Chad_Fowler · 13 October 2003 00:21

On Mon, 13 Oct 2003, David Corbin wrote:

> Are you saying you parse it with HTML Parser2, and then use the XPath support
> out of the REXML?

Sort of. I shouldn’t have said “HTML Parser2”. The right name seems to
be ruby-htmltools. It integrates with REXML and allows you to do this:

parser = HTMLTree::Parser.new(true, true)
parser.feed(file.readlines.join)
tree = parser.tree.html_node.as_rexml_document
tree.elements.to_a('*/table').each do |element|
  # do something with element
end

Chad
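
For readers without ruby-htmltools at hand, the XPath half of this recipe can be tried with REXML alone. What follows is only a minimal sketch, assuming the markup is already well-formed; the sample document, the id values, and the variable names are invented for illustration and are not part of Chad's code.

```ruby
require 'rexml/document'

# Hypothetical, already well-formed sample markup (not from the thread).
xhtml = <<~XHTML
  <html>
    <body>
      <table id="first"><tr><td>a</td></tr></table>
      <div><table id="nested"><tr><td>b</td></tr></table></div>
    </body>
  </html>
XHTML

doc = REXML::Document.new(xhtml)

# '//table' selects table elements at any depth in the document.
all_ids = REXML::XPath.match(doc, '//table').map { |t| t.attributes['id'] }

# A relative path is evaluated from the element it is called on,
# so 'body/table' only selects tables that are direct children of <body>.
direct_ids = doc.root.elements.to_a('body/table').map { |t| t.attributes['id'] }

puts "any depth:   #{all_ids.inspect}"
puts "under body:  #{direct_ids.inspect}"
```

The same distinction applies to a path such as '*/table': it is relative to whatever node it is evaluated against, so which tables it finds depends on the shape of the tree the parser hands back.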

Harry_Ohlsen1 · 13 October 2003 00:36

Chad Fowler wrote:

> Sort of. I shouldn’t have said “HTML Parser2”. The right name seems to
> be ruby-htmltools. It integrates with REXML and allows you to do this:
>
> parser = HTMLTree::Parser.new(true, true)
> parser.feed(file.readlines.join)
> tree = parser.tree.html_node.as_rexml_document
> tree.elements.to_a('*/table').each do |element|
>   # do something with element
> end

I take it the need for putting ruby-htmltools in the middle is that HTML generally isn’t clean XML? So the tools do things like turn “<br>” into “<br/>” and stick “</p>” at the end of paragraphs, that sort of thing?

Could be very useful for a number of things!

Harry O.
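
The point about HTML not being clean XML can be checked directly. The sketch below is an illustration only: the two sample strings and the helper name `well_formed?` are made up here, and it uses nothing but REXML itself, so it shows the problem rather than how ruby-htmltools solves it.

```ruby
require 'rexml/document'

# Hypothetical inputs: the first leaves <br> unclosed, as HTML allows;
# the second closes it the XML way.
tag_soup = '<html><body><p>one<br>two</p></body></html>'
clean    = '<html><body><p>one<br/>two</p></body></html>'

# Returns true if REXML can parse the markup as XML, false if it
# raises a parse error (for example on a mismatched end tag).
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

puts "tag soup parses as XML: #{well_formed?(tag_soup)}"
puts "cleaned markup parses:  #{well_formed?(clean)}"
```

An HTML-aware parser sitting in front of REXML is what lets XPath queries run against pages like the first string.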

David_Corbin1 · 13 October 2003 01:11

And if you get:

/usr/local/lib/site_ruby/1.6/rexml/child.rb:21:in `initialize': undefined method `add' for #<HTMLTree::Element:0x40331a58> (NameError)
        from /usr/local/lib/site_ruby/1.6/rexml/comment.rb:23:in `initialize'
        from /usr/local/lib/site_ruby/1.6/html/xpath.rb:50:in `new'
        from /usr/local/lib/site_ruby/1.6/html/xpath.rb:50:in `as_rexml_document'
        from /usr/local/lib/site_ruby/1.6/html/xpath.rb:36:in `as_rexml_document'

would you attribute that to a) Bad HTML, b) library version mismatch, or c)
something else?

–
David Corbin dcorbin@machturtle.com

Chad_Fowler · 13 October 2003 11:14

On Mon, 13 Oct 2003, David Corbin wrote:

> And if you get:
>
> /usr/local/lib/site_ruby/1.6/rexml/child.rb:21:in `initialize': undefined method `add' for #<HTMLTree::Element:0x40331a58> (NameError)
>         from /usr/local/lib/site_ruby/1.6/rexml/comment.rb:23:in `initialize'
>         from /usr/local/lib/site_ruby/1.6/html/xpath.rb:50:in `new'
>         from /usr/local/lib/site_ruby/1.6/html/xpath.rb:50:in `as_rexml_document'
>         from /usr/local/lib/site_ruby/1.6/html/xpath.rb:36:in `as_rexml_document'
>
> would you attribute that to a) Bad HTML, b) library version mismatch, or c)
> something else?

Looks like a library mismatch. I haven’t seen this and I can’t reproduce
it. What was the HTML you were using?

Chad

David_Corbin1 · 13 October 2003 23:36

On Monday 13 October 2003 07:14, Chad Fowler wrote:

> Looks like a library mismatch. I haven’t seen this and I can’t reproduce
> it. What was the HTML you were using?

Something I’m trying to screen-scrape from an on-line dictionary. It’s not
really well formed. If you like, I’ll send it to you off-line.

It’s hard to be sure, but it looks like REXML is either 2.5.6 or 2.7.1 (I’m
leaning toward the latter). The htmltools are the latest found on the site
you cited.
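
When the installed version is hard to pin down, asking the library itself is usually quicker than guessing from file dates. This is a hedged sketch: `REXML::VERSION` is the constant exposed by current REXML releases, and whether the 2003-era releases discussed here offered the same constant is not something this example assumes.

```ruby
require 'rexml/document'

# Version string reported by the REXML that was actually loaded.
puts "REXML version: #{REXML::VERSION}"

# Paths of the REXML files Ruby loaded, which shows which copy on the
# load path won when more than one installation is present.
rexml_files = $LOADED_FEATURES.grep(%r{rexml/})
puts rexml_files.first(3)
```

Seeing the load paths is useful in a case like this one, where a traceback points into site_ruby and a second copy of the library may exist elsewhere.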

–
David Corbin dcorbin@machturtle.com

Chad_Fowler · 13 October 2003 23:58

On Mon, 13 Oct 2003, David Corbin wrote:

> Something I'm trying to screen-scrape from an on-line dictionary. It's not
> really well formed. If you like, I'll send it to you off-line.
>
> It's hard to be sure, but it looks like REXML is either 2.5.6 or 2.7.1 (I'm
> leaning toward the latter). The htmltools are the latest found on the site
> you cited.

Go ahead and send it to me. I think it's going to end up being a library
version issue but let's just see.

Chad
