XPath and HTML

Is there a library out there that lets me parse HTML and use XPath
expressions against it? What is it?

Thanks

David Corbin dcorbin@machturtle.com

On Mon, 13 Oct 2003, David Corbin wrote:

> Is there a library out there that lets me parse HTML and use XPath
> expressions against it? What is it?

REXML (http://www.germane-software.com/software/rexml/) and
HTML Parser2 (http://www.bike-nomad.com/ruby/)

…work well for me.

Chad

On Sunday 12 October 2003 17:36, Chad Fowler wrote:

> REXML (http://www.germane-software.com/software/rexml/) and
> HTML Parser2 (http://www.bike-nomad.com/ruby/)

Are you saying you parse it with HTML Parser2, and then use the XPath support
from REXML?

David Corbin dcorbin@machturtle.com


On Mon, 13 Oct 2003, David Corbin wrote:

> Are you saying you parse it with HTML Parser2, and then use the XPath support
> from REXML?

Sort of. I shouldn’t have said “HTML Parser2”. The right name seems to
be ruby-htmltools. It integrates with REXML and allows you to do this:

    require 'html/tree'    # ruby-htmltools; require paths assumed
    require 'html/xpath'   # provides as_rexml_document

    parser = HTMLTree::Parser.new(true, true)
    parser.feed(file.readlines.join)   # file: an already-open File holding the HTML
    tree = parser.tree.html_node.as_rexml_document
    tree.elements.to_a('*/table').each do |element|
      # do something with element
    end

Chad
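
For readers who want to try the XPath half of this without ruby-htmltools installed, here is a minimal sketch using REXML alone. It assumes the input is already well-formed XHTML, standing in for whatever `as_rexml_document` returns; the sample markup and variable names are made up for illustration. It also shows why the path matters: a relative path such as `*/*/table` only reaches tables at a fixed depth, while `//table` finds them anywhere.

```ruby
require 'rexml/document'

# Well-formed XHTML used as a stand-in for the converted HTML tree
# (an assumption for this sketch, not output from ruby-htmltools).
xhtml = <<~HTML
  <html>
    <body>
      <table id="outer"><tr><td>
        <table id="inner"><tr><td>x</td></tr></table>
      </td></tr></table>
    </body>
  </html>
HTML

doc = REXML::Document.new(xhtml)

# Relative path: only <table> elements exactly three levels below the document node.
top_level_ids = doc.elements.to_a('*/*/table').map { |t| t.attributes['id'] }

# Descendant path: every <table>, however deeply nested.
all_ids = REXML::XPath.match(doc, '//table').map { |t| t.attributes['id'] }

puts "top-level tables: #{top_level_ids.inspect}"
puts "all tables:       #{all_ids.inspect}"
```

If a query comes back empty on a real page, widening the path to `//table` is a quick way to check whether the tables are simply nested deeper than expected.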

Chad Fowler wrote:

> Sort of. I shouldn’t have said “HTML Parser2”. The right name seems to
> be ruby-htmltools. It integrates with REXML and allows you to do this:
>
> parser = HTMLTree::Parser.new(true, true)
> parser.feed(file.readlines.join)
> tree = parser.tree.html_node.as_rexml_document

I take it the reason for putting ruby-htmltools in the middle is that HTML generally isn’t clean XML? So I take it the tools do things like turn “<br>” into “<br/>” and stick “</p>” at the end of paragraphs, that sort of thing?

Could be very useful for a number of things!

Harry O.
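
The point about HTML not being clean XML can be checked directly. The sketch below, with a hypothetical helper name and made-up sample strings, feeds REXML one fragment with an unclosed tag and one tidied version. It only illustrates why a forgiving HTML parser is needed in front of REXML; it does not show what ruby-htmltools itself emits.

```ruby
require 'rexml/document'

# Hypothetical helper: true if REXML can parse the markup as XML.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

sloppy = '<p>one<br>two</p>'    # typical HTML: <br> is never closed
tidy   = '<p>one<br/>two</p>'   # the same content as well-formed XML

puts "sloppy parses: #{well_formed?(sloppy)}"
puts "tidy parses:   #{well_formed?(tidy)}"
```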

On Sunday 12 October 2003 20:21, Chad Fowler wrote:

> parser = HTMLTree::Parser.new(true, true)
> parser.feed(file.readlines.join)
> tree = parser.tree.html_node.as_rexml_document

And if you get:

/usr/local/lib/site_ruby/1.6/rexml/child.rb:21:in `initialize': undefined method `add' for #<HTMLTree::Element:0x40331a58> (NameError)
        from /usr/local/lib/site_ruby/1.6/rexml/comment.rb:23:in `initialize'
        from /usr/local/lib/site_ruby/1.6/html/xpath.rb:50:in `new'
        from /usr/local/lib/site_ruby/1.6/html/xpath.rb:50:in `as_rexml_document'
        from /usr/local/lib/site_ruby/1.6/html/xpath.rb:36:in `as_rexml_document'

would you attribute that to a) bad HTML, b) a library version mismatch, or c)
something else?

David Corbin dcorbin@machturtle.com


On Mon, 13 Oct 2003, David Corbin wrote:

> would you attribute that to a) Bad HTML, b) library version mismatch, or c)
> something else?

Looks like a library mismatch. I haven’t seen this and I can’t reproduce
it. What was the HTML you were using?

Chad

On Monday 13 October 2003 07:14, Chad Fowler wrote:

> Looks like a library mismatch. I haven’t seen this and I can’t reproduce
> it. What was the HTML you were using?

Something I’m trying to screen-scrape from an on-line dictionary. It’s not
really well formed. If you like, I’ll send it to you off-line.

It’s hard to be sure, but it looks like REXML is either 2.5.6 or 2.7.1 (I’m
leaning toward the latter). The htmltools are the latest found on the site
you cited.

David Corbin dcorbin@machturtle.com
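
When a version mismatch is suspected, it can help to ask the loaded library itself rather than guess. A minimal sketch, assuming a REXML release that defines the `REXML::VERSION` constant (current releases do; very old ones may expose a differently named constant, so the lookup is guarded):

```ruby
require 'rexml/document'

# Read the version constant if this REXML release defines it.
rexml_version = REXML.const_defined?(:VERSION) ? REXML::VERSION : 'unknown'

# Show which file on disk was actually loaded, in case several copies are installed.
rexml_path = $LOADED_FEATURES.grep(%r{rexml/document}).first

puts "REXML version: #{rexml_version}"
puts "loaded from:   #{rexml_path}"
```

Comparing the reported path against the directories in the stack trace shows whether the copy being loaded is the one you expected.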


On Mon, 13 Oct 2003, David Corbin wrote:

> Something I'm trying to screen-scrape from an on-line dictionary. It's not
> really well formed. If you like, I'll send it to you off-line.
>
> It's hard to be sure, but it looks like Rexml is either 2.5.6 or 2.7.1 (I'm
> leaning toward the latter) The htmltools are the latest found on the site
> you cited.

Go ahead and send it to me. I think it's going to end up being a library
version issue but let's just see.

Chad