Is there a library out there that lets me parse HTML and use XPath
expressions against it? What is it?
Thanks
–
David Corbin dcorbin@machturtle.com
On Mon, 13 Oct 2003, David Corbin wrote:
REXML (http://www.germane-software.com/software/rexml/) and
HTML Parser2 (http://www.bike-nomad.com/ruby/)
…work well for me.
Chad
Are you saying you parse it with HTML Parser2, and then use the XPath support
from REXML?
–
David Corbin dcorbin@machturtle.com
On Mon, 13 Oct 2003, David Corbin wrote:
Sort of. I shouldn’t have said “HTML Parser2”. The right name seems to
be ruby-htmltools. It integrates with REXML and allows you to do this:
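require 'html/tree'   # ruby-htmltools parser (HTMLTree::Parser)
require 'html/xpath'  # adds as_rexml_document
parser = HTMLTree::Parser.new(true, true)
parser.feed(file.readlines.join)
tree = parser.tree.html_node.as_rexml_document
tree.elements.to_a('*/table').each do |element|
  # do something with element
end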
Chad
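The XPath half of this can be tried with REXML alone. The sketch below assumes the markup is already well-formed, which is the part ruby-htmltools takes care of in the snippet above. The sample document, the `table_ids` helper and the `id` values are made up for illustration. It also shows that an absolute path such as `/html/body/table` only matches tables at that exact position, while `//table` matches tables at any depth.

```ruby
require 'rexml/document'

# Hypothetical sample markup: well-formed, so REXML can parse it directly.
SAMPLE_XHTML = <<~XHTML
  <html>
    <body>
      <table id="outer"><tr><td>a</td></tr></table>
      <div>
        <table id="nested"><tr><td>b</td></tr></table>
      </div>
    </body>
  </html>
XHTML

# Return the id attribute of every <table> matched by the XPath expression.
def table_ids(markup, xpath)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map { |table| table.attributes['id'] }
end

puts table_ids(SAMPLE_XHTML, '//table').inspect
puts table_ids(SAMPLE_XHTML, '/html/body/table').inspect
```

If an expression returns nothing on a real page, trying `//table` first is a quick way to check whether the path or the parse is at fault.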
Chad Fowler wrote:
I take it the need for putting ruby-htmltools in the middle is that HTML generally isn’t clean XML? So I take it the tools do things like turn “” into “” and stick “
Could be very useful for a number of things!
Harry O.
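To illustrate why an XML parser cannot be pointed at typical HTML directly, here is a minimal sketch using REXML only. The two markup strings and the `well_formed?` helper are hypothetical examples, not taken from the thread: an unclosed `<br>` is fine in HTML but makes REXML raise a parse error, while the self-closed form parses.

```ruby
require 'rexml/document'

# True if REXML can parse the markup as XML, false if it raises a parse error.
def well_formed?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

html_style = '<p>one<br>two</p>'   # unclosed <br>: legal HTML, not XML
xml_style  = '<p>one<br/>two</p>'  # self-closed <br/>: well-formed XML

puts well_formed?(html_style)  # => false
puts well_formed?(xml_style)   # => true
```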
And if you get:
/usr/local/lib/site_ruby/1.6/rexml/child.rb:21:in `initialize': undefined method `add' for #<HTMLTree::Element:0x40331a58> (NameError)
        from /usr/local/lib/site_ruby/1.6/rexml/comment.rb:23:in `initialize'
        from /usr/local/lib/site_ruby/1.6/html/xpath.rb:50:in `new'
        from /usr/local/lib/site_ruby/1.6/html/xpath.rb:50:in `as_rexml_document'
        from /usr/local/lib/site_ruby/1.6/html/xpath.rb:36:in `as_rexml_document'
would you attribute that to a) bad HTML, b) library version mismatch, or c)
something else?
–
David Corbin dcorbin@machturtle.com
On Mon, 13 Oct 2003, David Corbin wrote:
Looks like a library mismatch. I haven’t seen this and I can’t reproduce
it. What was the HTML you were using?
Chad
Something I’m trying to screen-scrape from an on-line dictionary. It’s not
really well formed. If you like, I’ll send it to you off-line.
It’s hard to be sure, but it looks like REXML is either 2.5.6 or 2.7.1 (I’m
leaning toward the latter). The htmltools are the latest found on the site
you cited.
–
David Corbin dcorbin@machturtle.com
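One way to take the guesswork out of a suspected version mismatch is to ask the loaded library what it is and where it was loaded from. This is a minimal sketch, assuming the REXML in use exposes its version as `REXML::VERSION` or the older-style `REXML::Version` constant. Recent releases define both; whether a given 2.x release does is an assumption. The helper names are hypothetical.

```ruby
require 'rexml/rexml'
require 'rexml/document'

# Version string of the REXML that was actually loaded, or 'unknown'
# if neither constant is defined (assumption: one of them usually is).
def rexml_version
  if REXML.const_defined?(:VERSION)
    REXML::VERSION
  elsif REXML.const_defined?(:Version)
    REXML::Version
  else
    'unknown'
  end
end

# Directories the loaded REXML files came from. More than one directory
# can hint at two installations being mixed.
def rexml_load_dirs
  $LOADED_FEATURES.grep(%r{rexml/}).map { |path| File.dirname(path) }.uniq
end

puts "REXML version: #{rexml_version}"
puts "Loaded from:"
puts rexml_load_dirs
```

Running this with the same interpreter and load path as the failing script shows which copy of REXML that script actually picks up.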
On Mon, 13 Oct 2003, David Corbin wrote:
Go ahead and send it to me. I think it's going to end up being a library
version issue but let's just see.
Chad