Hpricot scraping returns nil

Sergei_Maertens · 20 November 2008 21:00

Good evening

First, I'll mention that I have used the search function and found some
useful topics, but I still can't find a solution due to my lack of Ruby
and Hpricot/XPath knowledge.

The problem is the following: from
http://users.telenet.be/weerstation.drongen/index.htm/Current_Vantage_Pro.htm
I need to scrape the Temperature and Today's Rain values (I need them for
an engineering project). Using XPather and Firebug, I looked up the XPath
to the Temperature value, which XPather reports as:
/html/body/table/tbody/tr[3]/td[2]/font/strong/small/font

But when I try to print the value in Ruby, I get nil.

Here is my code:

---------------------------------------------------------------------------
#!/usr/bin/ruby
require 'rubygems'
require 'open-uri'
require 'hpricot'

@url="http://users.telenet.be/weerstation.drongen/index.htm/Current_Vantage_Pro.htm"
xpath = "/html/body/table/tbody/tr[3]/td[2]/font/strong/small/font"
@response=""

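begin
  # Fetch the page and keep its body for parsing
  open(@url) do |file|
    puts "Fetched Document: #{file.base_uri}"
    @response = file.read
  end

  doc = Hpricot(@response)
  # Evaluate the XPath copied from Firebug and print the matched markup
  puts((doc / xpath).inner_html)
rescue StandardError => e
  puts e
end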

---------------------------------------------------------------------------

Since this returned nil, I checked how far down the path I could go
before getting nil. Apparently /html/body/table/tbody is already too far:
/html/body/table still returns output, but adding tbody returns nil.

I've read that I should try to rebuild the path now, but I can't figure
out how to do this. This is only my second serious Ruby script (only the
beginning of it, actually) and the first time I have used Hpricot.

I'm looking forward to replies, and I'm sorry to bother you with yet
another Hpricot-nil topic, but I'm kinda hopeless because of my
deadline...

Kind regards,
Sergei
--
Posted via http://www.ruby-forum.com/.

Jn_Jacob · 21 November 2008 00:31

It should work if you take the tbody out of the XPath. I have read
somewhere that tbody does not work with Hpricot; I don't know why.
Good luck.
xpath = "/html/body/table//tr[3]/td[2]/font/strong/small/font"


Peter_Szinek3 · 21 November 2008 08:42

Jn Jacob wrote:

It should work if you take the tbody out of the XPath. I have read
somewhere that tbody does not work with Hpricot; I don't know why.
Good luck.
xpath = "/html/body/table//tr[3]/td[2]/font/strong/small/font"

There is more to it than "tbody does not work for hpricot".

When an HTML parser (Firefox and Hpricot in this case) parses an HTML page, it has to build a tree from it (a.k.a. the DOM).
The problem is that a lot (most?) of the HTML out there is badly formed, so the process of building the DOM is very ambiguous: what if tags are not nested properly, or never closed? Every parser handles these cases a bit differently (that's one reason why you get the 'works in IE but not in FF' kind of problems). Firefox even makes some effort to make the parsed HTML standards compliant, for example by inserting a tbody tag inside a table tag if it's missing.

However, this is only one small difference in how Hpricot and Firefox parse the HTML and build the DOM tree (on which XPaths are evaluated). Hpricot tries to stay as close to Firefox as possible, but it doesn't always succeed (though _why said he considers these cases bugs).

Bottom line: you can't expect that an XPath copied from Firebug will work with Hpricot/Mechanize (though it mostly does, and removing the tbody increases your chances even further).
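
To make the tbody point concrete, here is a minimal sketch. It uses REXML from Ruby's standard distribution as a stand-in for Hpricot, on the assumption that, like Hpricot here, it evaluates the XPath against the markup as written and does not insert a tbody. The sample markup and its values are hypothetical, not taken from the real weather page, and the sketch uses a plain child step where the fix in this thread uses //tr[3].

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for the weather page. The markup
# itself contains no <tbody>, as is common in hand-written HTML.
SAMPLE_HTML = <<~HTML
  <html><body><table>
    <tr><td>Station</td><td>Drongen</td></tr>
    <tr><td>Time</td><td>21:00</td></tr>
    <tr><td>Temperature</td><td><font>7.2</font></td></tr>
  </table></body></html>
HTML

doc = REXML::Document.new(SAMPLE_HTML)

# XPath in the style Firebug reports: it includes the tbody that Firefox
# added to its own DOM, which does not exist in the source markup.
with_tbody = REXML::XPath.first(doc, '/html/body/table/tbody/tr[3]/td[2]/font')

# Same path with the tbody step dropped.
without_tbody = REXML::XPath.first(doc, '/html/body/table/tr[3]/td[2]/font')

puts with_tbody.inspect   # prints "nil": nothing matches the tbody step
puts without_tbody.text   # prints "7.2"
```

The first lookup finds nothing because no tbody element exists in the tree the parser built, which is the same symptom as in the original script.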

Cheers,
Peter

___
http://www.rubyrailways.com
http://scrubyt.org

Sergei_Maertens · 21 November 2008 11:34

Jn Jacob wrote:

It should work if you take the tbody out of the XPath. I have read
somewhere that tbody does not work with Hpricot; I don't know why.
Good luck.
xpath = "/html/body/table//tr[3]/td[2]/font/strong/small/font"

I'll try it in a minute, thank you for the answer.

@Peter, thank you for the very complete explanation.


Sergei_Maertens · 21 November 2008 12:30

Sergei Maertens wrote:

I'll try it in a minute, thank you for the answer.

And it does work! Thank you very much, Jn Jacob.
Now I only have to solve the '�' that appears instead of '°'.
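
One possible cause, offered only as an assumption since nothing in this thread confirms the page's encoding: the page is served as ISO-8859-1 while the output is interpreted as UTF-8, so the single byte 0xB0 for the degree sign is not valid on its own. A minimal sketch of transcoding with Ruby 1.9+ string encodings (Ruby 1.8 would need a different approach); the sample bytes are hypothetical:

```ruby
# Hypothetical raw bytes as they might arrive from an ISO-8859-1 page:
# 0xB0 is the degree sign in that encoding.
raw = "7.2 \xB0C".b

# Label the bytes with the encoding they are assumed to be in,
# then transcode them to UTF-8.
utf8 = raw.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts utf8   # prints "7.2 °C" on a UTF-8 terminal
```

If the page turns out to use a different encoding, the same two calls apply with that encoding's name in place of ISO-8859-1.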
