Hpricot and xpath doesn't work like they should ?!?

anansi · 29 July 2007 18:20

hi,
I wanted to write a little console TV guide with Ruby and Hpricot. I installed the Firefox XPath Checker plugin and went to http://www.klack.de/TvEvening1.php3?HPTFRAME=%2FTvAtEvening.php3 . Then I checked the XPath of one of the channel cells, like ZDF, and got:

/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td/center/form/table/tbody/tr/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr[3]/th[1]

so I tried to parse the website for this and output the hits but I don't get any output. Here's the code:

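#!/usr/bin/env ruby

# Ruby's warning flag is all caps; $Verbose would just be an unused global.
$VERBOSE = true

require 'hpricot'
require 'net/http'

url = URI.parse('http://www.klack.de/TvEvening1.php3?HPTFRAME=%2FTvAtEvening.php3')
# request_uri keeps the query string; url.path alone would drop ?HPTFRAME=...
req = Net::HTTP::Get.new(url.request_uri)
res = Net::HTTP.start(url.host, url.port) { |http| http.request(req) }

# Parse the response body and print every node the XPath matches.
tv = Hpricot(res.body)
tv.search("/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td/center/form/table/tbody/tr/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr[3]/th[1]").each { |a| puts a }

#eof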

Am I using hpricot in the wrong way? I thought it could handle xpaths?


--
greets

one must still have chaos in oneself to be able to give birth to a dancing star

Phlip1 · 29 July 2007 19:34

anansi wrote:

Am I using hpricot in the wrong way? I thought it could handle xpaths?

Briefly, I suspect Hpricot uses an XPath subset invented on the fly to permit querying into the HTML node space.

(This isn't a bad thing; the alternative, REXML::XPath, cannot handle some well-formed XHTML [according to Tidy], and certainly can't handle traditional HTML.)

(BTW: When I tried to install Hpricot 6 (ruby) on Kubuntu, the require 'hpricot' refused to find it. This might indicate a broken .so file, so I switched to Windows.)

The best way to use XPath is to locate tags by a unique id attribute. (The page you used misuses its ids as if they were classes, so it's ill-formed. But that's not your problem here.)

Don't use long XPath chains (even if an XPath visualizer provides them), because these locate things by incidental features that could change when you hit the page again. Table elements could come and go on the fly.
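To see why, here is a small sketch with two hypothetical snapshots of a listing table; the second has one extra row. It uses REXML from the standard library only because the made-up fragments are well-formed XML. The markup, the constant names and the first_text helper are all assumptions for illustration, not anything taken from the real page:

```ruby
require 'rexml/document'

# Hypothetical snapshot 1: the programme cell sits in the second row.
BEFORE = <<~HTML
  <table>
    <tr><td>header</td></tr>
    <tr><td title="19:45 Hotel Zack und Cody">Hotel Zack und Cody</td></tr>
  </table>
HTML

# Hypothetical snapshot 2: a banner row has appeared above it.
AFTER = <<~HTML
  <table>
    <tr><td>header</td></tr>
    <tr><td>banner</td></tr>
    <tr><td title="19:45 Hotel Zack und Cody">Hotel Zack und Cody</td></tr>
  </table>
HTML

INDEXED = '/table/tr[2]/td[1]' # locates the cell by position
BY_ATTR = '//td[@title]'       # locates the cell by what it carries

# Return the text of the first node the XPath matches, or nil.
def first_text(markup, xpath)
  doc = REXML::Document.new(markup)
  node = REXML::XPath.first(doc, xpath)
  node && node.text
end

[BEFORE, AFTER].each do |snapshot|
  puts "indexed: #{first_text(snapshot, INDEXED).inspect}"
  puts "by attr: #{first_text(snapshot, BY_ATTR).inspect}"
end
```

The positional path silently lands on the banner cell in the second snapshot, while the attribute-based one still finds the programme.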

When I installed that XPath Checker (thanks for pointing it out!) and hit that page, your XPath selects ZDF, so this implicates Hpricot.
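One other explanation worth ruling out (an assumption, not verified against this page): browsers add an implicit tbody to every table in their DOM even when the HTML source has none, so a path copied from a browser tool can contain tbody steps that a parser working from the raw source never sees. A minimal sketch that drops those steps before the path is handed to Hpricot; strip_tbody is a hypothetical helper, not part of Hpricot:

```ruby
# The path XPath Checker reported in the browser.
BROWSER_PATH = '/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td/center/form' \
               '/table/tbody/tr/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr[3]/th[1]'

# Remove every tbody step (with or without an index) from an XPath string.
# Hypothetical helper: it assumes the raw HTML contains no tbody tags at all.
def strip_tbody(xpath)
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, '')
end

puts strip_tbody(BROWSER_PATH)
```

If the shortened path matches when passed to tv.search, the tbody steps were the culprit; if the page source really does contain tbody tags, stripping them will not help.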

Let's find a workaround. If I want to hit, say, "Hotel Zack und Cody", I use Firebug's Inspect Element context menu feature, and see that blurb has a <td title="19:45 Hotel Zack und Cody">. So if I XPath for things like that, we get:

//td[ @title ]

That sweeps for every td with a title attribute. (The View XPath feature should have an option to find minimal and unique paths based on attributes, not long obsessive paths based on indices.)

And that works in Hpricot, too, to select every cell with a title. Further poking and parsing should get you the raw TV listings.

tv.search("//td[ @title ]").each{ |a| p a}
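As a rough sketch of that further parsing, assuming every title attribute has the form "HH:MM programme name" like the one Firebug showed. The sample markup and the listings helper are made up for illustration, and REXML stands in for Hpricot here only because the sample is well-formed and REXML ships with Ruby; with Hpricot the selection step would be the tv.search call above:

```ruby
require 'rexml/document'

# Hypothetical fragment shaped like the cells described above.
SAMPLE = <<~HTML
  <table>
    <tr>
      <td title="19:45 Hotel Zack und Cody">Hotel Zack und Cody</td>
      <td>a cell without a title</td>
    </tr>
  </table>
HTML

# Select every td that has a title attribute and split the attribute
# into a start time and a programme name at the first space.
def listings(markup)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, '//td[@title]').map do |cell|
    time, name = cell.attributes['title'].split(' ', 2)
    { time: time, name: name }
  end
end

listings(SAMPLE).each { |entry| puts "#{entry[:time]}  #{entry[:name]}" }
```

Splitting on the first space only is a guess at the format; titles without a leading time would need extra handling.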

BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual data feeds available somewhere?


--
  Phlip
  Test Driven Ajax (on Rails) [Book]
  "Test Driven Ajax (on Rails)"
  assert_xpath, assert_javascript, & assert_ajax

anansi · 29 July 2007 20:10

Phlip wrote:

BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual data feeds available somewhere?

Thanks for the hint about the id attributes, but what do you mean by this? RSS feeds? I'm not aware of any ..


--
greets

one must still have chaos in oneself to be able to give birth to a dancing star

Phlip1 · 29 July 2007 20:10

anansi wrote:

Phlip wrote:

BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual
data feeds available somewhere?

thanks for your hint with the id-tags but what you mean with this here?
rss-feeds ? I'm not aware of any of them ..

That's what I mean - neither am I aware of any. But the TV guide services
get their data from somewhere, and (under the wild assumption that TV
programmers want you to find their shows and watch them) these feeds should
not be proprietary.

But note that electronic TV guides predate RSS...


--
Phlip
Test Driven Ajax (on Rails) [Book]
^ assert_xpath
O'Reilly Media - Technology and Business Training <-- assert_raise_message

Felix_Windt · 29 July 2007 20:32

http://www.klack.de/TvKlackRSS.php

Though there aren't any that fit your bill of "generic evening programming".


anansi · 30 July 2007 09:20

Felix Windt wrote:

http://www.klack.de/TvKlackRSS.php

Though there aren't any that fit your bill of "generic evening programming".

Yeah, I can't find an RSS feed for a generic TV guide either ..


--
greets

one must still have chaos in oneself to be able to give birth to a dancing star
