Hpricot and XPath don't work like they should?!

Hi,
I wanted to write a little console TV guide with Ruby and Hpricot. I installed the Firefox XPath Checker plugin and went to http://www.klack.de/TvEvening1.php3?HPTFRAME=%2FTvAtEvening.php3 . Then I checked the XPath of one of the channel header fields, such as ZDF, and got:

/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td/center/form/table/tbody/tr/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr[3]/th[1]

So I tried to search the page with this path and print the hits, but I don't get any output. Here's the code:

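#!/usr/bin/env ruby

# Ruby's warning flag is the all-caps global $VERBOSE ($Verbose is just an unused variable)
$VERBOSE = true

require 'hpricot'
require 'net/http'

url = URI.parse('http://www.klack.de/TvEvening1.php3?HPTFRAME=%2FTvAtEvening.php3')
# request_uri keeps the query string; url.path alone would drop ?HPTFRAME=...
req = Net::HTTP::Get.new(url.request_uri)
res = Net::HTTP.start(url.host, url.port) { |http| http.request(req) }

# Parse the response body and print every node the XPath selects
tv = Hpricot(res.body)
tv.search("/html/body/table/tbody/tr[2]/td[2]/table/tbody/tr/td/center/form/table/tbody/tr/td[2]/table[2]/tbody/tr/td/table[2]/tbody/tr[3]/th[1]").each { |a| puts a }

#eof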

Am I using hpricot in the wrong way? I thought it could handle xpaths?

--
greets

                     one must still have chaos in oneself to be able to give birth to a dancing star

anansi wrote:

Am I using hpricot in the wrong way? I thought it could handle xpaths?

Briefly, I suspect Hpricot uses an XPath subset invented on the fly to permit querying into the HTML node space.

(This isn't a bad thing; the alternative, REXML::XPath, cannot handle some well-formed XHTML [according to Tidy], and certainly can't handle traditional HTML.)

(BTW: When I tried to install Hpricot 6 (ruby) on Kubuntu, the require 'hpricot' refused to find it. This might indicate a broken .so file, so I switched to Windows.)

The best way to use XPath is to locate tags by a unique id attribute. (The page you used abuses IDs as if they were classes, so it's ill-formed. But that's not your problem here.)

Don't use long XPath chains (even if an XPath visualizer provides them), because these locate things by incidental features that could change when you hit the page again. Table elements could come and go on the fly.
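
To illustrate why index chains are fragile, here is a minimal sketch. It uses REXML rather than Hpricot so it runs with the standard library alone, and both "pages" are made-up markup; the id "zdf" is hypothetical and not taken from the real site.

```ruby
require 'rexml/document'

# Two made-up versions of the same page; the second gained a banner row.
# The id "zdf" is hypothetical, not taken from the real site.
before = <<~XHTML
  <html><body><table>
    <tr><th>Channel</th></tr>
    <tr><th id="zdf">ZDF</th></tr>
  </table></body></html>
XHTML

after = <<~XHTML
  <html><body><table>
    <tr><th>Banner</th></tr>
    <tr><th>Channel</th></tr>
    <tr><th id="zdf">ZDF</th></tr>
  </table></body></html>
XHTML

index_path = '/html/body/table/tr[2]/th[1]' # depends on row position
id_path    = '//*[@id="zdf"]'               # depends only on the attribute

# Return the text of the first node the path selects, or nil if none.
def first_text(xhtml, path)
  node = REXML::XPath.first(REXML::Document.new(xhtml), path)
  node && node.text
end

[before, after].each do |page|
  puts "index: #{first_text(page, index_path).inspect}  " \
       "id: #{first_text(page, id_path).inspect}"
end
```

In the second version the index path drifts to a different cell, while the id path still finds the same one.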

When I installed that XPath Checker (thanks for pointing it out!) and hit that page, your XPath selects ZDF, so this implicates Hpricot.
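
One other possibility worth ruling out, as an assumption I have not verified against this page: browsers add tbody elements to the DOM for tables whose source omits them, so a path copied from a browser tool can contain steps that the raw HTML lacks. A minimal sketch of the effect, using REXML on made-up markup (not Hpricot, and not the real page):

```ruby
require 'rexml/document'

# Made-up source markup without <tbody>, as a server might send it.
raw = <<~XHTML
  <html><body><table>
    <tr><th>Channel</th></tr>
    <tr><th>ZDF</th></tr>
  </table></body></html>
XHTML

doc = REXML::Document.new(raw)

# Path as a browser tool might report it (the browser's DOM has a tbody).
browser_path = '/html/body/table/tbody/tr[2]/th[1]'
# Same path with the tbody step dropped, matching the raw source.
source_path  = '/html/body/table/tr[2]/th[1]'

browser_hits = REXML::XPath.match(doc, browser_path)
source_hits  = REXML::XPath.match(doc, source_path)

puts "with tbody:    #{browser_hits.size} hit(s)"
puts "without tbody: #{source_hits.map(&:text).inspect}"
```

If the raw HTML of the real page has no tbody tags, dropping those steps from the long path would be a cheap first experiment.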

Let's find a workaround. If I want to hit, say, "Hotel Zack und Cody", I use Firebug's Inspect Element context menu feature, and see that blurb has a <td title="19:45 Hotel Zack und Cody">. So if I XPath for things like that, we get:

    //td[ @title ]

That sweeps for every td with a title attribute. (The View XPath feature should have an option to find minimal and unique paths based on attributes, not long obsessive paths based on indices.)

And that works in Hpricot, too, to select every cell with a title. Further poking and parsing should get you the raw TV listings.

  tv.search("//td[ @title ]").each{ |a| p a}
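
As a rough sketch of the "further poking and parsing", the title attributes can be split into a time and a show name. This version uses REXML on a made-up, well-formed snippet so it runs without Hpricot or network access; only the "19:45 Hotel Zack und Cody" cell is modelled on the real page, the rest is invented sample data.

```ruby
require 'rexml/document'

# Made-up sample; the real page's markup is an assumption here.
sample = <<~XHTML
  <table>
    <tr>
      <td title="19:45 Hotel Zack und Cody">Hotel Zack und Cody</td>
      <td title="20:15 Sample Show">Sample Show</td>
    </tr>
    <tr><td>a cell without a title</td></tr>
  </table>
XHTML

doc = REXML::Document.new(sample)

# Select every td that carries a title attribute, wherever it sits.
listings = REXML::XPath.match(doc, '//td[@title]').map do |td|
  # Split "19:45 Hotel Zack und Cody" into the time and the rest.
  time, show = td.attributes['title'].split(' ', 2)
  { time: time, show: show }
end

listings.each { |l| puts "#{l[:time]}  #{l[:show]}" }
```

With Hpricot the same idea would go inside the block of the search call above; the attribute format "HH:MM name" is assumed from the single example cell.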

BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual data feeds available somewhere?

--
  Phlip
  Test Driven Ajax (on Rails) [Book]
  "Test Driven Ajax (on Rails)"
  assert_xpath, assert_javascript, & assert_ajax

Phlip wrote:

BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual data feeds available somewhere?

Thanks for the hint about the id attributes, but what do you mean by this? RSS feeds? I'm not aware of any.

--
greets

anansi wrote:

Phlip wrote:

BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual
data feeds available somewhere?

Thanks for the hint about the id attributes, but what do you mean by this?
RSS feeds? I'm not aware of any.

That's what I mean - neither am I aware of any. But the TV guide services
get their data from somewhere, and (under the wild assumption that TV
programmers want you to find their shows and watch them) these feeds should
not be proprietary.

But note that electronic TV guides predate RSS...

--
Phlip

http://www.klack.de/TvKlackRSS.php

Though there aren't any that fit your bill of "generic evening programming".
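
If one of those feeds turns out to be usable, reading it is simpler than scraping. A minimal sketch, assuming a plain RSS 2.0 layout; the feed below is made up, and the real klack.de feeds may be structured differently.

```ruby
require 'rexml/document'

# Made-up RSS 2.0 document standing in for a downloaded feed.
feed = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Sample TV feed</title>
      <link>http://example.com/</link>
      <description>Hypothetical listings</description>
      <item><title>19:45 Hotel Zack und Cody</title></item>
      <item><title>20:15 Sample Show</title></item>
    </channel>
  </rss>
XML

doc = REXML::Document.new(feed)

# Collect the text of every item title in document order.
titles = REXML::XPath.match(doc, '/rss/channel/item/title').map(&:text)

titles.each { |t| puts t }
```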

-----Original Message-----
From: anansi [mailto:kazaam@oleco.net]
Sent: Sunday, July 29, 2007 1:10 PM
To: ruby-talk ML
Subject: Re: hpricot and xpath doesn't work like they should ?!?

Phlip wrote:
> BTW scraping TV guide listings is ... kind'a tacky. Aren't the actual
> data feeds available somewhere?
Thanks for the hint about the id attributes, but what do you mean by this?
RSS feeds? I'm not aware of any.

--
greets

Felix Windt wrote:

http://www.klack.de/TvKlackRSS.php

Though there aren't any that fit your bill of "generic evening programming".

Yeah, I can't find an RSS feed for a generic TV guide either.

--
greets