Hpricot syntax different from Xpath?

Celine · 18 December 2007 22:04

Hi all

I'm trying to parse a page with Hpricot in order to retrieve a value.

I use XPather (a Firefox extension) to get the path of this value, but
when I use that path with Hpricot, it doesn't work. I have to change it
to make it work.

Here's the path given by XPather:

/html/body/div[1]/div[2]/div[1]/div[2]/div[1]/div[2]/div[1]/div/div[1]/div[1]/table/tbody/tr[1]

And here's what I have to write so that Hpricot understands it:

/html/body/div/div/div/div/div/div/div/div/div/div/table/tr

Could you explain why I have to write it that way?

Thanks in advance

Chris_Shea · 18 December 2007 22:21

Well, it depends. It'd be helpful to see the page you're working with.
You might want to try asking the Hpricot mailing list as well (To
join: Send a message to hpricot@code.whytheluckystiff.net Cc:
why@whytheluckystiff.net).

Chris

Thomas_Wieczorek · 18 December 2007 22:25

Hpricot doesn't support the whole XPath syntax. You can write a little
function that translates XPath expressions with brackets into Hpricot
expressions. I wrote one, but my SVN server is down right now and I
can't get to it. Drop me a reply if you still need it.

Also, Firefox adds some HTML tags that are missing from the page source.
I ran into this when I had to write a little script: <tbody> gets added
inside <table>, and probably a few other things that I didn't track
down. You can see the difference when you download the page with
open-uri. Not many pages include <tbody> in their source.
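
A minimal sketch of the <tbody> half of that idea, assuming the page source itself contains no <tbody> elements (if it does, stripping them would break the path). The helper name `strip_browser_tbody` is hypothetical: it is not part of Hpricot, and it is not Thomas's function, which is not shown in this thread.

```ruby
# Hypothetical helper (not part of Hpricot): removes the tbody steps that
# Firefox-based tools add to a path, with or without a positional index.
def strip_browser_tbody(xpath)
  xpath.gsub(%r{/tbody(?:\[\d+\])?(?=/|\z)}, '')
end

puts strip_browser_tbody('/html/body/div[1]/table/tbody/tr[1]')
```

This only handles the inserted <tbody>; wrong div indices are a separate problem and need to be checked against the downloaded source.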

Celine · 18 December 2007 22:34

Hi Chris, thanks for your answer.
Here is the page I'm working with: http://finance.yahoo.com
I want to retrieve the Nasdaq value (top left of the page).
I sent the same message to the Hpricot ML this afternoon, but there has
been no answer so far.

Celine · 18 December 2007 22:39

Hi Thomas

I'm very interested in your function. Do you know where I can find the
differences between XPath syntax and Hpricot syntax? What reference
did you use to write your function?

Celine

Chris_Shea · 18 December 2007 22:43

> Hi Chris, thanks for your answer.
> Here is the page I'm working with: http://finance.yahoo.com
> I want to retrieve the Nasdaq value (top left of the page).

I see. It's pretty easy to get using element attributes, which are
resilient to page changes. If Yahoo decides to add a div to the
hierarchy, or add a new exchange, this shouldn't suddenly fail:

# assuming doc is the Hpricot object for finance.yahoo.com
doc.at('tr[@title="Nasdaq"]/td[2]')

Looking at the page source, the span element that contains the value
actually has an id (yfs_l10_^ixic), but it doesn't look stable, does
it?

> I sent the same message on Hpricot ML this afternoon, actually no
> answer.

I'm on the Hpricot ML and never saw it. Maybe a hiccup somewhere?

HTH,
Chris
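
Chris's point about attribute selectors surviving layout changes can be checked offline. The sketch below swaps in REXML (Ruby's bundled XML library) for Hpricot so that it runs without extra gems, and it uses invented markup rather than Yahoo's real page. All constant and method names are made up for the example.

```ruby
require 'rexml/document'

# Invented markup for illustration only; this is not Yahoo's actual page.
BEFORE = '<html><body><div><table>' \
         '<tr title="Nasdaq"><td>Nasdaq</td><td>2,601.01</td></tr>' \
         '</table></div></body></html>'

# The same data after a hypothetical redesign wraps the table in one more div.
AFTER = '<html><body><div><div><table>' \
        '<tr title="Nasdaq"><td>Nasdaq</td><td>2,601.01</td></tr>' \
        '</table></div></div></body></html>'

POSITIONAL   = '/html/body/div/table/tr[1]/td[2]'
BY_ATTRIBUTE = '//tr[@title="Nasdaq"]/td[2]'

# Returns the text of the first node matching xpath, or nil if nothing matches.
def value_at(markup, xpath)
  node = REXML::XPath.first(REXML::Document.new(markup), xpath)
  node && node.text
end

p value_at(BEFORE, POSITIONAL)
p value_at(AFTER, POSITIONAL)
p value_at(AFTER, BY_ATTRIBUTE)
```

The positional path depends on every ancestor staying in place, while the attribute path only depends on the row keeping its title.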

Thomas_Wieczorek · 18 December 2007 23:35

> I'm very interested in your function.

I'll post it as soon as the SVN server is up again.

> Do you know where I can find
> differences between XPath syntax and Hpricot syntax ? What reference
> did you use to write your function ?

I used http://code.whytheluckystiff.net/hpricot/wiki/SupportedXpathExpressions
and related pages to get started. I found the table/tbody issue because
I was stuck and thought I had done something wrong, until I downloaded
the page without Firefox, using open-uri.

Vitor_P · 19 December 2007 09:53

Hi, Celine.

I know it's not nearly as fun as screen-scraping, but you can get the value
for Nasdaq (and many other quotes) on Yahoo! Finance by querying the right
URL for the CSV. The current value can be obtained by fetching:

http://download.finance.yahoo.com/d/quotes.csv?s=[name]&f=sl1d1t1c1ohgv&e=.csv

You just have to replace [name] with %5EIXIC for Nasdaq. Historical data is
available (closings only) at:

http://ichart.finance.yahoo.com/table.csv?&s=[name]&a=[start_month]&b=[start_day]&c=[start_year]&d=[end_month]&e=[end_day]&f=[end_year]&g=d&ignore=.csv

Just replace [name] with the index or stock you wish to query and each
bracketed date info with integers.

I've replied to a topic before that involved Yahoo! Finance, but it was
specifically about searching for a symbol. Since it's not your case, here's
hoping that directly fetching it will suffice.
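
A small sketch of building the current-quote URL in Ruby, so that the ^ in ^IXIC is percent-encoded automatically. The base URL and the f/e parameters are copied from the message above; the helper name `quotes_csv_url` is hypothetical, and the endpoint itself may no longer respond, so the example only constructs the URL and does not fetch it.

```ruby
require 'uri'

# Base URL as given in the message above; it may no longer be served.
QUOTES_CSV = 'http://download.finance.yahoo.com/d/quotes.csv'

# Hypothetical helper: builds the quote URL for one symbol, e.g. "^IXIC".
def quotes_csv_url(symbol)
  query = URI.encode_www_form('s' => symbol, 'f' => 'sl1d1t1c1ohgv', 'e' => '.csv')
  "#{QUOTES_CSV}?#{query}"
end

puts quotes_csv_url('^IXIC')
```

If the URL does respond, the body could be read with open-uri and split with Ruby's csv library.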

--
Vitor Peres (dodecaphonic)
------------------------------------
http://twitter.com/dodecaphonic

Chris_Shea · 18 December 2007 22:54

I take that back. That id is almost definitely stable.

doc.at('span[@id="yfs_l10_^ixic"]')

Chris

Celine · 19 December 2007 21:45

Hi Vitor, thank you very much.
But, as you said, it isn't as much fun, is it?
(I didn't know that trick, though. Thanks!)

Celine · 19 December 2007 21:45

Yes, thanks, it works.
There's something I can't understand: in the XPath expression I posted
earlier, when a node has several child DIVs, I access them with an index
(div[2]...), but in the Hpricot syntax the DIVs aren't accessed by
index. So what trick does Hpricot use to locate the right div?

Chris_Shea · 19 December 2007 22:30

I'm not sure I understand. Hpricot certainly can access elements that way:

doc = Hpricot('<body><div>one</div><div>two</div></body>')

doc.at('body/div[1]').inner_text # => "one"
doc.at('body/div[2]').inner_text # => "two"
doc.at('body/div:eq(0)').inner_text # => "one"
doc.at('body/div:eq(1)').inner_text # => "two"

Chris

Celine · 19 December 2007 23:10

Look:

doc = Hpricot(open("http://finance.yahoo.com"))

(XPath syntax with indexed DIVs, as given by XPather)

doc.at('html/body/div[1]/div[2]/div[1]/div[2]/div[1]/div[2]/div[1]/div/div[1]/div[1]/table/tr[3]/td[2]/span').inner_text
=> NoMethodError: undefined method `inner_text' for nil:NilClass

(without indices for the DIVs)

doc.at('html/body/div/div/div/div/div/div/div/div/div/div/table/tr[3]/td[2]/span').inner_text
=> "2,601.01"

So, why?

Celine

Chris_Shea · 19 December 2007 23:25

At some point the path you're using fails. That's why. You could check
node by node, going one level lower each time to see where you start
getting nil from your search. And then you could see what you need to
do to fix the path. That's what I just did:

Now you look:

XPATH = 'html/body/div[1]/div[2]/div[2]/div[2]/div[1]/div[2]/div[1]/div[1]/div[1]/div[1]/table/tr[3]/td[2]/span'
doc = Hpricot(open('http://finance.yahoo.com/'))
doc.at(XPATH).inner_text # => "2,601.01"

Tools like Xpather and Firebug can give you paths, but they're not
going to work all the time. But, as I said before, there's a span with
an id attribute that lets you pluck the data without worrying about a
full path, so this is sort of moot.

HTH,
Chris
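
The node-by-node check Chris describes can be sketched as a small helper. The name `first_failing_prefix` is hypothetical and not part of Hpricot. The lookup is passed in as a block so that the sketch runs without the gem; with Hpricot, the block would presumably be `{ |prefix| doc.at(prefix) }`, since `doc.at` returned nil for the failing path in the messages above.

```ruby
# Hypothetical helper: walks the path one step deeper each time and returns
# the first prefix for which the lookup block returns nil (nil if none fail).
def first_failing_prefix(path)
  steps = path.split('/').reject(&:empty?)
  (1..steps.size).each do |depth|
    prefix = steps.first(depth).join('/')
    return prefix if yield(prefix).nil?
  end
  nil
end

# Stand-in for a parsed page: the prefixes that would match (made up).
MATCHING = ['html', 'html/body', 'html/body/div[1]'].freeze
LOOKUP = ->(prefix) { MATCHING.include?(prefix) ? :node : nil }

puts first_failing_prefix('html/body/div[1]/div[3]/span', &LOOKUP)
```

The prefix it returns is the first step that needs correcting against the downloaded source.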
