Nokogiri not pulling correct XPath

Scott_B1 · 28 February 2011 09:28

Hi everyone,

I was wondering if anyone could help me. I'm trying to pull text from a
website using Nokogiri, but not all of the text is being pulled into my
variables through XPath.

I have used Firebug (Firefox extension) to pull the correct XPath from
the page so I'm thinking it should be correct. So far, I have:

variable1 =
(doc/"/html/body/div[2]/div[7]/div[4]/div[3]/div[6]/div/div/div/div/div/div/div/h2").inner_html

variable2 =
(doc/"/html/body/div[3]/div[7]/div[4]/div[3]/div[6]/div/div/div/div/div/div[2]/table/tbody/tr/td[2]/strong").inner_html

variable3 =
(doc/"/html/body/div[3]/div[7]/div[4]/div[3]/div[6]/div/div/div/div/div/div[2]/table/tbody/tr/td[2]/strong[2]").inner_html

Now, variable1 is working but I can't get any values out of variable2 or
variable3. Is there a different syntax I should be using? To test, I've
only been outputting to the CLI, but I eventually want to push these into
a SQLite3 database.

Anyone have any ideas?
Cheers.

Scott.

--
Posted via http://www.ruby-forum.com/.

Luis_G · 28 February 2011 10:50

Hello...

I've been using Nokogiri for a while and I never had problems with it.
It works great.

I have some questions for you. Why do you use the full path to the h2
tag? Does the h2 have a class or an id defined? What about the divs in
between: do they have a class or an id defined?

I'm asking because you can access the inner_html of an HTML tag like
this:

doc.xpath("//div[@class='(class of the div here)']/h2").each do |node|
  var = node.inner_html
end

You don't really need to give the full path to the HTML tag. You can also
use //div[@id='(id of the div here)'], for example.

The other variables are probably not working because you missed a div or
something else in between. I think the approach shown above makes it
easier to get the HTML content without making mistakes.

If you want, just let me know the URL you want to get the content from
and I'll build a small script to do that.

Regards,

Luis Goncalves

Robert_K1 · 28 February 2011 12:20

First I would dump the page _as loaded by your program_ (this is
important) to disk and verify that those XPaths do work independently
(e.g. with Firefox's DOM Inspector or Eclipse XML tools).

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Eric_Christopherson · 28 February 2011 15:06

In my experience, Firebug shows a tbody element as part of the xpath,
even if there is no actual tbody tag in the HTML. In that case,
Nokogiri will fail to find the right element unless you take out the
'tbody/'.

Scott_B1 · 1 March 2011 22:51

Thanks guys for the help. In the end, I think it had more to do with the
tbody than anything. I still couldn't get it working with XPath, however,
so I used CSS selectors and was able to get it working that way (albeit in
a roundabout fashion using an array).

Cheers.

Scott.
