Using Nokogiri

I'm trying to scrape some data off websites using Nokogiri.

require 'rubygems'
require 'open-uri'
require 'nokogiri' #using the latest 1.4.0

url = 'http://www.whateverwebsitenameis.org'

doc = Nokogiri::HTML(open(url))

This gets me data off the website I want to scrape.

The segment of the site I want looks like this (from FF 'view source'):

···

-------------------------------------------------------------------------
<h2>Association Detail</h2>

    <div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL
DIRECTORY RESULTS</div>

1) <b>Some Institute name</b><Br><br>
2) some address<Br> city, st zip<br>
3)
4) United States <Br>
5)
6) Phone:
7)
8) (123) 456-7890<Br>
9)
10) <br>
11) Web address: <a href="Http://www.xyz.org"
target="_Blank">www.xyz.org</a><Br>

    <br><br>

    <A href="javascript:history.back();">Back to Search Results</

<br><br>

    <A href="AssociationSearch.cfm">Search Again</a>

</td>
---------------------------------------------------------------------------------

I want to scrape and collect the data between lines 1-11, i.e., name,
address, city, st, zip, United States, and phone number; from line 11 I
want the website URL: 'http://www.xyz.org'

I can find the beginning of this section of code by doing this:

doc.css('h2').each do |elem| puts elem.content end
which displays 'Association Detail'

I am having problems using this as the starting point to parse the
data in lines 1-11 which contain the specific 'Association Detail'
details. I've tried it with 'xpath' and 'search' according to the
example here: http://rdoc.info/projects/tenderlove/nokogiri

but there's something I'm just not getting right when I try to use
other elements to get the info.

My system is Windows XP, Ruby 1.8.6, Nokogiri 1.4.0

Thanks in advance for any help.

jzakiya wrote:

> I can find the beginning of this section of code by doing this:
>
> doc.css('h2').each do |elem| puts elem.content end
>
> which displays 'Association Detail'
>
> I am having problems using this as the starting point to parse the
> data in lines 1-11 which contain the specific 'Association Detail'
> details.

You aren't really searching by css, which would involve things like
searching for tags based on their 'class' attribute or 'id' attribute.
Because the <h2> tag doesn't have any attributes, you are simply
searching by tag name, so you could do this instead:

doc.xpath('//h2').each do |h2|
  puts h2.content
end

That uses xpath notation to find all h2 tags on the page. Then you
might write something like this:

doc = Nokogiri::HTML.parse(html)

doc.xpath('//h2').each do |h2|

  if h2.content == "Association Detail"
    puts "---"
    puts h2.next.content
    puts "---"
  end

end

Knowing you can do that will enable you to write something like this:

results = []

doc.xpath('//h2').each do |h2|

  if h2.content == "Association Detail"
    curr_elmt = h2

    while (curr_elmt = curr_elmt.next)
      curr_content = curr_elmt.content
      results << curr_content
      break if curr_content.include?("Web address:")
    end

  end
end

results.each do |result|
  puts "--start--"
  puts result
  puts "--end--"
  puts
end

Output:

--start--
DETAIL
DIRECTORY RESULTS
--end--

--start--
Some Institute name
--end--

--start--

--end--

--start--

--end--

--start--

   some address city, st zip

    United States

      Phone:

        (123) 456-7890
) Web address: www.xyz.orgBack to Search Results

Search Again

--end--

As you can see, the html is pretty bad, so your results aren't that
great. You will have to figure out how to extract the data you need
from those strings.
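
For example, something like this might be a starting point (just a rough
sketch; here 'results' is stubbed with sample values taken from the
output above):

# Stand-in for the 'results' array built by the loop above.
results = [
  "DETAIL DIRECTORY RESULTS",
  "Some Institute name",
  "some address city, st zip United States Phone: (123) 456-7890"
]

blob  = results.join(' ').gsub(/\s+/, ' ')   # flatten whitespace
name  = results[1]                           # "Some Institute name"
phone = blob[/\(\d{3}\)\s*\d{3}-\d{4}/]      # "(123) 456-7890"

puts name
puts phone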

···

--
Posted via http://www.ruby-forum.com/.

This should get what you want:

prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
xpaths = {
:name => "#{prefix}b/text()",
:addr => "#{prefix}text()[2]",
:citystzip => "#{prefix}text()[3]",
:country => "#{prefix}text()[4]",
:phone => "#{prefix}text()[5]",
}
xpaths.each do |data,xpath|
  puts "#{data} = " + doc.search(xpath).to_s.strip
end

-- Mark.

7stud -- wrote:

jzakiya wrote:

I'm trying to scrape some data off websites using nokogiri

I chopped off the top of my code, which looks like this:

require 'rubygems'
require 'nokogiri'

html =<<ENDOFHTML
<html>
<body>
<h2>Association Detail</h2>

    <div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL
DIRECTORY RESULTS</div>

    <b>Some Institute name</b><Br><br>
   some address<Br> city, st zip<br>

    United States <Br>

      Phone:

        (123) 456-7890<Br>

    <br>
) Web address: <a href="Http://www.xyz.org"
target="_Blank">www.xyz.org</a><Br>

    <br><br>

    <A href="javascript:history.back();">Back to Search Results</

<br><br>

doc = Nokogiri::HTML.parse(html)

<snip>

···

--
Posted via http://www.ruby-forum.com/.

Mark Thomas wrote:

This should get what you want:

prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
xpaths = {
:name => "#{prefix}b/text()",
:addr => "#{prefix}text()[2]",
:citystzip => "#{prefix}text()[3]",
:country => "#{prefix}text()[4]",
:phone => "#{prefix}text()[5]",
}
xpaths.each do |data,xpath|
  puts "#{data} = " + doc.search(xpath).to_s.strip
end

-- Mark.

I was wondering if you could answer some xpath questions? I would think
that in this xpath:

div[@class="sectionHeaderText"]/following-sibling::text()[2]

the part:

div[@class="sectionHeaderText"]/following-sibling

would be the <b> tag. Then:

div[@class="sectionHeaderText"]/following-sibling::text()

would be the <b> tag's text or "Some Institute name". So then the
following [2]:

div[@class="sectionHeaderText"]/following-sibling::text()[2]

doesn't seem applicable. And in fact, when I run your code, it doesn't
work:

addr =
citystzip =
name = Some Institute name
country =
phone =

···

===========

html =<<ENDOFHTML
<html>
<body>
<h2>Association Detail</h2>

    <div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL
DIRECTORY RESULTS</div>

    <b>Some Institute name</b><Br><br>
   some address<Br> city, st zip<br>

    United States <Br>

      Phone:

        (123) 456-7890<Br>

    <br>
) Web address: <a href="Http://www.xyz.org"
target="_Blank">www.xyz.org</a><Br>

    <br><br>

    <A href="javascript:history.back();">Back to Search Results</

<br><br>

    <A href="AssociationSearch.cfm">Search Again</a>
</body>
</html>
ENDOFHTML

doc = Nokogiri::HTML.parse(html)

prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
xpaths = {
:name => "#{prefix}b/text()",
:addr => "#{prefix}text()[2]",
:citystzip => "#{prefix}text()[3]",
:country => "#{prefix}text()[4]",
:phone => "#{prefix}text()[5]",
}
xpaths.each do |key, val|
  puts "#{key} = " + doc.search(val).to_s.strip
end

--
Posted via http://www.ruby-forum.com/.

7stud -- wrote:

7stud -- wrote:

jzakiya wrote:

Argh. Now I've chomped off the bottom of the html. This is what I used:

require 'rubygems'
require 'nokogiri'

html =<<ENDOFHTML
<html>
<body>
<h2>Association Detail</h2>

    <div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL
DIRECTORY RESULTS</div>

    <b>Some Institute name</b><Br><br>
   some address<Br> city, st zip<br>

    United States <Br>

      Phone:

        (123) 456-7890<Br>

    <br>
) Web address: <a href="Http://www.xyz.org"
target="_Blank">www.xyz.org</a><Br>

    <br><br>

    <A href="javascript:history.back();">Back to Search Results</

<br><br>

    <A href="AssociationSearch.cfm">Search Again</a>
</body>
</html>
ENDOFHTML

doc = Nokogiri::HTML.parse(html)

...rest of code

···

--
Posted via http://www.ruby-forum.com/.

7stud's approach works, but Mark's doesn't (currently).
Here's the file I created which will get me all the raw
data I want (still have to process to get to final form).

file: scrape.rb

···


-------------------------
require 'rubygems'
require 'open-uri'
require 'nokogiri'

def scrape (id)

  id = id.to_s
  url = "http://www.xyz.org/../../..ID=#{id\}&quot;
  doc = Nokogiri::HTML.parse(open(url))

  results = []

  doc.xpath('//h2').each do |h2|
    if h2.content == "Association Detail"
      curr_elmt = h2
      while (curr_elmt = curr_elmt.next)
        curr_content = curr_elmt.content.gsub(/\n|\t|\r/, '').squeeze(' ').strip
        results << curr_content unless curr_content.strip.empty?
        break if curr_content.include?("Back to Search Results")
      end
    end
  end

  results.each do |result|
    #Do while result is not a blank string
    puts "--start--"
    puts result
    puts "--end--"
  end
  return results
end
---------------------------------------

So I just 'require' this file, and can then do:

info = scrape 1234

where 'info' is the array 'results'. I can then process
that to my heart's delight.

Thanks 7stud for your help.
I would, however, like to know if Mark's way can be made to work too.

Jabari

Mark Thomas wrote:
> This should get what you want:

> prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
> xpaths = {
> :name => "#{prefix}b/text()",
> :addr => "#{prefix}text()[2]",
> :citystzip => "#{prefix}text()[3]",
> :country => "#{prefix}text()[4]",
> :phone => "#{prefix}text()[5]",
> }
> xpaths.each do |data,xpath|
> puts "#{data} = " + doc.search(xpath).to_s.strip
> end

> -- Mark.

I was wondering if you could answer some xpath questions? I would think
that in this xpath:

div[@class="sectionHeaderText"]/following-sibling::text()[2]

the part:

div[@class="sectionHeaderText"]/following-sibling

would be the <b> tag.

Not quite. following-sibling:: is an axis; it has to be followed by a
node test. Therefore following-sibling::text() is the set of all text
nodes that are later siblings of the div. After that, it's just a
matter of indexing.
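
A tiny made-up example (nothing to do with the real page) may make that
clearer:

require 'rubygems'
require 'nokogiri'

doc = Nokogiri::XML('<r><div/>one<b>two</b>three</r>')

# All text nodes that are later siblings of the div. Note that "two" is
# *inside* the <b>, so it is not a sibling and is not selected.
doc.xpath('//div/following-sibling::text()').each { |t| puts t.content }
# => one
#    three

# Indexing picks one node out of that set (positions start at 1).
puts doc.xpath('//div/following-sibling::text()[2]').to_s
# => three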

doesn't seem applicable. And in fact, when I run your code, it doesn't
work:

As I just posted in another message, it works for me. I wonder what's
different about my environment. Are you using Nokogiri 1.4.0?

···

On Nov 9, 12:37 am, 7stud -- <bbxx789_0...@yahoo.com> wrote:

7stud's approach works, but Mark's doesn't (currently).

Strange... it works for me.

mark@ubuntu:~$ ruby -v
ruby 1.8.7 (2009-06-12 patchlevel 174) [i486-linux]

Nokogiri 1.4.0
libxslt 1.1.24-2ubuntu2

Here's the entire working program:

require 'rubygems'
require 'nokogiri'

html =<<ENDOFHTML
<html>
<body>
<h2>Association Detail</h2>

    <div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL
DIRECTORY RESULTS</div>

    <b>Some Institute name</b><Br><br>
   some address<Br> city, st zip<br>

    United States <Br>

      Phone:

        (123) 456-7890<Br>

    <br>
) Web address: <a href="Http://www.xyz.org"
target="_Blank">www.xyz.org</a><Br>

    <br><br>

    <A href="javascript:history.back();">Back to Search Results</

<br><br>

    <A href="AssociationSearch.cfm">Search Again</a>
</body>
</html>
ENDOFHTML

doc = Nokogiri::HTML.parse(html)
prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
xpaths = {
:name => "#{prefix}b/text()",
:addr => "#{prefix}text()[2]",
:citystzip => "#{prefix}text()[3]",
:country => "#{prefix}text()[4]",
:phone => "#{prefix}text()[5]",
}
xpaths.each do |k,xpath|
  puts "#{k} = " + doc.search(xpath).to_s.strip
end

# Output:
addr = some address
citystzip = city, st zip
country = United States
phone = Phone:

        (123) 456-7890
name = Some Institute name

Mark Thomas wrote:

As I just posted in another message, it works for me. I wonder what's
different about my environment. Are you using Nokogiri 1.4.0?

Yes, however I get a warning message that informs me that I'm using an
outdated version of libxml2:

$ ruby -v
ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-darwin8.11.1]

$ nokogiri -v
HI. You're using libxml2 version 2.6.16 which is over 4 years old and
has
plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure,
you
upgrade your version of libxml2 and re-install nokogiri. If you like
using
libxml2 version 2.6.16, but don't like this warning, please define the
constant
I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring
nokogiri.

/usr/local/lib/ruby/gems/1.8/gems/nokogiri-1.4.0/lib/nokogiri/xml/builder.rb:272:
warning: parenthesize argument(s) for future version

···

---
nokogiri: 1.4.0
warnings:

libxml:
  compiled: 2.6.16
  loaded: 2.6.16
  binding: extension

So it could be something with that, or maybe it has something to do with
the fact that ruby 1.8.7 back ports some stuff from ruby 1.9.
--
Posted via http://www.ruby-forum.com/.

OK, when I put Mark's code in a file and ran it (versus entering it in
an irb session) it DOES work. However, it doesn't capture the website
URL, which 7stud's approach does. I haven't figured out how to do it
with this approach, and merely adding more items to xpaths doesn't
work.

So Mark, how can your approach be used to capture the URL at the end
of the data section?

Here's the file I used with Mark's approach:

file: scrape1.rb

···

On Nov 9, 10:51 pm, 7stud -- <bbxx789_0...@yahoo.com> wrote:

Mark Thomas wrote:

> As I just posted in another message, it works for me. I wonder what's
> different about my environment. Are you using Nokogiri 1.4.0?

Yes, however I get a warning message that informs me that I'm using an
outdated version of libxml2:

$ ruby -v
ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-darwin8.11.1]

$ nokogiri -v
HI. You're using libxml2 version 2.6.16 which is over 4 years old and
has
plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure,
you
upgrade your version of libxml2 and re-install nokogiri. If you like
using
libxml2 version 2.6.16, but don't like this warning, please define the
constant
I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring
nokogiri.

/usr/local/lib/ruby/gems/1.8/gems/nokogiri-1.4.0/lib/nokogiri/xml/builder.rb:272:
warning: parenthesize argument(s) for future version
---
nokogiri: 1.4.0
warnings:

libxml:
compiled: 2.6.16
loaded: 2.6.16
binding: extension

So it could be something with that, or maybe it has something to do with
the fact that ruby 1.8.7 back ports some stuff from ruby 1.9.
--
Posted via http://www.ruby-forum.com/.

---------------------
require 'rubygems'
require 'open-uri'
require 'nokogiri'

def scrape (id)

  id = id.to_s
  url = "Welcome to ASAE — American Society of Association Executives
&type=association"
  doc = Nokogiri::HTML.parse(open(url))

  prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
  xpaths = {
   :name => "#{prefix}b/text()",
   :addr => "#{prefix}text()[2]",
   :citystzip => "#{prefix}text()[3]",
   :country => "#{prefix}text()[4]",
   :phone => "#{prefix}text()[5]",
   :web => "#{prefix}text()[6]",
   :url => "#{prefix}text()[7]"
  }

  results = {}
  xpaths.each do |data,xpath|
    results[data] = doc.search(xpath).to_s.gsub(/\n|\t|\r/, '').squeeze(' ').strip
    puts "#{data} = " + results[data]
  end
  return results
end
---------------------------------

And use as before: info = scrape 1234

Jabari


$ nokogiri -v

cool! I didn't know about that.

HI. You're using libxml2 version 2.6.16 which is over 4 years old and
has
plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure,
you
upgrade your version of libxml2 and re-install nokogiri. If you like
using
libxml2 version 2.6.16, but don't like this warning, please define the
constant
I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring
nokogiri.

/usr/local/lib/ruby/gems/1.8/gems/nokogiri-1.4.0/lib/nokogiri/xml/builder.rb:272:
warning: parenthesize argument(s) for future version
---
nokogiri: 1.4.0
warnings:

libxml:
compiled: 2.6.16
loaded: 2.6.16
binding: extension

This is most likely the problem.

Mine reports:
libxml:
  loaded: 2.7.5
  binding: extension
  compiled: 2.7.5

with no warnings.

Can you install a newer version of libxml2? As you can see from the
libxml2 NEWS/changelog, your version dates back to 2004 with
tons of bug fixes (including XPath fixes) since.
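
If it's handy, the same information that 'nokogiri -v' prints should
also be available from inside Ruby (this is roughly what the command-line
tool dumps):

require 'rubygems'
require 'nokogiri'

puts Nokogiri::VERSION
p    Nokogiri::VERSION_INFO   # includes the compiled and loaded libxml2 versions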

···

On Nov 9, 10:51 pm, 7stud -- <bbxx789_0...@yahoo.com> wrote:

OK, when I put Mark's code in a file and ran it (versus entering it in
an irb session) it DOES work. However, it doesn't capture the website
URL, which 7stud's approach does. I haven't figured out how to do it
with this approach, and merely adding more items to xpaths doesn't
work.

So Mark, how can your approach be used to capture the URL at the end
of the data section?

Here's the file I used with Mark's approach:

File: scrape1.rb
----------------------------
require 'rubygems'
require 'open-uri'
require 'nokogiri'

def scrape (id)

id = id.to_s
url = "http://www.xyz.org/../../..ID=#{id\}&quot;
doc = Nokogiri::HTML.parse(open(url))

prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
xpaths = {
:name => "#{prefix}b/text()",
:addr => "#{prefix}text()[2]",
:citystzip => "#{prefix}text()[3]",
:country => "#{prefix}text()[4]",
:phone => "#{prefix}text()[5]",
:web => "#{prefix}text()[6]",
:url => "#{prefix}text()[7]"

You'll need to modify that last line. Unlike the other items, the URL
is not in a text node, it is the href attribute of the first <a>
element. So try:

    :url => "#{prefix}a[1]/@href"

Mark Thomas wrote:

···

On Nov 9, 10:51 pm, 7stud -- <bbxx789_0...@yahoo.com> wrote:

/usr/local/lib/ruby/gems/1.8/gems/nokogiri-1.4.0/lib/nokogiri/xml/builder.rb:272:
warning: parenthesize argument(s) for future version
---
nokogiri: 1.4.0
warnings:

libxml:
  compiled: 2.6.16
  loaded: 2.6.16
  binding: extension

This is most likely the problem.

Mine reports:
libxml:
  loaded: 2.7.5
  binding: extension
  compiled: 2.7.5

with no warnings.

Can you install a newer version of libxml2? As you can see from the
libxml2 NEWS/changelog, your version dates back to 2004 with
tons of bug fixes (including XPath fixes) since.

I've looked into installing newer versions of libxml2 and libxslt, but
it looks complicated and fraught with danger on Mac OS X.

--
Posted via http://www.ruby-forum.com/.

Yes, this allows me to capture the URL I want (and sometimes ones I
don't want), and I'm able to post-process the results to get everything
I need.

  xpaths = {
   :name => "#{prefix}b/text()",
   :addr => "#{prefix}text()[2]",
   :citystzip => "#{prefix}text()[3]",
   :country => "#{prefix}text()[4]",
   :phone => "#{prefix}text()[5]",
   :url => "#{prefix}a[1]/@href"
  }

Now, I just need to understand completely WHY/HOW it works. :)

Jabari

···

On Nov 10, 10:29 pm, Mark Thomas <m...@thomaszone.com> wrote:

> OK, when I put Mark's code in a file and ran it (versus entering it in
> a irb session) it DOES work. However, it doesn't capture the website
> url, which 7stud's approach does. I haven't figure out how to do it
> with this approach, and merely adding more items in xpaths doesn't
> work.

> So Mark, how can your approach be used to capture the url add the end
> of the data section?

> Here's the file I used with Mark's approach:

> File: scrape1.rb
> ----------------------------
> require 'rubygems'
> require 'open-uri'
> require 'nokogiri'

> def scrape (id)

> id = id.to_s
> url = "http://www.xyz.org/../../..ID=#{id\}&quot;
> doc = Nokogiri::HTML.parse(open(url))

> prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
> xpaths = {
> :name => "#{prefix}b/text()",
> :addr => "#{prefix}text()[2]",
> :citystzip => "#{prefix}text()[3]",
> :country => "#{prefix}text()[4]",
> :phone => "#{prefix}text()[5]",
> :web => "#{prefix}text()[6]",
> :url => "#{prefix}text()[7]"

You'll need to modify that last line. Unlike the other items, the URL
is not in a text node, it is the href attribute of the first <a>
element. So try:

    :url => "#{prefix}a[1]/@href"

Let's take the first one as an example. I noticed that everything was
after a div with the class "sectionHeaderText", so I started with
that:

//div[@class="sectionHeaderText"]

The double slash is a wildcard that means the div can be anywhere. The
part in brackets is called a predicate, and it constrains the
expression. I like to think of it as a "such that" clause. So you can
read the above as "a div such that the class is
'sectionHeaderText'." (Actually, it's the set of all divs for which it
is true, so if you had multiple divs with the same class, it would
return them all)

Then I noticed that the items you wanted were not children of the div.
The div closes before you get to the text you want. Even <br> tags are
considered to be <br/> which are self-closing. Therefore almost
everything you want is at the same nesting depth, or in XPath
terminology, they are siblings. The "following-sibling" is an XPath
"axis" (see the W3C Schools XPath tutorial for details on these
things). The name though was inside a <b> element so I used the XPath
expression to get the following sibling that happens to be a <b>
element:

//div[@class="sectionHeaderText"]/following-sibling::b

Then, the way you get the text from within a node is the XPath text()
node test, which matches all the text between tags, including whitespace.

//div[@class="sectionHeaderText"]/following-sibling::b/text()

And there you have the name.

Now, the other things were text nodes between <br> elements. You could
pull them all by asking for the set of text node siblings of the div:

//div[@class="sectionHeaderText"]/following-sibling::text()

But when you get more stuff than you want like that, you can index
them like an array:

//div[@class="sectionHeaderText"]/following-sibling::text()[2]

and that happens to pull the street address.
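
If you want to see how those positions line up, you can print each
sibling text node with its index (a quick sketch, assuming doc holds the
parsed sample page):

nodes = doc.xpath('//div[@class="sectionHeaderText"]/following-sibling::text()')

nodes.each_with_index do |node, i|
  # XPath positions are 1-based, so text()[2] is the node shown as position 2,
  # which is the street address in this sample.
  puts "text()[#{i + 1}]: #{node.content.strip.inspect}"
end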

So hopefully you see how the XPaths were put together. Usually they
are a bit simpler, but like 7stud said, it was pretty crappy HTML.

-- Mark.

···


jzakiya wrote:

Now, I just need to understand completely WHY/HOW it works. :)

Here is a pretty good basic XPath tutorial:

http://www.w3schools.com/XPath/xpath_nodes.asp

···

--
Posted via http://www.ruby-forum.com/.