Help with net/http

A_Mcbomb · 9 December 2010 20:43

I am trying to screen scrape a webpage and pull out the name, address,
city, state, zip and phone on a site that lists apartments for rent.

Here is my code:

------------------------
require 'net/http'
require 'uri'

temparray = []

url = URI.parse("http://www.apartment-directory.info")
res = Net::HTTP.start(url.host, url.port) { |http|
  http.get('/connecticut/0')
}
# puts res.body

# Keep only the lines that contain a table cell, with double quotes stripped.
res.body.each_line { |line|
  line.gsub!(/\"/, '')
  temparray.push(line) if line =~ /<td\svalign=top/
}

# Strip the unwanted markup from each collected line.
temparray.each do |j|
  # j.gsub!(/<a\shref=\/map.*<\/a>/,'')
  j.gsub!(/\shref=\/map\//,'')
  j.gsub!(/\d+\sclass=map>Map\&nbsp\;It!/,'')
  j.gsub!(/<\/td>/,'')
  j.gsub!(/<td\svalign=top>/, '')
  j.gsub!(/<td\svalign=top\snowrap>/, '')
  j.gsub!(/<tr\sbgcolor=white>/, '<br>')
  j.gsub!(/MapIt!/, ', ')
  j.gsub!(/\(/, ', (')
  j.gsub!(/<\/tr>/,'')

  puts j
end
----------------------
I am able to grab the HTML from the page, I then gsub! out a " sign
then push each line that starts with <td valign=top onto an array. I
then iterate through the array and try to remove what I don't want with
more gsub! commands. The output from this still has HTML tags on it and
looks good if I output it to a html page (you can see the output here:
http://www.holy-name.org/ct.html) but I really need to remove the HTML
tags and get just the important facts into a CSV file. Since there are 4
elements in the array for each record, the only way I could get it to
work on a web page was to add a <br> between records.

Is there a better way to pull out the pertinent info and avoid all the
HTML tags?

thanks

atomic

--
Posted via http://www.ruby-forum.com/.

Alex_Stahl · 9 December 2010 21:02

Nokogiri provides a great interface for accessing the data trapped
inside markup.

Try something like:

page = Nokogiri::HTML res.body
data = []
page.xpath("//xpath/to/table").each do |node|
  data << node.xpath("./rel/xpath/to/data/text()")
end

________________________________________________________________________

Alex Stahl | Sr. Quality Engineer | hi5 Networks, Inc. | astahl@hi5.com

A_Mcbomb · 10 December 2010 05:28

Thanks Alex.

I tried following the instructions to install the Nokogiri gem but it
gave me a few errors. I tried linking the libraries during the install:

[server01][/]$ gem install nokogiri -- --with-xml2-lib=/usr/local/lib
--with-xml2-include=/usr/local/include/libxml2
--with-xslt-lib=/usr/local/lib
--with-xslt-include=/usr/local/include/libxslt
Building native extensions. This could take a while...
Successfully installed nokogiri-1.4.4
1 gem installed
Installing ri documentation for nokogiri-1.4.4...

No definition for get_options

No definition for set_options

No definition for parse_memory

No definition for parse_file

No definition for parse_with
Installing RDoc documentation for nokogiri-1.4.4...

No definition for get_options

No definition for set_options

No definition for parse_memory

No definition for parse_file

No definition for parse_with

---

As a test, I created a test file with the following code:

require 'open-uri'
doc = Nokogiri::HTML(open("http://www.anysite.com/"))

But when I run it, I get the following, so I don't think the gem is
installed correctly:

[server01][/usr/bin]$ ./test.rb
./test.rb:7: uninitialized constant Nokogiri (NameError)

Would you be able to suggest anything to help me get Nokogiri installed
and working?

thanks

atomic

Jesus_Gabriel_y_Gala · 10 December 2010 08:19

You have to require 'nokogiri'

Jesus.

A_Mcbomb · 10 December 2010 09:48

I didn't realize that, Jesus, but it didn't help with my installation.
When I run the test script, here's what I get:

[server01][/usr/bin]$ ./test.rb
./test.rb:6:in `require': no such file to load -- nokogiri (LoadError)
from ./test.rb:6

thanks

atomic

Jesus_Gabriel_y_Gala · 10 December 2010 09:52

Did you require rubygems, before requiring nokogiri? The typical ways are:

export RUBYOPT=rubygems

or calling ruby -rubygems ./test.rb

or adding require 'rubygems' to your script (there have been
discussions here about why this is not recommended, especially for
library code)

In general, to use a gem you have to require rubygems before requiring the gem.
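
For the third option, one pattern that limits the downside is to try a plain require first and load RubyGems only when that fails. This is a minimal sketch, not something posted in the thread; the helper name `require_with_rubygems` is made up for illustration.

```ruby
# Try a plain require first; load RubyGems only if the library is not
# already on the load path (on Ruby 1.8 RubyGems is not loaded automatically).
def require_with_rubygems(name)
  require name
rescue LoadError
  require 'rubygems'
  require name
end

# 'open-uri' ships with Ruby, so this call succeeds without RubyGems.
require_with_rubygems 'open-uri'
```

In the test script the same helper would be called as `require_with_rubygems 'nokogiri'`.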

Jesus.

A_Mcbomb · 10 December 2010 10:39

That definitely helped, Jesus... thanks.

Here is what I get now when I run the test script:

[server01][/usr/bin]$ ./test.rb
HI. You're using libxml2 version 2.6.16 which is over 4 years old and has
plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure, you
upgrade your version of libxml2 and re-install nokogiri. If you like using
libxml2 version 2.6.16, but don't like this warning, please define the constant
I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring nokogiri.

This sounds like I should upgrade....would you recommend just simply
replacing the one file (libxml2) and then reinstall the gem or are there
more files that I should replace such as libxslt as well?

thank you so much for helping me to get this going Jesus!

atomic

Jesus_Gabriel_y_Gala · 10 December 2010 11:10

What OS (and version) are you on? I have a pretty old version of
Ubuntu (8.10) and have libxml2.so.2.6.32.
To correctly upgrade a library, please use your OS facilities (apt,
yum or whatever).

Jesus.

A_Mcbomb · 10 December 2010 12:47

Here's what my server is running:

Linux version 2.6.9-42.0.3.EL.wh1smp (root@wdl70144) (gcc version 3.4.6
20060404 (Red Hat 3.4.6-11)) #1 SMP Fri Aug 14 15:48:17 MDT 2009

The problem I run into is that since this is a shared hosting server,
they don't allow me to add RPMs to the server.

Do you know of a way to update the library with a binary file, for
instance?
Which library do I need, so I can look around?

thanks again Jesus

atomic

Scott_Hill · 10 December 2010 19:52

Installing gems local to your user account might help get around some
issues. Depending on what your host allows/supports, you might look into
using RVM to manage your Ruby installations and gems.

--Scott

Jesus_Gabriel_y_Gala · 10 December 2010 22:38

The problem is not the gem, it is the libxml2 dependency. I don't know
how to install a library locally for RedHat, maybe the OP can
investigate that. And then you have to tell nokogiri to use that local
version. I haven't looked into it, maybe it's easy, I don't know.
Maybe some other person on the list can help the OP further.

Jesus.

A_Mcbomb · 11 December 2010 00:38

I got one of my servers updated and I'm now running Nokogiri without
errors which is great news.

Here is my new code:

-------------------
require 'net/http'
require 'uri'
require 'nokogiri'

url = URI.parse("http://www.apartment-directory.info")
res = Net::HTTP.start(url.host, url.port) { |http|
  http.get('/connecticut/0')
}

page = Nokogiri::HTML res.body
page.xpath("//tr//td/a").each do |node|
  puts node.text
end
-----------------
This returns some of the data that I need but not all of it.
I do not understand this line:

page.xpath("//tr/td")

I know it is supposed to be the path to the data I need but I'm not sure
how I can get to all the data I need from the URL, it seems like some of
the data is between tags that I can't figure out.

This is one record from the webpage in HTML:
-----
<tr bgcolor=white><td valign=top><a
href="/map/22-glenbrook-road-condo-associate/stamford-connecticut-06902-(203)327-4028/14741"
title="Condominium Office Rental and Leasing, Condominiums and
Townhouses, Condominium and Townhouse Rental and Leasing ">22 Glenbrook
Road Condo Associates</a></td><td valign=top>
<a
href="/map/22-glenbrook-road-condo-associate/stamford-connecticut-06902-(203)327-4028/14741"
class=map>Map It!</a>  </td><td valign=top>22 Glenbrook
Road</td>
<td valign=top>Stamford,  CT  06902</td>
<td valign=top nowrap>(203) 327-4028</td></tr>
-----
I need to be able to get the following information for one record out:

22 Glenbrook Road Condo Associates,22 Glenbrook Road,Stamford,CT,06902,(203) 327-4028

I thought that if I configured Nokogiri with:
page.xpath("//tr/td")

...that it would get me inside these table tags, but it's not working.

Can you possibly point out where I'm going wrong?

thanks for the help,

atomic

A_Mcbomb · 11 December 2010 00:46

Hang on! It is working now. As I was writing my last post, I realized I
had been using:

page.xpath("//tr//td/a") and changed it to page.xpath("//tr/td")

and tried that after my last post.

I get the following output, which is good except for the "Â"
characters. What is the best way to get rid of those and combine each
record on one line, separated only by commas?

MapÂ It!Â Â

90 Gerrish Avenue

East Haven,Â Â CTÂ Â 06512

(203) 466-2605

Avalon Bay Communities

MapÂ It!Â Â

66 Glenbrook Road No. 200

Stamford,Â Â CTÂ Â 06902

(203) 357-0986

Avalon Grove Luxury Apartments

thanks again,

atomic
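
The "Â" characters are most likely the page's `&nbsp;` entities (visible in the regex from the first post). The parser returns them as the non-breaking space U+00A0, whose two UTF-8 bytes show up as "Â" plus a space on a Latin-1 terminal. A rough sketch of cleaning them up and joining one record into a CSV line, assuming Ruby 1.9 or later (for the `\u` escape) and the standard `csv` library. The helper names `clean`, `split_locality` and `record_to_csv` are invented for this example.

```ruby
require 'csv'

NBSP = "\u00A0" # non-breaking space, what &nbsp; becomes after parsing

# Replace non-breaking spaces with plain ones and collapse runs of
# whitespace. Ruby's \s does not match U+00A0, hence the explicit gsub.
def clean(text)
  text.gsub(NBSP, ' ').gsub(/\s+/, ' ').strip
end

# Split "Stamford,  CT  06902" into ["Stamford", "CT", "06902"].
def split_locality(text)
  city, rest = clean(text).split(',', 2)
  state, zip = rest.to_s.split(' ')
  [city, state, zip]
end

# Build one CSV line from the text of a row's cells:
# name, street address, "city, state zip", phone.
def record_to_csv(name, address, locality, phone)
  CSV.generate_line([clean(name), clean(address),
                     *split_locality(locality), clean(phone)])
end

puts record_to_csv('Avalon Bay Communities',
                   '66 Glenbrook Road No. 200',
                   "Stamford,#{NBSP}#{NBSP}CT#{NBSP}#{NBSP}06902",
                   '(203) 357-0986')
```

`CSV.generate_line` also quotes any field that contains a comma, which a plain `join(',')` would not.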

Brabuhr · 11 December 2010 01:28

You might also consider the mechanize library:
http://mechanize.rubyforge.org/mechanize/GUIDE_rdoc.html

e.g.

require 'rubygems'
require 'mechanize'

Mechanize.new.get("http://www.apartment-directory.info/alabama/0") do |page|
  page.search('//tr').each do |tr|
    tds = tr.search('./td')
    puts tds[0].text.chomp rescue nil
    puts tds[2].text.chomp rescue nil
    puts tds[3].text.chomp rescue nil
    puts
  end
end

This sample script as-is is too greedy; it loops over every row of
every table instead of just the interesting one.

$ ruby i.rb
[some garbage from other tables]
...
Aquadome Apartment
1619 8th Street Southwest
Decatur, AL 35601

Arbor Park Apartments
175 Sloan Avenue East
Talladega, AL 35160

Arbor Place Apartments
515 Fox Run Parkway No. 9A
Opelika, AL 36801

Arbor Pointe Apartments
100 Dairy Road
Mobile, AL 36612

Arboretum Apartments
1800 Arboretum Circle
Birmingham, AL 35216

Arbors On Taylor
485 Taylor Road
Montgomery, AL 36117

Arrow Head Apartments
129 South Union Avenue
Ozark, AL 36360
...

Hassan_Schroeder · 11 December 2010 01:35

> page = Nokogiri::HTML res.body
> page.xpath("//tr//td/a").each do |node|
> puts node.text
> end
> -----------------
> This returns some of the data that I need but not all of it.
> I do not understand this line:
>
> page.xpath("//tr/td")

That's not what you're using.

> I know it is supposed to be the path to the data I need but I'm not sure
> how I can get to all the data I need from the URL, it seems like some of
> the data is between tags that I can't figure out.

If you *did* use `//tr/td` you *would* get all the information in the table,
only some of which is within anchor (a) tags.

--
Hassan Schroeder ------------------------ hassan.schroeder@gmail.com
twitter: @hassan

A_Mcbomb · 11 December 2010 01:52

I never heard of mechanize but I see from the doco that it requires
Nokogiri to run. I have a working copy of Nokogiri and just did a
successful 'gem install mechanize' but when I ran your basic script, I
get:

[root@trebek2 bin]# ./mechanize.rb
./mechanize.rb:7: uninitialized constant Mechanize (NameError)
        from
/usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in
`gem_original_require'
        from
/usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in
`require'
        from ./mechanize.rb:5

Am I missing something?

atomic

Brabuhr · 11 December 2010 03:46

Hmm... I'm not sure there... oh wait... Could it be confused since
your file is (also) named 'mechanize'?:

$ cp i.rb mechanize.rb
$ ruby mechanize.rb
./mechanize.rb:5: uninitialized constant Mechanize (NameError)
  from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in
`gem_original_require'
  from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in `require'
  from mechanize.rb:3

Yeah, I think that's it. Try renaming your script.
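
The likely mechanism: on Ruby 1.8 the current directory is part of `$LOAD_PATH`, so `require 'mechanize'` can resolve to a script called `mechanize.rb` before the gem is considered. The sketch below imitates that lookup in a deliberately simplified way; `resolve_require` is a hypothetical helper, not part of Ruby or RubyGems.

```ruby
require 'tmpdir'

# Simplified stand-in for how `require` searches the load path: return the
# first "<name>.rb" found, or nil. (The real `require` also handles
# extensions, gems and already-loaded features.)
def resolve_require(name, load_path = $LOAD_PATH)
  load_path.each do |dir|
    candidate = File.join(dir.to_s, "#{name}.rb")
    return candidate if File.file?(candidate)
  end
  nil
end

Dir.mktmpdir do |dir|
  # A script that happens to share its name with the gem it wants to load.
  File.open(File.join(dir, 'mechanize.rb'), 'w') { |f| f.puts "require 'mechanize'" }

  # With the script's directory searched first (as "." was on Ruby 1.8),
  # the script itself is what the name 'mechanize' resolves to.
  puts resolve_require('mechanize', [dir] + $LOAD_PATH)
end
```

Giving the script any name that does not match a library it requires avoids the clash.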

A_Mcbomb · 11 December 2010 13:01

I renamed the script and it worked!
Pretty nice....thanks.

My last question though is, what is the easiest way to get rid of all
the garbage that I don't want from the other tables?

thanks a lot

atomic

Brabuhr · 11 December 2010 19:27

> My last question though is, what is the easiest way to get rid of all
> the garbage that I don't want from the other tables?

Try to narrow down the xpath used to pull stuff out. E.g.

//table[2]/tbody/tr

to only get the rows from the second table.
