Help with net/http

I am trying to screen scrape a webpage and pull out the name, address,
city, state, zip and phone on a site that lists apartments for rent.

Here is my code:

------------------------
require 'net/http'
require 'uri'

temparray = Array.new

url = URI.parse("http://www.apartment-directory.info")
res = Net::HTTP.start(url.host, url.port) {|http|
  http.get('/connecticut/0')
}
# puts res.body

# Keep only the lines that contain a table cell, minus their double quotes
res.body.each_line {|line|
  line.gsub!(/\"/, '')
  temparray.push(line) if line =~ /<td\svalign=top/
}

# Strip the unwanted markup from each line that was kept
temparray.each do |j|
  # j.gsub!(/<a\shref=\/map.*<\/a>/,'')
  j.gsub!(/\shref=\/map\//,'')
  j.gsub!(/\d+\sclass=map>Map\&nbsp\;It!/,'')
  j.gsub!(/<\/td>/,'')
  j.gsub!(/<td\svalign=top>/, '')
  j.gsub!(/<td\svalign=top\snowrap>/, '')
  j.gsub!(/<tr\sbgcolor=white>/, '<br>')
  j.gsub!(/MapIt!/, ', ')
  j.gsub!(/\(/, ', (')
  j.gsub!(/<\/tr>/,'')

  puts j
end
----------------------
I am able to grab the HTML from the page. I then gsub! out the " characters
and push each line that contains <td valign=top onto an array. I then
iterate through the array and try to remove what I don't want with more
gsub! calls. The output still has HTML tags in it and looks fine if I
write it to an HTML page (you can see the output here:
http://www.holy-name.org/ct.html), but I really need to remove the HTML
tags and get just the important facts into a CSV file. Since there are 4
elements in the array for each record, the only way I could get it to
work on a web page was to add a <br> between records.

Is there a better way to pull out the pertinent info and avoid all the
HTML tags?

thanks

atomic

--
Posted via http://www.ruby-forum.com/.

Nokogiri provides a great interface for accessing the data trapped
inside markup.

Try something like:

require 'nokogiri'

page = Nokogiri::HTML res.body
data = []
# Both XPath expressions are placeholders for the real paths on the page
page.xpath("//xpath/to/table").each do |node|
  data << node.xpath("./rel/xpath/to/data/text()")
end

________________________________________________________________________

Alex Stahl | Sr. Quality Engineer | hi5 Networks, Inc. | astahl@hi5.com

On Thu, 2010-12-09 at 14:43 -0600, Atomic Bomb wrote:

I am trying to screen scrape a webpage and pull out the name, address,
city, state, zip and phone on a site that lists apartments for rent.

Thanks Alex.

I tried following the instructions to install the Nokogiri gem but it
gave me a few errors. I tried linking the libraries during the install:

[server01][/]$ gem install nokogiri -- --with-xml2-lib=/usr/local/lib
--with-xml2-include=/usr/local/include/libxml2
--with-xslt-lib=/usr/local/lib
--with-xslt-include=/usr/local/include/libxslt
Building native extensions. This could take a while...
Successfully installed nokogiri-1.4.4
1 gem installed
Installing ri documentation for nokogiri-1.4.4...

No definition for get_options

No definition for set_options

No definition for parse_memory

No definition for parse_file

No definition for parse_with
Installing RDoc documentation for nokogiri-1.4.4...

No definition for get_options

No definition for set_options

No definition for parse_memory

No definition for parse_file

No definition for parse_with

---

As a test, I created a test file with the following code:

   require 'open-uri'
   doc = Nokogiri::HTML(open("http://www.anysite.com/"))

But when I run it, I get the following, so I don't think the gem is
installed correctly:

[server01][/usr/bin]$ ./test.rb
./test.rb:7: uninitialized constant Nokogiri (NameError)

Would you be able to suggest anything to help me get Nokogiri installed
and working?

thanks

atomic

You have to require 'nokogiri'

Jesus.

On Fri, Dec 10, 2010 at 6:28 AM, A. Mcbomb <atomicmcbomb@gmail.com> wrote:

But when I run it, I get the following, so I don't think the gem is
installed correctly:

[server01][/usr/bin]$ ./test.rb
./test.rb:7: uninitialized constant Nokogiri (NameError)

Would you be able to suggest anything to help me get Nokogiri installed
and working?

I didn't realize that, Jesus, but it didn't help with my installation.
When I run the test script, here's what I get:

[server01][/usr/bin]$ ./test.rb
./test.rb:6:in `require': no such file to load -- nokogiri (LoadError)
        from ./test.rb:6

thanks

atomic

Did you require rubygems before requiring nokogiri? The typical ways are:

export RUBYOPT=rubygems

or calling ruby -rubygems ./test.rb

or adding require 'rubygems' to your script (there have been
discussions here about why this is not recommended, especially for
library code)

In general, to use a gem you have to require rubygems before requiring the gem.
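
The two errors seen so far point at different problems: "uninitialized constant Nokogiri (NameError)" means the library was never required, while "no such file to load -- nokogiri (LoadError)" means Ruby could not find it, which on 1.8 often means RubyGems was not loaded first. A minimal sketch for checking this from a script; `loadable?` is a hypothetical helper written for this example, not part of Ruby or RubyGems:

```ruby
# Hypothetical helper: report whether a library can be found and loaded
def loadable?(name)
  require name
  true
rescue LoadError
  false
end

# On Ruby 1.8 RubyGems has to be loaded before any gem can be required
require 'rubygems'

%w[rubygems nokogiri].each do |lib|
  status = loadable?(lib) ? "ok" : "not found on the load path"
  puts "#{lib}: #{status}"
end
```

On Ruby 1.9 and later RubyGems is loaded automatically, so the explicit require is harmless there.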

Jesus.

On Fri, Dec 10, 2010 at 10:48 AM, A. Mcbomb <atomicmcbomb@gmail.com> wrote:

I didn't realize that, Jesus, but it didn't help with my installation.
When I run the test script, here's what I get:

[server01][/usr/bin]$ ./test.rb
./test.rb:6:in `require': no such file to load -- nokogiri (LoadError)
from ./test.rb:6

That definitely helped, Jesus....thanks.

Here is what I get now when I run the test script:

[server01][/usr/bin]$ ./test.rb
HI. You're using libxml2 version 2.6.16 which is over 4 years old and has
plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure, you
upgrade your version of libxml2 and re-install nokogiri. If you like using
libxml2 version 2.6.16, but don't like this warning, please define the constant
I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring
nokogiri.

This sounds like I should upgrade. Would you recommend simply replacing
the one library (libxml2) and then reinstalling the gem, or are there
more libraries that I should replace, such as libxslt?

thank you so much for helping me to get this going Jesus!

atomic

What OS (and version) are you on? I have a pretty old version of
Ubuntu (8.10) and have libxml2.so.2.6.32.
To correctly upgrade a library, please use your OS facilities (apt,
yum or whatever).

Jesus.

On Fri, Dec 10, 2010 at 11:39 AM, A. Mcbomb <atomicmcbomb@gmail.com> wrote:

This sounds like I should upgrade. Would you recommend simply replacing
the one library (libxml2) and then reinstalling the gem, or are there
more libraries that I should replace, such as libxslt?

thank you so much for helping me to get this going Jesus!

Here's what my server is running:

Linux version 2.6.9-42.0.3.EL.wh1smp (root@wdl70144) (gcc version 3.4.6
20060404 (Red Hat 3.4.6-11)) #1 SMP Fri Aug 14 15:48:17 MDT 2009

The problem I run into is that since this is a shared hosting server,
they don't allow me to add RPMs to the server.

Do you know of a way to update the library, with a binary file for
instance?
What library do I need so I can look around?

thanks again Jesus

atomic

Installing gems local to your user account might help get around some
issues. Depending on what your host allows/supports, you might look into
using RVM to manage your Ruby installations and gems.

--Scott

On Fri, Dec 10, 2010 at 4:47 AM, A. Mcbomb <atomicmcbomb@gmail.com> wrote:

Here's what my server is running:

Linux version 2.6.9-42.0.3.EL.wh1smp (root@wdl70144) (gcc version 3.4.6
20060404 (Red Hat 3.4.6-11)) #1 SMP Fri Aug 14 15:48:17 MDT 2009

The problem I run into is that since this is a shared hosting server,
they don't allow me to add RPMs to the server.

The problem is not the gem, it's the libxml2 dependency. I don't know
how to install a library locally on Red Hat; maybe the OP can
investigate that. You then have to tell nokogiri to use that local
version. I haven't looked into it; maybe it's easy, I don't know.
Maybe some other person on the list can help the OP further.

Jesus.

On Fri, Dec 10, 2010 at 8:52 PM, Scott Hill <stmpjmpr@gmail.com> wrote:

Installing gems local to your user account might help get around some
issues. Depending on what your host allows/supports, you might look into
using RVM to manage your Ruby installations and gems.

I got one of my servers updated and I'm now running Nokogiri without
errors which is great news.

Here is my new code:

-------------------
require 'net/http'
require 'uri'
require 'nokogiri'

url = URI.parse("http://www.apartment-directory.info")
res = Net::HTTP.start(url.host, url.port) {|http|
  http.get('/connecticut/0')
}

page = Nokogiri::HTML res.body
page.xpath("//tr//td/a").each do |node|
  puts node.text
end
-----------------
This returns some of the data that I need but not all of it.
I do not understand this line:

   page.xpath("//tr/td")

I know it is supposed to be the path to the data I need but I'm not sure
how I can get to all the data I need from the URL, it seems like some of
the data is between tags that I can't figure out.

This is one record from the webpage in HTML:
-----
<tr bgcolor=white><td valign=top><a
href="/map/22-glenbrook-road-condo-associate/stamford-connecticut-06902-(203)327-4028/14741"
title="Condominium Office Rental and Leasing, Condominiums and
Townhouses, Condominium and Townhouse Rental and Leasing ">22 Glenbrook
Road Condo Associates</a></td><td valign=top>
<a
href="/map/22-glenbrook-road-condo-associate/stamford-connecticut-06902-(203)327-4028/14741"
class=map>Map&nbsp;It!</a>&nbsp;&nbsp;</td><td valign=top>22 Glenbrook
Road</td>
<td valign=top>Stamford,&nbsp;&nbsp;CT&nbsp;&nbsp;06902</td>
<td valign=top nowrap>(203) 327-4028</td></tr>
-----
I need to be able to get the following information out for one record:

22 Glenbrook Road Condo Associates,22 Glenbrook Road,Stamford,CT,06902,(203) 327-4028

I thought that if I configured Nokogiri with:
    page.xpath("//tr/td")

..that it would get me inside these table tags but it's not working.

Can you possibly point out where I'm going wrong?

thanks for the help,

atomic

Hang on! It is working now. As I was writing my last post, I realized I
had been using:

page.xpath("//tr//td/a") and changed it to page.xpath("//tr/td")

and tried that after my last post.

I get the following output, which is good except for the Â characters.
What is the best way to get rid of those and combine each record on one
line, separated only by commas?

Map It! Â

90 Gerrish Avenue

East Haven,  CT  06512

(203) 466-2605

Avalon Bay Communities

Map It! Â

66 Glenbrook Road No. 200

Stamford,  CT  06902

(203) 357-0986

Avalon Grove Luxury Apartments

thanks again,

atomic
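
A possible approach to both questions, offered as a sketch rather than a tested fix for this site: the stray Â is most likely the page's &nbsp; (a non-breaking space, U+00A0) being displayed in the wrong encoding, so replacing that character before printing should remove it. The helper name `record_fields`, the sample cells, and the assumption that every row has exactly five cells (name, map link, street, city/state/zip, phone) are illustrative only. Ruby's standard csv library handles the joining and any quoting:

```ruby
require 'csv'

# Hypothetical helper: turn the five cell texts of one row into CSV fields
def record_fields(cells)
  # Swap non-breaking spaces for plain ones, then collapse runs of whitespace
  clean = cells.map { |c| c.gsub("\u00A0", " ").gsub(/\s+/, " ").strip }
  name, _map_link, street, locality, phone = clean

  # Split "Stamford, CT 06902" into city, state and zip; keep it whole if it does not match
  m = locality.to_s.match(/\A(.+?),\s*([A-Z]{2})\s+(\d{5})\z/)
  city, state, zip = m ? m.captures : [locality, nil, nil]

  [name, street, city, state, zip, phone]
end

# Sample cells modelled on the output above, with U+00A0 where the page has &nbsp;
cells = [
  "Avalon Bay Communities",
  "Map\u00A0It!\u00A0\u00A0",
  "66 Glenbrook Road No. 200",
  "Stamford,\u00A0\u00A0CT\u00A0\u00A006902",
  "(203) 357-0986"
]

puts CSV.generate_line(record_fields(cells))
```

The cells would come from something like tr.xpath("./td").map { |td| td.text } for each row, instead of printing every cell on its own line.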

You might also consider the mechanize library:
http://mechanize.rubyforge.org/mechanize/GUIDE_rdoc.html

e.g.

require 'rubygems'
require 'mechanize'

Mechanize.new.get("http://www.apartment-directory.info/alabama/0") do |page|
  page.search('//tr').each do |tr|
    tds = tr.search('./td')
    puts tds[0].text.chomp rescue nil
    puts tds[2].text.chomp rescue nil
    puts tds[3].text.chomp rescue nil
    puts
  end
end

This sample script as-is is too greedy; it loops over every row of
every table instead of just the interesting one.

$ ruby i.rb
[some garbage from other tables]
...
Aquadome Apartment
1619 8th Street Southwest
Decatur, AL 35601

Arbor Park Apartments
175 Sloan Avenue East
Talladega, AL 35160

Arbor Place Apartments
515 Fox Run Parkway No. 9A
Opelika, AL 36801

Arbor Pointe Apartments
100 Dairy Road
Mobile, AL 36612

Arboretum Apartments
1800 Arboretum Circle
Birmingham, AL 35216

Arbors On Taylor
485 Taylor Road
Montgomery, AL 36117

Arrow Head Apartments
129 South Union Avenue
Ozark, AL 36360
...

page = Nokogiri::HTML res.body
page.xpath("//tr//td/a").each do |node|
  puts node.text
end
-----------------
This returns some of the data that I need but not all of it.
I do not understand this line:

page.xpath("//tr/td")

That's not what you're using.

I know it is supposed to be the path to the data I need but I'm not sure
how I can get to all the data I need from the URL, it seems like some of
the data is between tags that I can't figure out.

If you *did* use `//tr/td` you *would* get all the information in the table,
only some of which is within anchor (a) tags.

On Fri, Dec 10, 2010 at 4:38 PM, A. Mcbomb <atomicmcbomb@gmail.com> wrote:

--
Hassan Schroeder ------------------------ hassan.schroeder@gmail.com
twitter: @hassan

I never heard of mechanize but I see from the doco that it requires
Nokogiri to run. I have a working copy of Nokogiri and just did a
successful 'gem install mechanize' but when I ran your basic script, I
get:

[root@trebek2 bin]# ./mechanize.rb
./mechanize.rb:7: uninitialized constant Mechanize (NameError)
        from
/usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in
`gem_original_require'
        from
/usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in
`require'
        from ./mechanize.rb:5

Am I missing something?

atomic

Hmm... I'm not sure there... oh wait... Could it be confused since
your file is (also) named 'mechanize'?:

$ cp i.rb mechanize.rb
$ ruby mechanize.rb
./mechanize.rb:5: uninitialized constant Mechanize (NameError)
  from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in
`gem_original_require'
  from /Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in `require'
  from mechanize.rb:3

Yeah, I think that's it. Try renaming your script.
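
Why the name matters, as far as I understand it: Ruby 1.8 keeps the current directory on its load path, so require 'mechanize' can find ./mechanize.rb (the script itself) before RubyGems gets a chance to look for the gem. The sketch below is a simplified model of that lookup, not the real implementation; `first_match` is a hypothetical helper, and the real require also handles compiled extensions and installed gems:

```ruby
require 'tmpdir'

# Simplified model of require: the first "<name>.rb" found on the load path wins
def first_match(name, load_path)
  load_path.map { |dir| File.join(dir, "#{name}.rb") }.find { |f| File.file?(f) }
end

Dir.mktmpdir do |dir|
  script = File.join(dir, "mechanize.rb")
  File.write(script, "# the user's own script\n")

  # The script shadows the library: the lookup returns the script's own path
  puts first_match("mechanize", [dir]).inspect

  # After renaming, nothing matches here, so the gem can be found instead
  File.rename(script, File.join(dir, "scrape.rb"))
  puts first_match("mechanize", [dir]).inspect
end
```

Ruby 1.9.2 and later no longer put the current directory on the load path, so this particular clash is mostly a 1.8 problem.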

On Fri, Dec 10, 2010 at 8:52 PM, A. Mcbomb <atomicmcbomb@gmail.com> wrote:

I never heard of mechanize but I see from the doco that it requires
Nokogiri to run. I have a working copy of Nokogiri and just did a
successful 'gem install mechanize' but when I ran your basic script, I
get:

[root@trebek2 bin]# ./mechanize.rb
./mechanize.rb:7: uninitialized constant Mechanize (NameError)

I renamed the script and it worked!
Pretty nice....thanks.

My last question though is, what is the easiest way to get rid of all
the garbage that I don't want from the other tables?

thanks a lot

atomic

My last question though is, what is the easiest way to get rid of all
the garbage that I don't want from the other tables?

Try to narrow down the xpath used to pull stuff out. E.g.

    //table[2]/tbody/tr

to only get the rows from the second table.