Grabbing data off a webpage

BIl_Kleb1 · 17 December 2006 15:45

OK, so I haven't done this in years.

What's the "modern" way of grabbing the data off
a webpage, e.g.,

http://yorkcountyschools.org/mves/arlist/3-3.4.htm

My initial attempt has been focused on Hpricot,

  require 'rubygems'
  require 'open-uri'
  require 'hpricot'
  doc = Hpricot(open('http://yorkcountyschools.org/mves/arlist/3-3.4.htm'))

and I can find doc/"th" and doc/"tr", but what's
the best way to cram them into an array of structs
or something?

Thanks,

--
Bil Kleb
http://funit.rubyforge.org

Greg_Brown1 · 17 December 2006 16:22

I've actually been needing to do something like this for work and
haven't gotten around to it, so I'll take a stab at it.

require "ruport"
column_names = (doc/"th")[1..-1].map { |r| (r/"p").text }
rows = (doc/"tr")[3..-1]
parsed_rows = rows.inject([]) { |s,a|
  s << (a/"td").map { |r| (r/"td").text }
}
table = parsed_rows.to_table(column_names)

Now, I've pastied some of the things you can do from here, because
they won't translate to email well.

http://pastie.caboo.se/28169

Note that my Hpricot code is sort of hackish, so cleaning it up might
be worthwhile, but Ruport[0] could still be a good fit for representing
the data.

Hope this helps!

-greg

[0] http://ruport.infogami.com

Peter_Szinek3 · 17 December 2006 17:04

Hi Bil,

How about:

require 'rubygems'
require 'open-uri'
require 'hpricot'
require 'enumerator'

Record = Struct.new("Record", :id, :title, :author, :book_level, :points)
records = []

cells =
Hpricot(open('http://yorkcountyschools.org/mves/arlist/3-3.4.htm'))/"/html/body/table/tbody/tr//td"

cells.map { |elem| elem.inner_html }.each_slice(5) do |slice|
records << Record.new(*slice)
end
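
For anyone following along without the page at hand, here is a minimal sketch of what the resulting records look like in use. The cell values below are made up for illustration and are not taken from the actual page; the point is only that a Struct built this way can be read by member name, by symbol key, or by position.

```ruby
# Sketch only: the sample cells are invented, not scraped from the page.
Record = Struct.new(:id, :title, :author, :book_level, :points)

# Stand-in for the flat list of cell strings the scraping step yields.
cells = ["101", "Sample Title", "Sample Author", "3.4", "0.5",
         "102", "Another Title", "Another Author", "3.3", "1.0"]

# Group every five cells into one record, as in the code above.
records = []
cells.each_slice(5) { |slice| records << Record.new(*slice) }

first = records.first
puts first.title     # struct-like access by member name
puts first[:author]  # hash-like access by symbol
puts first[3]        # array-like access by position (book_level)
```

Note that every field is still a String at this point; converting book_level or points to numbers would be a separate step.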

HTH,
Peter

__
http://www.rubyrailways.com

RubyTalk · 17 December 2006 17:06

It's slow and messy, but I did it in 5 minutes.

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'uri'
require 'pp'

module Hpricot
  module Traverse

    # Returns the node neighboring this node to the south: just below it.
    # This method includes text nodes and comments and such.
    def next_node(loop=1)
      # check for a parent before asking it for its children
      return nil unless parent
      sib = parent.children
      sib[sib.index(self) + loop]
    end
  end
end

class HTMLpage
  #We save some things for a single load of the page
  #and just because
  def initialize()

    #html of the whole page. only get this once
    @page_html=nil
    load_page
  end

  ##
  # Complete html of page
  ##
  def page_html
    load_page.to_html
  end

  def row(location=0)
    doc=load_page
    return doc.search("tbody").collect{|x| x.search("tr")[location] }.compact
  end

  def to_struct()
    doc=load_page
    struct=[]
    doc.search("tbody").each{|x|
      arr=[]
      x.search("td").each{|xx|
        arr.push(xx.inner_html)
      }
      # exclusive range: an inclusive one would add a trailing Thing full of nils
      (0 ... arr.size/5).each{|index|
        struct.push(Thing.new(arr[(index*5)],arr[(index*5)+1],arr[(index*5)+2],arr[(index*5)+3],arr[(index*5)+4]))
      }
    }
    return struct
  end

private

    #loads the page data
    def load_page
      #check if we have page html if so return
      if @page_html
        doc=Hpricot(@page_html)
      else
          doc=Hpricot(open('http://yorkcountyschools.org/mves/arlist/3-3.4.htm'))
          @page_html=doc.to_html
      end
      return doc
    end

  end
  class Thing
    attr_reader :quiz_id,:title,:author,:booklevel,:points
    def initialize(quizID,title,author,bookLevel,points)
            @quiz_id=quizID
            @title=title
            @author=author
            @booklevel=bookLevel
            @points=points
    end
  end

page=HTMLpage.new

stuff=page.to_struct
pp stuff[0].title
pp stuff[0].author

W_James · 17 December 2006 20:40

require 'net/http'
http = Net::HTTP.new( "yorkcountyschools.org" )
# get returns the response object; the page text is its body
resp = http.get( "/mves/arlist/3-3.4.htm" )
data = resp.body

table = data.scan( %r{<tr>(.*?)</tr}im ).flatten.
  map{|s| s.scan( %r{<td>(.*?)</td>}i ).flatten }.
  reject{|ary| ary.size != 5}

p table
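
To see what the two scans do without hitting the network, here is a small sketch that runs the same regular expressions over an inline string. The markup and cell values are made up for illustration; the real page may well differ (for instance, a literal `<tr>` pattern will not match rows that carry attributes such as `<tr class="x">`).

```ruby
# Sketch only: invented sample markup standing in for the downloaded page.
html = <<~HTML
  <table>
  <tr><th>Quiz</th><th>Title</th></tr>
  <tr><td>101</td><td>Sample Title</td><td>Sample Author</td><td>3.4</td><td>0.5</td></tr>
  <tr><td>a row with a single cell</td></tr>
  <tr>
  <td>102</td><td>Another Title</td><td>Another Author</td><td>3.3</td><td>1.0</td>
  </tr>
  </table>
HTML

# First scan: grab the inside of every row (/m lets . span newlines).
# Second scan: pull the cell texts out of each row.
# reject: drop header rows and anything that is not a full five-cell record.
table = html.scan(%r{<tr>(.*?)</tr}im).flatten.
  map { |s| s.scan(%r{<td>(.*?)</td>}i).flatten }.
  reject { |ary| ary.size != 5 }

p table
```

The header row and the single-cell row fall out at the reject step, which leaves only the five-cell data rows.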

BIl_Kleb1 · 18 December 2006 13:15

Bil Kleb wrote:

My initial attempt has been focused on Hpricot,
[..] I can find doc/"th" and doc/"tr", but what's
the best way to cram them into an array of structs
or something?

Thanks everyone; I'm on my way now.

Regards,

--
Bil Kleb
http://fun3d.larc.nasa.gov

Greg_Brown1 · 17 December 2006 16:25

Yuck, seems to have made a mess of the text output.
Here it is better formatted:

http://pastie.caboo.se/28170/text

Greg_Brown1 · 17 December 2006 18:11

Clever solution, Peter.

If you wanted to adapt this to use Ruport instead of a Struct (and get
the features I showed), try:

records = [].to_table([:id, :title, :author, :book_level, :points])

and then replace the appending code with

records << slice

This would allow struct-like, hash-like, and array-like access as well
as access to Ruport's data manipulation and formatting tools.

Greg_Brown1 · 18 December 2006 00:46

require "open-uri"
body = open("yorkcountyschools.org/mves/arlist/3-3.4.htm").read

On 12/17/06, William James <w_a_x_man@yahoo.com> wrote:

require 'net/http'
http = Net::HTTP.new( "yorkcountyschools.org" )
resp, data = http.get( "/mves/arlist/3-3.4.htm", nil )

Peter_Szinek3 · 17 December 2006 18:18

This would allow struct-like, hash-like, and array-like access as well
as access to Ruport's data manipulation and formatting tools.

Thx for the pointer Gregory, I did not know about Ruport yet - seems
very interesting, I will definitely check it out.

Cheers,
Peter

__
http://www.rubyrailways.com

Greg_Brown1 · 18 December 2006 00:47

whoops... need the http://

On 12/17/06, Gregory Brown <gregory.t.brown@gmail.com> wrote:

require "open-uri"
body = open("yorkcountyschools.org/mves/arlist/3-3.4.htm").read
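
A quick way to see why the prefix matters is the sketch below, which uses only the standard uri library. Without a scheme the string parses as a bare relative path, so open-uri has nothing to tell it this is a URL and the call is treated as an ordinary file open. (On newer Rubys, 3.0 and later, the URL form is spelled URI.open rather than plain open.)

```ruby
require "uri"

with_scheme    = URI.parse("http://yorkcountyschools.org/mves/arlist/3-3.4.htm")
without_scheme = URI.parse("yorkcountyschools.org/mves/arlist/3-3.4.htm")

p with_scheme.scheme     # "http"
p with_scheme.host       # "yorkcountyschools.org"
p without_scheme.scheme  # nil: nothing marks this string as a URL
p without_scheme.host    # nil: the whole string is parsed as a path
```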

Greg_Brown1 · 17 December 2006 18:27

It might be overkill if all you needed was struct-like access to your
data, but it would sure come in handy if you had some more complex
needs...
