--------------------------------------------------
#!/usr/bin/ruby -w

require 'net/http'

# read the page data

http = Net::HTTP.new('finance.yahoo.com', 80)
resp, page = http.get('/q?s=IBM', nil)

# BEGIN processing HTML

# return the contents of each <tag>...</tag> pair found in the data
def parse_html(data, tag)
  return data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

output = []
table_data = parse_html(page, "table")
table_data.each do |table|
  out_row = []
  row_data = parse_html(table, "tr")
  row_data.each do |row|
    cell_data = parse_html(row, "td")
    cell_data.each do |cell|
      cell.gsub!(%r{<.*?>}, "") # strip any markup left inside the cell
    end
    out_row << cell_data
  end
  output << out_row
end

# END processing HTML

# examine the result

# print a nested array as an indented, indexed outline
def parse_nested_array(array, tab = 0)
  n = 0
  array.each do |item|
    if item.size > 0
      puts "#{"\t" * tab}[#{n}] {"
      if item.class == Array
        parse_nested_array(item, tab + 1)
      else
        puts "#{"\t" * (tab + 1)}#{item}"
      end
      puts "#{"\t" * tab}}"
    end
    n += 1
  end
end

parse_nested_array(output)
--------------------------------------------------
Notice that about half of this program parses the Web page and builds an
array of arrays, while the remainder displays that array. The entire task
of scraping the page is carried out in the middle of the program.
If you examine the array display created in the latter part of the program,
you will see that all the data are placed in an array that can be indexed
by table, row and cell. Simply select which array elements you want.
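To see the display format without depending on Yahoo's page, you can feed
parse_nested_array a small hand-built array (the values below are made up
purely for illustration):
--------------------------------------------------
sample = [ [ ["IBM", "120.50"] ] ]  # one table, one row, two cells
parse_nested_array(sample)
# prints (tab-indented):
# [0] {
#     [0] {
#         [0] {
#             IBM
#         }
#         [1] {
#             120.50
#         }
#     }
# }
--------------------------------------------------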
I want to emphasize something. The 21 lines, including spaces and comments,
between "# BEGIN processing HTML" and "# END processing HTML" are all that
is required to scrape the page. After this, you simply choose which table
cells you want to use by indexing the array.
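Once you have read the indices off that display, selecting a value is plain
array indexing. A minimal sketch, with made-up index values (the real ones
depend on Yahoo's current page layout):
--------------------------------------------------
# hypothetical: suppose the last-trade price appears in the display as
# table 2, row 0, cell 1
last_trade = output[2][0][1]
puts "IBM last trade: #{last_trade}"
--------------------------------------------------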
This way of scraping pages is better if you have to post-process the
extracted data, if you need a lightweight solution for environments with
limited resources, if you want detailed control over the scraping process,
if you don't want to figure out how to use a large, powerful library that
can do absolutely anything, or if you want to learn how to write Ruby
programs.
And this way of scraping pages is not for everyone.
Also, I must add, if the Web page contains certain kinds of HTML syntax
errors, in particular any unpaired <table>, <tr> or <td> tags, my program
will break, and Hpricot probably won't. If, on the other hand, the page is
syntactically correct, this program is perfectly adequate to extract the
data.
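A quick way to see that failure mode, using the parse_html method from the
program above (the HTML fragments here are invented for illustration):
--------------------------------------------------
good = "<td>42.15</td>"
bad  = "<td>42.15"        # unpaired: no closing </td>
p parse_html(good, "td")  # => ["42.15"]
p parse_html(bad,  "td")  # => [] -- the regex needs both tags, so the
                          #    cell silently disappears
--------------------------------------------------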
Obligatory editorial comment: Yahoo exists because it can expose you to
advertising. That is the foundation of their business model. When you
scrape pages, you avoid having to look at their advertising. If everyone
did this, for better or worse Yahoo would go out of business (or change
their business model).
Those are the facts. If this page-scraping business becomes commonplace,
Yahoo and other similar Web sites will eventually choose a different
strategy; for example, they might sell subscriptions. Or they might try to
do more than they are already doing to discourage scraping. This activity
might end up being a contest between the scrapers and the scrapees, with
the scrapees making their pages more and more complex.
I think eventually these content providers might put up their content as
graphics rather than text, as the spammers are now doing. Then the scrapers
would have to invest in OCR to get the content.
This scraping activity isn't illegal, unless of course you exploit or
re-post the scraped content.
End of editorial.
--
Paul Lutus
http://www.arachnoid.com