Parsing html table cells

lrlebron · 11 November 2006 13:30

I am trying to parse an HTML page to get the numbers inside the table
cells. The rows look like this:

<tr class="bg2" height="17" valign="middle" align="right"><td
align="left"></td><td>4</td><td>47</td><td>1</td><td>19</td></tr>

I would like to end up with a simple string that looks like this (for
this row):
4 47 1 19

The number of table cells that contain numbers may vary from row to
row.
I'm new to Ruby, so bear with me. I'm also learning to use hpricot and
have been able to get the table rows using it.

thanks,

Luis

David_Vallner · 11 November 2006 13:53

I'd use XPath; I'm not sure whether that's doable with hpricot's CSS
selectors or its (admittedly, I think) basic XPath support.

If you know the webpage is valid XHTML, I'd say switch to REXML; if not,
massage it with tidy (maybe hpricot can do this better too) and then
switch to REXML.

The code would probably be something like (where doc is the REXML document):

bg2_strings = doc.elements.to_a(%{//tr[@class='bg2']}).map { |bg2_row|
  bg2_row.elements.to_a('td').map { |cell| cell.text }.join(' ').strip.gsub(/\s+/, ' ')
}

Which might be horribly wrong, because I find REXML's XPath API hard to
memorise. YMMV. (It also hates the text() node test with a passion,
whence the second map.)
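
For reference, here is a minimal, self-contained sketch of the REXML approach described above. It assumes the markup is well-formed XHTML. The sample table and the helper name bg2_rows_to_strings are made up for illustration and do not come from the original page. The sketch finds the rows with REXML::XPath.match instead of the elements.to_a call quoted above, and it drops empty cells so the leading blank <td> does not leave a stray space.

```ruby
require 'rexml/document'

# Sample markup (an assumption): the row from the question wrapped in a
# table, plus one row of another class that should be skipped.
html = <<~XHTML
  <table>
    <tr class="bg2" height="17" valign="middle" align="right"><td align="left"></td><td>4</td><td>47</td><td>1</td><td>19</td></tr>
    <tr class="bg1"><td>ignored</td></tr>
  </table>
XHTML

doc = REXML::Document.new(html)

# Hypothetical helper: returns one space-separated string per bg2 row.
def bg2_rows_to_strings(doc)
  REXML::XPath.match(doc, "//tr[@class='bg2']").map do |row|
    row.elements.to_a('td')
       .map { |cell| cell.text.to_s.strip } # text is nil for an empty <td>
       .reject(&:empty?)
       .join(' ')
  end
end

puts bg2_rows_to_strings(doc)
```

With the sample above this should print 4 47 1 19. Real-world HTML is often not well-formed, in which case REXML will raise a parse error and the page would need cleaning up first, as suggested.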

David Vallner

Paul_Lutus · 12 November 2006 00:30

Try this:

-------------------------------------

#!/usr/bin/ruby -w

table = "<table><tr>" +
  "<td>4</td><td>47</td><td>1</td><td>19</td></tr>" +
  "<tr><td>7</td><td>49</td><td>4</td><td>39</td></tr>" +
  "<tr><td>14</td><td>17</td><td>19</td><td>21</td>" +
  "</tr></table>"

# Grab each <tr>...</tr> block (non-greedy, so rows are matched one at a time).
rows = table.scan(%r{<tr>.*?</tr>})

rows.each do |row|
  # Capture the text between each <td> and </td> pair.
  fields = row.scan(%r{<td>(.*?)</td>})
  puts fields.join(",")
end

-------------------------------------

Output:

4,47,1,19
7,49,4,39
14,17,19,21
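
Note that the patterns above match only bare <tr> and <td> tags, while the row in the original question carries attributes and wants spaces rather than commas. A possible variant, offered as a sketch only: the helper name cells_from_rows is hypothetical, and regular expressions like these assume simple, non-nested tables.

```ruby
# The row from the original question, joined onto one logical line.
row_html = '<tr class="bg2" height="17" valign="middle" align="right"><td ' \
           'align="left"></td><td>4</td><td>47</td><td>1</td><td>19</td></tr>'

# Hypothetical helper: returns one space-separated string per table row.
def cells_from_rows(html)
  # [^>]* lets the opening tag carry attributes; the m flag lets . span newlines.
  html.scan(%r{<tr\b[^>]*>(.*?)</tr>}m).map do |(row)|
    row.scan(%r{<td\b[^>]*>(.*?)</td>}m)
       .flatten
       .map(&:strip)
       .reject(&:empty?) # skip the empty leading cell
       .join(' ')
  end
end

puts cells_from_rows(row_html)
```

For the row from the question this should print 4 47 1 19.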

--
Paul Lutus
http://www.arachnoid.com

lrlebron · 11 November 2006 15:25

Thanks for your help. I was able to get it with some hpricot code:

intCells = tr.search("td").length

1.upto(intCells-1) do |i|
  print tr.search("td:eq(#{i})").inner_html + ' '
end

thanks,

Luis

