Parsing HTML table cells

I am trying to parse an HTML page that has strings that look like this

<tr class="bg2" height="17" valign="middle" align="right"><td
align="left"></td><td>4</td><td>47</td><td>1</td><td>19</td></tr>

to get the numbers inside the table cells.

I would like to end up with a simple string that looks like this (for
this row):
4 47 1 19

The number of table cells in a row that have numbers may vary for
different rows.
I'm new to Ruby, so bear with me. I'm also learning to use hpricot and
have been able to get the table rows using it.

thanks,

Luis

I'd use XPath, though I'm not sure whether that's doable with hpricot's
CSS selectors or its (admittedly basic, I think) XPath support.

If you know the webpage is valid XHTML, I'd say switch to REXML; if not,
massage it with tidy (maybe hpricot can do this better too) and then
switch to REXML.

The code would probably be something like (where doc is the REXML document):

bg2_strings = doc.elements.to_a(%{//tr[@class='bg2']}).map { |bg2_row|
  bg2_row.elements.to_a('td').map { |cell| cell.text }.join(' ').strip.gsub(/\s+/, ' ')
}

This might be horribly wrong, because I find REXML's XPath API hard to
memorise. YMMV. (It also hates the text() node test with a passion,
whence the second map.)
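
For anyone who wants to try the REXML route end to end, here is a minimal sketch. It assumes the input is already well-formed XML (REXML will raise on tag soup), and the sample rows other than the first, as well as the method name bg2_rows_as_strings, are made up for illustration. It uses REXML::XPath.match and Element#get_elements rather than elements.to_a, but the idea is the same: select the rows, then collect the text of each cell.

```ruby
require 'rexml/document'

# Hypothetical helper: returns one space-separated string per <tr class="bg2">.
def bg2_rows_as_strings(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//tr[@class='bg2']").map do |row|
    # cell.text is nil for an empty <td>, hence the to_s before strip.
    row.get_elements('td')
       .map { |cell| cell.text.to_s.strip }
       .reject(&:empty?)
       .join(' ')
  end
end

# Sample input: the first row is from the question, the others are invented.
sample = <<~XML
  <table>
    <tr class="bg2" height="17" valign="middle" align="right"><td align="left"></td><td>4</td><td>47</td><td>1</td><td>19</td></tr>
    <tr class="bg1"><td>ignored</td></tr>
    <tr class="bg2"><td align="left"></td><td>7</td><td>49</td></tr>
  </table>
XML

puts bg2_rows_as_strings(sample)
```

Rows with a different class are skipped, and rows with fewer numeric cells simply produce shorter strings.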

David Vallner

Try this:

-------------------------------------

#!/usr/bin/ruby -w

table = "<table><tr>" +
"<td>4</td><td>47</td><td>1</td><td>19</td></tr>" +
"<tr><td>7</td><td>49</td><td>4</td><td>39</td></tr>" +
"<tr><td>14</td><td>17</td><td>19</td><td>21</td>" +
"</tr></table>"

rows = table.scan(%r{<tr>.*?</tr>})

rows.each do |row|
   fields = row.scan(%r{<td>(.*?)</td>})
   puts fields.join(",")
end

-------------------------------------

Output:

4,47,1,19
7,49,4,39
14,17,19,21
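
Note that the sample above uses bare <tr> and <td> tags, while the row in the question carries attributes (and a line break inside one tag). A possible variation, as a sketch only: the patterns below allow optional attributes, skip empty cells, and join with spaces as the question asked. The constant and method names are hypothetical, and regular expressions will still break on nested tables or unclosed tags.

```ruby
# Match <tr ...>...</tr> and <td ...>...</td>, with or without attributes.
# The m flag lets "." span newlines; i ignores tag case.
ROW_RE  = %r{<tr\b[^>]*>(.*?)</tr>}mi
CELL_RE = %r{<td\b[^>]*>(.*?)</td>}mi

# Hypothetical helper: one space-separated string per table row.
def rows_to_strings(html)
  html.scan(ROW_RE).map do |(row_body)|
    row_body.scan(CELL_RE).flatten.map(&:strip).reject(&:empty?).join(' ')
  end
end

# The row from the question, including the line break inside the first <td>.
html = '<tr class="bg2" height="17" valign="middle" align="right"><td' + "\n" +
       'align="left"></td><td>4</td><td>47</td><td>1</td><td>19</td></tr>'

puts rows_to_strings(html)   # prints: 4 47 1 19
```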

--
Paul Lutus
http://www.arachnoid.com

Thanks for your help. I was able to get it with some hpricot code:

intCells = tr.search("td").length

1.upto(intCells-1) do |i|
  print tr.search("td:eq(#{i})").inner_html + ' '
end

thanks,

Luis
