Using Hpricot to parse a fiddly table

Adam_Dullenty · 6 January 2008 19:13

Hi there,

I'm fairly new to Ruby; previously I was an average Java programmer,
so it's all a bit foreign to me, especially XPath and CSS. I would be
grateful if someone could give me a hand with a problem I'm having. I
have a table whose fields I'm trying to extract in a particular way.
The table has this form:

<table>
  <tr>
    <td>...stuff I don't want...</td>
  </tr>
  <tr>
    <td>
       <table>
         ------------rows i want
         <tr>
           <td>
             <table>
               <tr>
                 <td>Field 1</td>
                 <td>Field 2</td>
               </tr>
             </table>
           </td>
           <td>Field 3</td>
           <td>Field 4, Field 5</td>
         </tr>
         ------------end of rows i want
       </table>
    </td>
  </tr>
</table>

I have managed to get Hpricot to parse the page and return the HTML for
that table, but I'm struggling to get it into an array of the form
["Field 1", "Field 2", "Field 3", "Field 4", "Field 5"] for each row. I
had hoped there would be some kind of built-in method for
extracting data from a table, but I can't find one.

Thanks again, look forward to a reply
Adam

--
Posted via http://www.ruby-forum.com/.

Steve_Ross · 6 January 2008 20:39

For the innermost table, try:

eles = doc.search('table table table td')

For the enclosing table, try:

eles = doc.search('table table td')

I don't suppose the semantics can be improved any -- like class names or ids?


Adam_Dullenty · 7 January 2008 00:49

Steve Ross wrote:

I don't suppose the semantics can be improved any -- like class names
or ids?

Thanks for your reply. Afraid not: there are no handy names or ids. I
think I was already doing what you posted, in a slightly different form:
"elements2 = (elements/"table//table//td")". Since my last post, though,
I've managed to sort it out with lots of array manipulation.

Thanks for the help though
Adam
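
The thread never shows the array manipulation itself, so here is a rough idea of what it might look like. This is only a sketch in plain Ruby, assuming the text of each innermost cell in a row has already been extracted (for instance with something like Hpricot's inner_text). The flatten_row helper and the sample data are hypothetical and are not Adam's actual code.

```ruby
# Hypothetical helper: turn the cell texts of one row into a flat field list.
# Cells holding several comma-separated values ("Field 4, Field 5") are split
# into separate fields, and surrounding whitespace is trimmed.
def flatten_row(cell_texts)
  cell_texts.flat_map { |text| text.split(',') }
            .map(&:strip)
            .reject(&:empty?)
end

# Sample data standing in for the text of each leaf <td> in one wanted row.
cells = ["Field 1", "Field 2", "Field 3", "Field 4, Field 5"]
fields = flatten_row(cells)
p fields
```

Running the helper once per wanted row gives one flat array per row. Splitting on commas assumes no single field contains a comma of its own.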

