REXML screen scraping questions

Dan_Kohn · 14 September 2005 11:56

My goal here is to take an HTML table and convert it into an array of
arrays, with each inner array representing the 5 columns of cells in a
given row and the outer array representing the whole table.

I'm using REXML to parse the DOM tree. I would appreciate suggestions
on cleaning up the code below. I've run into the following problems:

+ The result of the first XPath.match produces two root-level TR tags,
which causes REXML to fail on reparsing with an "attempted adding
second root element to document" error. My solution was to add
root-level <top> tags, but that's an ugly hack.

+ The biggest problem is that while XPath.match generates an array, the
REXML functions are no longer able to parse it. Instead, I settled on
the hack of converting the array to a string and then having REXML
reparse it. Is this really the best way to deal with recursive
parsing?

+ I can't create rowarray or tablearray because I get an
"xmlscrape.rb:45: undefined local variable or method `rowarray' for
main:Object (NameError)" error.

+ Ruby doesn't crash if I remove rowdom and just run the XPath on row.
However, I then get duplicates because it runs across the full DOM
tree, not just the portion of the tree I've selected in that loop. Is
there a way to have REXML realize that I want to work with a subset of
the tree, other than my too-complex string-conversion and reloading?

+ The :compress_whitespace directive does not seem to correctly realize
that newlines within a text entity are just regular whitespace and so
should be compressed. My solution was to use string.gsub to replace
all newlines with spaces at the start.

+ Some important text is inside <A> tags, but it's hard to remove a tag
while preserving the text inside. I finally got the replace_with syntax
working and put it in a remove_tag method, so I'm good to go there.

I'm obviously new to Ruby, so any help you can offer on cleaning this
up would be greatly appreciated.

- dan


--
Dan Kohn <mailto:dan@dankohn.com>
<http://www.dankohn.com/> <tel:+1-415-233-1000>

require "rexml/document"
include REXML
string = <<EOF
        <html>
        <tr>
        <td class="t4" nowrap="nowrap">9-Jan-05</td>
        <td class="t4"><a href="javascript:lu('OZ')">OZ</a> 0204 F
Class
        <a href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/ICN,itn/air/mp">
        ICN</a> to <a
href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/LAX,itn/air/mp">
        LAX</a></td>
        <td class="t4" nowrap="nowrap">5,968</td>
        <td class="t4" nowrap="nowrap">2,984</td>
        <td class="t4" nowrap="nowrap">8,952</td>
        </tr>
        <tr>
        <td class="t4" nowrap="nowrap">19-Jan-05</td>
        <td class="t4">MILEAGE PLUS UPGRADE AWARD
        15,000 MILES</td>
        <td class="t4" nowrap="nowrap">-15,000</td>
        <td> </td>
        <td class="t4" nowrap="nowrap">-15,000</td>
        </tr>
  </html>
EOF

def remove_tag( rexml_array,tag)
# Removes tag but leaves the text inside the tag as text inside the
# parent of the now removed tag
  while rexml_array.elements["//#{tag}"]
    rexml_array.elements["//#{tag}"].replace_with( Text.new(
      rexml_array.elements["//#{tag}"].text.strip))
  end
end

doc = Document.new( string.gsub!(/\n| /," "), {
:compress_whitespace => :all } )
table = XPath.match( doc, "//tr[count(td)=5]")
#doc = Document.new File.new( "uamileage.html")
#rows = XPath.match( doc, "//tr[count(td)=5][position()=6 or position()=7]")
table = "<top>", table, "</top>"
tabledom = Document.new( table.to_s)

XPath.each( tabledom,"/top/tr") { |row|
  rowdom= Document.new( row.to_s)
  XPath.each( rowdom,"//tr/td") { |cell|
    remove_tag( cell,"a")
    celltext = cell.texts.to_s
    print celltext,"\n"
# rowarray << celltext
    }
  puts "\n --- \n"
# tablearray << rowarray
  }
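
Several of the problems listed above (the <top> wrapper, the reparse of each row, and the duplicates) seem to share one cause: an XPath that starts with "//" is evaluated from the document root, whatever node is passed in. A minimal sketch follows, assuming the input is already well-formed; the sample markup and the helper name flat_text are made up for illustration. It uses a relative path ("td") against each row and builds the arrays with map, so nothing has to be declared beforehand.

```ruby
require "rexml/document"

# Hypothetical, already well-formed sample with three cells per row.
html = <<~HTML
  <html>
    <tr><td>9-Jan-05</td><td><a href="#">OZ</a> 0204 F Class</td><td>5,968</td></tr>
    <tr><td>19-Jan-05</td><td>UPGRADE
      AWARD</td><td>-15,000</td></tr>
  </html>
HTML

doc = REXML::Document.new(html)

# Hypothetical helper: concatenates the text of a node and all of its
# descendants, so text inside <a> survives without removing the tag.
def flat_text(node)
  case node
  when REXML::Text    then node.value
  when REXML::Element then node.children.map { |child| flat_text(child) }.join
  else ""
  end
end

table = REXML::XPath.match(doc, "//tr[count(td)=3]").map do |row|
  # "td" is relative to row; "//td" would search the whole document again.
  REXML::XPath.match(row, "td").map do |cell|
    flat_text(cell).gsub(/\s+/, " ").strip
  end
end

table.each { |row| puts row.join(":") }
```

The NameError in the list above comes from calling << on a variable that was never assigned; map sidesteps that, and an explicit rowarray = [] before the loop would work as well.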

daz · 15 September 2005 04:51

Dan Kohn wrote:

My goal here is to take an HTML table and convert it into an array of
arrays, with each inner array representing the 5 columns of cells in a
given row and the outer array representing the whole table.

I'm using REXML to parse the DOM tree. I would appreciate suggestions
on cleaning up the code below. I've run into the following problems:

[snip problems]

Hiya Dan,

I know you've tried other packages but I think REXML isn't what
you want for sloppy old HTML work.

Below is an "htmltools" implementation using your input.
(If you've used Mechanize, html/tree may be installed, already.)

If you just want to search the data with regexen, you
might make a method which yields the data to its block
at the point where it's been collated.
(Post back if you need any help with that)


#-----------------------------------------------------------
require 'html/tree' # http://ruby-htmltools.rubyforge.org/

exa = HTMLTree::Parser.new(verbose=true, false)
#exa.parse_file_named('xxx.html')
exa.feed(string) # replacing '.parse_file_named'

# Pick out all <tr> tags under the <html> tag
exa.html.children.select {|e0| e0.tag == 'tr'}.each do |tr|
  # p [:line, __LINE__, tr.to_s]
  tdn = 0
  # Pick out all <td> tags under successive <tr> tags
  tr.select {|e1| e1.tag == 'td'}.each do |td|
    data = ''; tdn += 1
    # Inside the <td> are untagged data or nested tags
    td.each do |item|
      if item.data?
        # p [:data, item.to_s]
        data << item.to_s
      else
        # p [:line, __LINE__, item.tag, item.attributes]
      end
    end
    data.gsub!(/\s+/, ' ')
    ### yield data # ? (from a method)
    puts '#--td%02d--> %s' % [tdn, data]
  end
  puts '#' << '='*55
end

#--td01--> 9-Jan-05
#--td02--> OZ 0204 F Class ICN to LAX
#--td03--> 5,968
#--td04--> 2,984
#--td05--> 8,952
#=======================================================
#--td01--> 19-Jan-05
#--td02--> MILEAGE PLUS UPGRADE AWARD 15,000 MILES
#--td03--> -15,000
#--td04-->  
#--td05--> -15,000
#=======================================================

It's up to you to make the output _and_ the code look pretty.

daz
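
The "method which yields the data to its block" idea mentioned above might look roughly like the sketch below. It is not built on htmltools; it uses plain regexes so that it runs with nothing installed, and the method name each_cell and the sample markup are assumptions made for illustration.

```ruby
# Hypothetical helper: yields (row number, cell number, text) for every
# <td> found, at the point where the cell text has been collated.
def each_cell(html)
  html.scan(%r{<tr\b[^>]*>(.*?)</tr>}mi).each_with_index do |(row_html), row_idx|
    row_html.scan(%r{<td\b[^>]*>(.*?)</td>}mi).each_with_index do |(cell_html), cell_idx|
      # Drop nested tags, then squeeze runs of whitespace into one space.
      text = cell_html.gsub(/<[^>]+>/, "").gsub(/\s+/, " ").strip
      yield row_idx + 1, cell_idx + 1, text
    end
  end
end

sample = <<~HTML
  <tr>
    <td class="t4">9-Jan-05</td>
    <td class="t4"><a href="#">ICN</a> to <a href="#">LAX</a></td>
    <td class="t4">5,968</td>
  </tr>
HTML

# The caller decides what to do with each cell, e.g. keep only mileage figures.
miles = []
each_cell(sample) do |_row, _cell, text|
  miles << text.delete(",").to_i if text =~ /\A-?[\d,]+\z/
end
puts miles.inspect
```

Keeping the collation in the method and the searching in the block means the same method can serve several different searches over one page.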

Dan_Kohn · 15 September 2005 08:56

Daz, thank you so much for taking the time to code that. I was also
busy today, and got my code working with REXML. Could you please take
a look at my code below and share your thoughts on whether you'd still
switch to htmltools.

The issue is that I'm creating a hundred different screen scrapers for
every frequent flyer program. Any scraper is, of course, brittle, but
it seemed to me like a DOM/XPath-based technique is both less likely to
break from small tweaks to the page and is also generally far more
concise to program. The downside, and it may be too big, is that my
code is awfully inefficient, and also requires that tidy be run on the
HTML before I start.

Also, since you're taking a look, could you please tell me if there's
any more concise way to initialize my arrays. (Ruby generally seems to
figure out variables, but this would only run if I explicitly used
Array.new.)

require "rexml/document"
include REXML
string = <<EOF
        <html>
        <tr>
        <td class="t4" nowrap="nowrap">9-Jan-05</td>
        <td class="t4"><a href="javascript:lu('OZ')">OZ</a> 0204 F
Class
        <a href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/ICN,itn/air/mp">
        ICN</a> to <a
href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/LAX,itn/air/mp">
        LAX</a></td>
        <td class="t4" nowrap="nowrap">5,968</td>
        <td class="t4" nowrap="nowrap">2,984</td>
        <td class="t4" nowrap="nowrap">8,952</td>
        </tr>
        <tr>
        <td class="t4" nowrap="nowrap">19-Jan-05</td>
        <td class="t4">MILEAGE PLUS UPGRADE AWARD
        15,000 MILES</td>
        <td class="t4" nowrap="nowrap">-15,000</td>
        <td> </td>
        <td class="t4" nowrap="nowrap">-15,000</td>
        </tr>
        </html>
EOF

def remove_tag( rexml_array,tag)
# Removes tag but leaves the text inside the tag as text inside
# the parent of the now removed tag
while rexml_array.elements["//#{tag}"]
        rexml_array.elements["//#{tag}"].replace_with( Text.new(
                rexml_array.elements["//#{tag}"].text.strip))
        end
end

doc = Document.new( string.gsub!(/\n| /," "), {
        :compress_whitespace => :all } )
tablearray = Array.new
XPath.each( doc,"//tr[count(td)=5]") { |row|
        rowarray = Array.new
        rowdom = Document.new( row.to_s)
        XPath.each( rowdom,"//td") { |cell|
                remove_tag( cell,"a")
                rowarray << cell.texts.to_s
                }
       tablearray << rowarray
        }
tablearray.each {|el| print el.join(":"),"\n"}

Even better is some other scraping I do on the same page, where in each
case I only need a one-dimensional array:

XPath.each( xml, "//td[@class='t3'][2]") { |cell|
summaryarray << cell.texts.to_s }

XPath.each( xml,
"//td[@colspan='4']/child::*") { |cell|
actsumarray << cell.text.to_s }
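
On the array-initialisation question above: a variable has to be assigned before << can be called on it, but [] is the short form of Array.new, and map avoids the accumulator altogether because XPath.match already returns an Array. A small sketch, assuming made-up markup and class names (label, total) rather than the real page:

```ruby
require "rexml/document"

# Hypothetical summary table; the class names are invented for this sketch.
xml = REXML::Document.new(<<~XML)
  <table>
    <tr><td class="label">Balance</td><td class="total">12,345</td></tr>
    <tr><td class="label">Expires</td><td class="total">31-Dec-06</td></tr>
  </table>
XML

# map builds the result directly from the Array that XPath.match returns.
summaryarray = REXML::XPath.match(xml, "//td[@class='total']").map do |cell|
  cell.texts.map(&:value).join
end

# When a loop with << reads better, [] replaces Array.new.
labels = []
REXML::XPath.each(xml, "//td[@class='label']") { |cell| labels << cell.text }

puts summaryarray.inspect
puts labels.inspect
```

The sketch joins the text nodes explicitly because, from Ruby 1.9 on, calling to_s on an Array returns its inspect form rather than the concatenated elements.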

Thanks again, Daz, for taking the time to look at my (first ever Ruby)
code. Any other suggestions you could offer would be greatly
appreciated.

- dan


--
Dan Kohn <mailto:dan@dankohn.com>
<http://www.dankohn.com/> <tel:+1-415-233-1000>

Gavin_Kistner2 · 15 September 2005 14:00

Following is a sample of what I do for screen scraping: Net::HTTP and regex only. Just look for an indicative message on the screen, abstract it appropriately, and use it as an anchor for the data you want.

The following undocumented script hammers the WUnderground.com server to get min/max/average temperatures for a given city (airport code) for a given date range (optionally across many years). I used it (and Excel) to create
http://phrogz.net/tmp/BoulderTemperatures_LateSeptember.pdf
and
http://phrogz.net/tmp/CopperTemperatures_LateSeptember.pdf
(Trying to give a bunch of family members coming into town for a wedding a feel for the potential temperature ranges, and the variations possible within a given 5-day period.)

require 'net/http'
require 'date'

def get_temperatures( airport_code, date_range, year_range=nil )
   if year_range
     d1 = date_range.first
     d2 = date_range.last
     dates = year_range.collect { |year|
       ( Date.new( year, d1.mon, d1.day )..Date.new( year, d2.mon, d2.day ) ).to_a
     }.flatten
   else
     dates = date_range.to_a
   end

   Net::HTTP.start('www.wunderground.com', 80) { |http|
     dates.collect { |date|
       url = "/history/airport/#{airport_code}/#{date.year}/#{date.mon}/#{date.day}/DailyHistory.html"
       html = http.get( url ).body
       stats = { :min=>'Min', :max=>'Max', :mean=>'Mean' }
       stats.each { |key,val|
         if str = html[ %r{#{val}(?: Temp(?:erature))?</td>.+?<td[^>]*>(.+?)</td>}im , 1 ]
           temp = str[ %r{(\d+).+?\°F}i, 1 ]
           stats[ key ] = temp ? temp.to_f : nil
         else
           stats[ key ] = nil
         end
       }
       if stats[ :min ] && stats[ :max ]
         DayTemperature.new( date, stats[ :min ], stats[ :max ], stats[ :mean ] )
       end
     }.compact
   }
end

class DayTemperature
   attr_accessor :date, :min, :max, :mean
   def initialize( date, min, max, mean=nil )
     @date = date
     @min = min
     @max = max
     @mean = mean
   end
   def to_s
     "%s\t%3i\t%3i\t%4i" % [ "#{@date.year}-#{@date.mon}-#{@date.day}", @max, @mean, @min ]
   end
end

temps = get_temperatures( 'KBJC', Date.new( 2000, 8, 15 )..Date.new( 2000, 9, 15 ), 1990..2005 )
puts temps.join( "\n" )
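
The anchoring technique in the script above can be tried offline, without hitting any server. The sketch below runs a similar pattern against a canned fragment; the markup and the helper name value_after are assumptions for illustration, and the real site's HTML may well differ.

```ruby
# Hypothetical page fragment standing in for a fetched page.
html = <<~HTML
  <tr><td>Mean Temperature</td>
      <td class="b">62 &#176;F</td></tr>
  <tr><td>Max Temperature</td>
      <td class="b">75 &#176;F</td></tr>
  <tr><td>Min Temperature</td>
      <td class="b">48 &#176;F</td></tr>
HTML

# Hypothetical helper: find the label (the anchor), skip to the next
# <td>, and return the first number in that cell, or nil if absent.
def value_after(html, label)
  cell = html[%r{#{Regexp.escape(label)}[^<]*</td>.+?<td[^>]*>(.+?)</td>}im, 1]
  cell && cell[/-?\d+(?:\.\d+)?/]&.to_f
end

temps = %w[Min Max Mean].to_h do |label|
  [label.downcase.to_sym, value_after(html, label)]
end
puts temps.inspect
```

Returning nil for a missing label keeps the "skip this day" decision with the caller, much as the script above does with its stats hash.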


On Sep 14, 2005, at 10:51 PM, daz wrote:

If you just want to search the data with regexen, you
might make a method which yields the data to its block
at the point where it's been collated.
(Post back if you need any help with that)

daz · 15 September 2005 13:51

Dan Kohn wrote:

[...]
The issue is that I'm creating a hundred different screen scrapers for
every frequent flyer program. Any scraper is, of course, brittle, but
it seemed to me like a DOM/XPath-based technique is both less likely to
break from small tweaks to the page and is also generally far more
concise to program. The downside, and it may be too big, is that my
code is awfully inefficient, and also requires that tidy be run on the
HTML before I start.

Hi Dan,

Your code, IMHO, is inefficient because it uses 'industrial grade'
software for a lightweight task, not because of your coding.
I've run traces on REXML programs, and the detailed work it carries out
is quite incredible (and necessary for its power).
Estimating conservatively from timing and profiling of comparable
scripts, I'd say that I could run 15 pages through 'tools' for each one
going through REXML ... probably as many as 30 ... and even more while
you're pre-processing with Tidy.

Also, since you're taking a look, could you please tell me if there's
any more concise way to initialize my arrays. (Ruby generally seems to
figure out variables, but this would only run if I explicitly used
Array.new.)

That's not a factor

Thanks again, Daz, for taking the time to look at my (first ever Ruby)
code. Any other suggestions you could offer would be greatly
appreciated.

- dan

Glad to help.

Just one suggestion: your REXML experience won't be wasted --
don't hesitate to use REXML when it's needed (or at the weekends;
it is /class/, as you know).
For this specific task, with speed being important, you need to use
a lighter package. I've used only one for any length of time, so I
can't compare with others.
Many folks would tackle this job with hand-parsing/regexps or this:
http://raa.ruby-lang.org/project/htmltokenizer/ - which may offer
you even better performance.

# Script used for timing comparisons against your latest.


#--------------------------------------------------------
exa = HTMLTree::Parser.new(verbose=true, ws=false)
exa.feed(string) # replacing '.parse_file_named'

tablearray = []
exa.html.children.select {|e0| e0.tag == 'tr'}.each do |tr|
  rowarray = []
  tr.select {|e1| e1.tag == 'td'}.each do |td|
    data = ''
    td.each do |item|
      data << item.to_s if item.data?
    end
    data.gsub!(/(\s| )+/, ' ')
    rowarray << data
  end
  tablearray << rowarray
end
tablearray.each {|el| puts el.join(":")}
#-------------------------------------------------------------------
9-Jan-05:OZ 0204 F Class ICN to LAX:5,968:2,984:8,952
19-Jan-05:MILEAGE PLUS UPGRADE AWARD 15,000 MILES:-15,000: :-15,000
#-------------------------------------------------------------------
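
For anyone wanting to repeat this kind of timing comparison, a rough harness is sketched below. It is an assumption-laden stand-in: it times REXML against a plain regex scan (not htmltools, which may not be installed), on a made-up 200-row page, and the helper name seconds is hypothetical. Absolute numbers will vary by machine and Ruby version, so it prints them without drawing a conclusion.

```ruby
require "rexml/document"

row  = "<tr><td>9-Jan-05</td><td>5,968</td><td>2,984</td></tr>"
page = "<html>#{row * 200}</html>" # hypothetical 200-row page

# Hypothetical helper: returns the wall-clock seconds the block took.
def seconds
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

rexml_rows = regex_rows = nil

rexml_time = seconds do
  doc = REXML::Document.new(page)
  rexml_rows = REXML::XPath.match(doc, "//tr").map do |tr|
    tr.elements.to_a("td").map(&:text)
  end
end

regex_time = seconds do
  regex_rows = page.scan(%r{<tr>(.*?)</tr>}m).map do |(tr)|
    tr.scan(%r{<td>(.*?)</td>}m).flatten
  end
end

puts format("REXML: %.4fs  regex: %.4fs", rexml_time, regex_time)
puts "same result: #{rexml_rows == regex_rows}"
```

Checking that both approaches produce the same rows before comparing times guards against timing two different amounts of work.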

Cheers,

daz
--

BTW, 'tools' does a similar job to Tidy (outputting to REXML format !):

  require 'html/xpath' # http://ruby-htmltools.rubyforge.org/
  exa = HTMLTree::Parser.new(verbose=false, strip_white=false)
  exa.feed(string)
  puts exa.tree.as_rexml_document
