My goal here is to take an HTML table and convert it into an array of
arrays, with each inner array representing the 5 columns of cells in a
given row and the outer array representing the whole table.
I'm using REXML to parse the DOM tree. I would appreciate suggestions
on cleaning up the code below. I've run into the following problems:
+ The result of the first XPath.match produces two root-level TR tags,
which causes REXML to fail on reparsing with an "attempted adding
second root element to document" error. My solution was to add
root-level <top> tags, but that's an ugly hack.
+ The biggest problem is that while XPath.match generates an array, the
REXML functions are no longer able to parse it. Instead, I settled on
the hack of converting the array to a string and then having REXML
reparse it. Is this really the best way to deal with recursive
parsing?
+ I can't create rowarray or tablearray because I get an
"xmlscrape.rb:45: undefined local variable or method `rowarray' for
main:Object (NameError)" error.
+ Ruby doesn't crash if I remove rowdom and just run the XPath on row.
However, I then get duplicates because it runs across the full DOM
tree, not just the portion of the tree I've selected in that loop. Is
there a way to have REXML realize that I want to work with a subset of
the tree, other than my too-complex string-conversion and reloading?
+ The :compress_whitespace option does not seem to treat newlines
within a text node as ordinary whitespace, so they are not
compressed. My solution was to use String#gsub to replace all
newlines with spaces before parsing.
+ Some important text is inside <A> tags, but it's hard to remove a tag
while preserving the text inside. I finally got the replace_with syntax
working and put it in a remove_tag method, so I'm good to go there.
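
For the subset-of-the-tree and NameError items above, here is a minimal sketch of one possible shape for the loop. It assumes that a relative path such as "td" is evaluated against the row element itself (whereas a path starting with "//" searches from the document root), and that both arrays are created before anything is appended to them. The two-column sample data is made up for illustration and is much smaller than the real page.

```ruby
require "rexml/document"

# Hypothetical two-row, two-column sample, not the real mileage page.
html = <<EOF
<html>
<tr><td>9-Jan-05</td><td>5,968</td></tr>
<tr><td>19-Jan-05</td><td>-15,000</td></tr>
</html>
EOF

doc = REXML::Document.new(html)

tablearray = []                  # outer array must exist before "<<" is used
REXML::XPath.each(doc, "//tr[count(td)=2]") do |row|
  rowarray = []                  # fresh inner array for each row
  # "td" is relative to row, so only this row's cells are visited;
  # "//tr/td" would start over from the document root.
  row.elements.each("td") do |cell|
    rowarray << cell.text
  end
  tablearray << rowarray
end

p tablearray
# => [["9-Jan-05", "5,968"], ["19-Jan-05", "-15,000"]]
```

In this sketch the matched elements are used directly, so no <top> wrapper or reparse is involved.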
I'm obviously new to Ruby, so any help you can offer on cleaning this
up would be greatly appreciated.
- dan
--
Dan Kohn <mailto:dan@dankohn.com>
<http://www.dankohn.com/> <tel:+1-415-233-1000>
require "rexml/document"
include REXML
string = <<EOF
<html>
<tr>
<td class="t4" nowrap="nowrap">9-Jan-05</td>
<td class="t4"><a href="javascript:lu('OZ')">OZ</a> 0204 F
Class
<a href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/ICN,itn/air/mp">
ICN</a> to <a
href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/LAX,itn/air/mp">
LAX</a></td>
<td class="t4" nowrap="nowrap">5,968</td>
<td class="t4" nowrap="nowrap">2,984</td>
<td class="t4" nowrap="nowrap">8,952</td>
</tr>
<tr>
<td class="t4" nowrap="nowrap">19-Jan-05</td>
<td class="t4">MILEAGE PLUS UPGRADE AWARD
15,000 MILES</td>
<td class="t4" nowrap="nowrap">-15,000</td>
<td> </td>
<td class="t4" nowrap="nowrap">-15,000</td>
</tr>
</html>
EOF
def remove_tag(rexml_array, tag)
  # Removes tag but leaves the text inside the tag as text inside the
  # parent of the now removed tag
  while rexml_array.elements["//#{tag}"]
    rexml_array.elements["//#{tag}"].replace_with(
      Text.new(rexml_array.elements["//#{tag}"].text.strip))
  end
end
# gsub rather than gsub!, because gsub! returns nil when nothing is replaced
doc = Document.new(string.gsub(/\n| /, " "),
                   { :compress_whitespace => :all })
table = XPath.match( doc, "//tr[count(td)=5]")
#doc = Document.new File.new( "uamileage.html")
#rows = XPath.match( doc, "//tr[count(td)=5][position()=6 or position()=7]")
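# Wrap the matched rows in a single <top> root so REXML can reparse them.
# Array#to_s no longer concatenates in current Ruby, so join explicitly.
table = "<top>#{table.map { |row| row.to_s }.join}</top>"
tabledom = Document.new(table)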
XPath.each(tabledom, "/top/tr") { |row|
  rowdom = Document.new(row.to_s)
  XPath.each(rowdom, "//tr/td") { |cell|
    remove_tag(cell, "a")
    celltext = cell.texts.join   # join the cell's text nodes into one string
    print celltext, "\n"
    # rowarray << celltext
  }
  puts "\n --- \n"
  # tablearray << rowarray
}
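
For the whitespace and <A> tag items, a possible alternative to editing the tree is to read the text out of each cell recursively and normalize it afterwards. The sketch below is only one way to do that. raw_text and cell_text are hypothetical helper names, not part of REXML, and the href values are placeholders.

```ruby
require "rexml/document"

# Hypothetical helper: concatenate every text node below node, in document
# order, descending into child elements such as <a>.
def raw_text(node)
  node.children.map { |child|
    case child
    when REXML::Text    then child.value
    when REXML::Element then raw_text(child)
    else ""             # ignore comments and other node types
    end
  }.join
end

# Hypothetical helper: collapse runs of whitespace (newlines included)
# into single spaces and trim both ends.
def cell_text(cell)
  raw_text(cell).gsub(/\s+/, " ").strip
end

doc = REXML::Document.new(<<EOF)
<tr><td><a href="#">OZ</a> 0204 F
Class
<a href="#">
ICN</a> to <a href="#">
LAX</a></td></tr>
EOF

puts cell_text(doc.root.elements["td"])
# prints: OZ 0204 F Class ICN to LAX
```

Because the tree is left untouched, this does not depend on :compress_whitespace or on replacing newlines before parsing.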