REXML feature request: XPath.match.text & better text documentation

Sean, et al, thanks for a great piece of software in REXML. I would
appreciate it if you would consider adding text and texts methods to
XPath and Elements.

I believe the following shows why it would be useful, but please let me
know if this isn't clear enough.

require "rexml/document"
include REXML
string = <<EOF
        <html>
        <td class="t4"><a href="javascript:lu('OZ')">OZ</a>
        0204 F Class
        <a href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/ICN,itn/air/mp">
        ICN</a> to <a
        href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/LAX,itn/air/mp">
        LAX</a></td>
        <tr>
        <td class="t4"><font color="white">UNITED</font></td>
        <td colspan="4" align="right">
        <strong>48,164</strong></td>
        </tr>
        <tr>
        <td class="t4"><font color="white">Star
        Alliance</font></td>
        <td colspan="4" align="right">
        <strong>49,072</strong></td>
        </tr>
        </html>
EOF
doc = Document.new string.gsub!(/\s+|&nbsp;/," ")

#This works fine:
actsumarray = Array.new
XPath.each( doc,
        "//td[@colspan='4']/child::*") { |cell|
        actsumarray << cell.text.to_s }
puts actsumarray # 48,164 & 49,072

# But either of these would be much more convenient:
# actsumarray = XPath.match.text( doc, "//td[@colspan='4']/child::*")
# actsumarray = doc.elements.text.to_a( "//td[@colspan='4']/child::*")

# Converting to text is also pretty confusing.
# You might consider adding a method like
# remove_tag (which should be enhanced to support
# multiple tags). I suspect others would find it useful.

def remove_tag( rexml_array, tag)
  # Removes tag but leaves the text inside the tag as text inside
  # the parent of the now removed tag
  while rexml_array.elements["//#{tag}"]
    rexml_array.elements["//#{tag}"].replace_with( Text.new(
      rexml_array.elements["//#{tag}"].text.strip))
  end
end

# These sorts of examples would be great for the documentation
# to show how much the results can vary.
cell = doc.elements["//td[@class='t4']"]
puts cell #[ugly HTML]
puts cell.text.to_s # 0204 F Class
puts cell.texts.to_s # 0204 F Class to
remove_tag( cell, "a")
puts cell #<td class='t4'>OZ 0204 F Class ICN to LAX</td>
puts cell.text.to_s #OZ
puts cell.texts.to_s #OZ 0204 F Class ICN to LAX
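
For the multi-tag enhancement mentioned above, something along these lines might do. This is only a sketch: the name remove_tags is made up, it is not part of REXML, and like remove_tag it keeps only the first text child of each removed element.

```ruby
require 'rexml/document'

# Hypothetical multi-tag variant of remove_tag (the name remove_tags is
# invented for this sketch). Each matching element is replaced by a Text
# node holding its first text child; markup nested inside it is dropped.
def remove_tags(element, *tags)
  tags.each do |tag|
    # ".//tag" limits the search to descendants of element
    while (node = REXML::XPath.first(element, ".//#{tag}"))
      node.replace_with(REXML::Text.new(node.text.to_s.strip))
    end
  end
  element
end

sample = REXML::Document.new("<p>a <b>bold</b> and <i>it</i> z</p>")
remove_tags(sample.root, "b", "i")
puts sample.root.texts.map { |t| t.value }.join
```

The replacement text nodes are not merged with their neighbours, which is why text and texts still give different results afterwards.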

         - dan

--
Dan Kohn <mailto:dan@dankohn.com>
<http://www.dankohn.com/> <tel:+1-415-233-1000>

One aside - you might like to know about:

doc = Document.new( string, :ignore_whitespace_nodes => :all )
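
To see what that option changes, here is a small sketch with a made-up three-line document (the sample XML is an assumption, not from the original post). It compares the number of child nodes with and without the option; whitespace-only text nodes between elements should be skipped when it is set.

```ruby
require 'rexml/document'

# Made-up sample: two elements separated by whitespace-only text
xml = "<root>\n  <a>1</a>\n  <b>2</b>\n</root>"

plain   = REXML::Document.new(xml)
compact = REXML::Document.new(xml, :ignore_whitespace_nodes => :all)

# The default parse keeps the newline/indent text nodes as children
puts plain.root.children.size
# With the option, whitespace-only text nodes are not added to the tree
puts compact.root.children.size
```

Note that this only drops text nodes that are entirely whitespace; it does not collapse whitespace inside text such as "0204 F Class".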

On Sep 15, 2005, at 3:56 AM, Dan Kohn wrote:

doc = Document.new string.gsub!(/\s+|&nbsp;/," ")

Does this help you?

require 'rexml/document'
include REXML

d = Document.new <<ENDXML
<root>
   <foo>Raw text</foo>
   <foo>Raw text2</foo>
   <foo>AA <bar>Nested Text</bar>ZZ</foo>
</root>
ENDXML

p XPath.match( d, '//foo//text()' ).collect{ |textnode|
   textnode.value
}
#=> ["Raw text", "Raw text2", "AA ", "Nested Text", "ZZ"]

class REXML::Element
   # Concatenate the values of all descendant text nodes
   def inner_text
     REXML::XPath.match( self, './/text()' ).collect{ |t| t.value }.join( '' )
   end
end

p XPath.match( d, '//foo' ).collect{ |foo|
   foo.inner_text
}
#=> ["Raw text", "Raw text2", "AA Nested TextZZ"]
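
Applied to the original request, the same XPath.match-plus-collect idea could be wrapped in a small helper. The name texts_for is hypothetical (REXML has no such method), and the table below is a cut-down stand-in for the HTML in the first post.

```ruby
require 'rexml/document'

# Hypothetical helper, not part of REXML: return the text of every
# element matched by an XPath expression, as an array of strings.
def texts_for(node, xpath)
  REXML::XPath.match(node, xpath).map { |el| el.text.to_s }
end

# Cut-down stand-in for the HTML in the original post
doc = REXML::Document.new(<<XML)
<table>
  <tr><td colspan="4"><strong>48,164</strong></td></tr>
  <tr><td colspan="4"><strong>49,072</strong></td></tr>
</table>
XML

p texts_for(doc, "//td[@colspan='4']/*")
```

This keeps the one-liner feel of the proposed XPath.match.text without changing REXML itself.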
