REXML feature request: XPath.match.text & better text documentation

Dan_Kohn · 15 September 2005 09:56

Sean, et al, thanks for a great piece of software in REXML. I would
appreciate it if you would consider adding the text and texts methods to
XPath and Elements.

I believe the following shows why it would be useful, but please let me
know if this isn't clear enough.

require "rexml/document"
include REXML
string = <<EOF
 <html>
 <td class="t4"><a href="javascript:lu('OZ')">OZ</a>
 0204 F Class
 <a href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/ICN,itn/air/mp">
 ICN</a> to <a
 href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/LAX,itn/air/mp">
 LAX</a></td>
 <tr>
 <td class="t4">UNITED</td>
 <td colspan="4" align="right">
 48,164</td>
 </tr>
 <tr>
 <td class="t4">Star
 Alliance</td>
 <td colspan="4" align="right">
 49,072</td>
 </tr>
 </html>
EOF
doc = Document.new string.gsub!(/\s+| /," ")

# This works fine:
actsumarray = Array.new
XPath.each( doc, "//td[@colspan='4']/child::*") { |cell|
  actsumarray << cell.text.to_s }
puts actsumarray # 48,164 & 49,072

# But either of these would be much more convenient:
# actsumarray = XPath.match.text( doc, "//td[@colspan='4']/child::*")
# actsumarray = doc.elements.text.to_a( "//td[@colspan='4']/child::*")

# Converting to text is also pretty confusing.
# You might consider adding a method like
# remove_tag (which should be enhanced to support
# multiple tags). I suspect others would find it useful.

def remove_tag( rexml_array, tag )
  # Removes tag but leaves the text inside the tag as text inside
  # the parent of the now removed tag
  while rexml_array.elements["//#{tag}"]
    rexml_array.elements["//#{tag}"].replace_with(
      Text.new( rexml_array.elements["//#{tag}"].text.strip ) )
  end
end

# These sorts of examples would be great for the documentation
# to show how much the results can vary.
cell = doc.elements["//td[@class='t4']"]
puts cell #[ugly HTML]
puts cell.text.to_s # 0204 F Class
puts cell.texts.to_s # 0204 F Class to
remove_tag( cell, "a")
puts cell # <td class='t4'>OZ 0204 F Class ICN to LAX</td>
puts cell.text.to_s #OZ
puts cell.texts.to_s #OZ 0204 F Class ICN to LAX
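
A minimal sketch of the difference between text and texts, and of collecting text from several XPath matches in one expression. The markup here is a made-up, simplified stand-in for the table above, and the variable names are only illustrative:

```ruby
require "rexml/document"

# Simplified, hypothetical markup (not the original page).
xml = <<~XML
  <tr>
    <td class="t4">first <a href="#">link</a> second</td>
    <td colspan="4"> 48,164</td>
    <td colspan="4"> 49,072</td>
  </tr>
XML
doc = REXML::Document.new(xml)

cell = doc.elements["//td[@class='t4']"]
first_text = cell.text                       # value of the first direct text child only
all_direct = cell.texts.map { |t| t.value }  # values of every direct text child
link_text  = cell.elements["a"].text         # text inside the nested <a> is not in either

# Collecting the text of every matching element in one expression.
totals = REXML::XPath.match(doc, "//td[@colspan='4']").map { |td| td.text.to_s.strip }

p first_text, all_direct, link_text, totals
```

Neither accessor descends into child elements, which is why the text inside the links only shows up after the a tags have been removed.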

- dan

--
Dan Kohn <mailto:dan@dankohn.com>
<http://www.dankohn.com/> <tel:+1-415-233-1000>

Gavin_Kistner2 · 15 September 2005 13:35

One aside - you might like to know about:

doc = Document.new( string, :ignore_whitespace_nodes => :all )
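
A small sketch of what that option appears to do, on a made-up two-element document. As far as I can tell it only drops text nodes that consist entirely of whitespace; it does not collapse whitespace inside real text, so it is not a full replacement for the gsub:

```ruby
require "rexml/document"

# Hypothetical sample input with indentation between the elements.
xml = "<root>\n  <a>one</a>\n  <b>two</b>\n</root>"

plain   = REXML::Document.new(xml)
trimmed = REXML::Document.new(xml, :ignore_whitespace_nodes => :all)

# Without the option, the indentation between elements is kept as text nodes.
plain_count = plain.root.children.size
# With the option, only the two elements remain as children of <root>.
trimmed_count = trimmed.root.children.size

puts plain_count, trimmed_count
```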

On Sep 15, 2005, at 3:56 AM, Dan Kohn wrote:

> doc = Document.new string.gsub!(/\s+| /," ")

Gavin_Kistner2 · 15 September 2005 14:19

Does this help you?

require 'rexml/document'
include REXML

d = Document.new <<ENDXML
<root>
 <foo>Raw text</foo>
 <foo>Raw text2</foo>
 <foo>AA <bar>Nested Text</bar>ZZ</foo>
</root>
ENDXML

p XPath.match( d, '//foo//text()' ).collect{ |textnode|
textnode.value
}
#=> ["Raw text", "Raw text2", "AA ", "Nested Text", "ZZ"]

class REXML::Element
   def inner_text
     self.each_element( './/text()' ){}.join( '' )
   end
end

p XPath.match( d, '//foo' ).collect{ |foo|
foo.inner_text
}
#=> ["Raw text", "Raw text2", "AA Nested TextZZ"]
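
The inner_text above leans on each_element handing back the matched nodes. A sketch of the same idea that asks XPath.match for the text nodes directly, written as a standalone helper (the name inner_text_of is made up here, it is not part of REXML):

```ruby
require "rexml/document"

# Hypothetical helper: concatenates the values of all text nodes
# below +element+, at any depth.
def inner_text_of(element)
  REXML::XPath.match(element, ".//text()").map { |node| node.value }.join
end

doc = REXML::Document.new("<root><foo>AA <bar>Nested Text</bar>ZZ</foo></root>")
foo = doc.root.elements["foo"]

result = inner_text_of(foo)
puts result
```

Keeping it as a plain method avoids reopening REXML::Element, at the cost of a slightly less convenient call.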

