Pulling text from elements with REXML

Paul_Willis · 19 March 2007 11:38

Hi

I am using REXML to pull text from a NewsML document.

require 'rexml/document'
include REXML
file = File.new("Main_News.xml")
doc = Document.new(file)
root = doc.root
puts root.elements["NewsItem/NewsComponent/NewsComponent[1]/NewsComponent/ContentItem/DataContent/nitf/body/body.head/hedline/hl1"]

Gives me...

<hl1>Blueprint to cut emissions unveiled</hl1>

Is there an easy way (ie something in REXML) to pull just the text
without the containers <hl1> and </hl1>.

Paul

--
Posted via http://www.ruby-forum.com/.

Peter_Szinek3 · 19 March 2007 11:50

Paul Willis wrote:

Hi

I am using REXML to pull text from a NewsML document.

require 'rexml/document'
include REXML
file = File.new("Main_News.xml")
doc = Document.new(file)
root = doc.root
puts root.elements["NewsItem/NewsComponent/NewsComponent[1]/NewsComponent/ContentItem/DataContent/nitf/body/body.head/hedline/hl1"]

Gives me...

<hl1>Blueprint to cut emissions unveiled</hl1>

Is there an easy way (ie something in REXML) to pull just the text
without the containers <hl1> and </hl1>.

If I understood correctly, you need the text content of the node rather than the whole node. This can be accomplished with:

some_element.text

so you could do something like

root.elements[...your stuff_here...].to_a.each {|e| puts e.text}
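A minimal sketch of the difference between printing the element and printing its text. The XML here is a hypothetical, heavily trimmed stand-in for the NewsML file, and the short path is an assumption made only to keep the example small:

```ruby
require 'rexml/document'

# Hypothetical stand-in for Main_News.xml, reduced to a single headline.
xml = <<~XML
  <NewsML>
    <NewsItem>
      <hedline><hl1>Blueprint to cut emissions unveiled</hl1></hedline>
    </NewsItem>
  </NewsML>
XML

doc = REXML::Document.new(xml)
hl1 = doc.root.elements["NewsItem/hedline/hl1"]

puts hl1       # the whole element, tags included
puts hl1.text  # only the text inside the element

headline = hl1.text
```

Note that elements[...] returns nil when nothing matches the path, so a guard (or &.text) is worth having when the document structure is not guaranteed.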

HTH,
Peter

__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby

Gavin_Kistner2 · 19 March 2007 14:05

require 'rexml/document'
doc = REXML::Document.new("<root><kid>hello world</kid></root>")
p REXML::XPath.first( doc, '/root/kid/text()' )
#=> "hello world"

On Mar 19, 5:38 am, Paul Willis <i...@paulwillis.com> wrote:

root.elements["NewsItem/NewsComponent/NewsComponent[1]/NewsComponent/ContentItem/DataContent/nitf/body/body.head/hedline/hl1"]

Gives me...

<hl1>Blueprint to cut emissions unveiled</hl1>

Is there an easy way (ie something in REXML) to pull just the text
without the containers <hl1> and </hl1>.

Paul_Willis · 19 March 2007 12:06

Peter Szinek wrote:

If I understood correctly, you need the text content of the node rather
than the whole node. This can be accomplished with:

some_element.text

You did understand correctly, .text on the end was all I needed.

Cheers

Paul

Gavin_Kistner2 · 19 March 2007 14:25

Also, depending on your needs:

  include REXML
  doc = Document.new("<root><kid>hello</kid><kid>world</kid></root>")
  p XPath.match( doc, '/root/kid/text()' )
  #=> ["hello", "world"]

On Mar 19, 8:04 am, "Phrogz" <g...@refinery.com> wrote:

On Mar 19, 5:38 am, Paul Willis <i...@paulwillis.com> wrote:

> root.elements["NewsItem/NewsComponent/NewsComponent[1]/NewsComponent/ContentItem/DataContent/nitf/body/body.head/hedline/hl1"]

> Gives me...

> <hl1>Blueprint to cut emissions unveiled</hl1>

> Is there an easy way (ie something in REXML) to pull just the text
> without the containers <hl1> and </hl1>.

require 'rexml/document'
doc = REXML::Document.new("<root><kid>hello world</kid></root>")
p REXML::XPath.first( doc, '/root/kid/text()' )
#=> "hello world"

Paul_Willis · 22 March 2007 16:44

require 'rexml/document'
doc = REXML::Document.new("<root><kid>hello world</kid></root>")
p REXML::XPath.first( doc, '/root/kid/text()' )
#=> "hello world"

Thanks for that, I'm now using REXML::XPath with a combination of .first
and .match to pull the element text out.

One more thing, given an XML document...

<root><kid stuff="some-other-text">hello world</kid></root>

What XPath would select the attribute 'stuff' and return
'some-other-text'?

Paul

Keith_Fahlgren · 19 March 2007 15:38

Hey,

Two notes:
1. I always suggest the REXML::XPath methods over the others for
people who grok XPath.
2. A REXML::XPath.* ... text() match will return a REXML::Text node,
which may _not_ be what you want:

$ irb --simple-prompt foo.rb
>> require 'rexml/document'
=> true
>> doc = REXML::Document.new("<root><kid>hello world</kid></root>")
=> <UNDEFINED> ... </>
>> REXML::XPath.first( doc, '/root/kid/text()' )
=> "hello world"
>> REXML::XPath.first( doc, '/root/kid/text()' ).class
=> REXML::Text

Just something to be aware of (use .to_s if you want a string, as usual).
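A small sketch of why the distinction can matter. The document below is a made-up example containing an entity. As far as I know, on a parsed REXML::Text node .to_s keeps the text in its escaped XML form, while .value gives the unescaped text (which is also what Element#text hands back):

```ruby
require 'rexml/document'

# Made-up document with an entity in the text, to show the difference.
doc  = REXML::Document.new("<root><kid>fish &amp; chips</kid></root>")
node = REXML::XPath.first(doc, '/root/kid/text()')

node_class = node.class   # REXML::Text, not String
escaped    = node.to_s    # String, entities left as written in the XML
plain      = node.value   # String, entities resolved

puts escaped
puts plain
```

For text without entities, like "hello world", the two calls give the same string, so either is fine there.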

HTH,
Keith

Gavin_Kistner2 · 22 March 2007 16:55

require 'rexml/document'
include REXML
doc = Document.new( <<ENDDOC )
<root>
<kid stuff="some-other-text">hello world</kid>
<kid class="best" stuff="gibbles">hello world</kid>
</root>
ENDDOC

att = XPath.first( doc, '//kid/@stuff' )
p att, att.class, att.value
#=> stuff='some-other-text'
#=> REXML::Attribute
#=> "some-other-text"

p XPath.first( doc, '//kid[@class="best"]/@stuff' ).value
#=> "gibbles"

I don't know what the XPath syntax is to select the value of an
attribute directly. I'd be interested to know if someone else knows it.
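This does not answer the pure-XPath question, but as a sketch of two ways to end up with plain Strings: look the attribute up on the element itself, or map .value over whatever XPath.match returns. The variable names are only illustrative:

```ruby
require 'rexml/document'

doc = REXML::Document.new(<<~XML)
  <root>
    <kid stuff="some-other-text">hello world</kid>
    <kid class="best" stuff="gibbles">hello world</kid>
  </root>
XML

# Select the element, then read the attribute from it; this yields a String.
first_kid = REXML::XPath.first(doc, '//kid')
stuff     = first_kid.attributes['stuff']

# Select every matching attribute node and convert each one to a String.
all_stuff = REXML::XPath.match(doc, '//kid/@stuff').map(&:value)

p stuff
p all_stuff
```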

On Mar 22, 10:44 am, Paul Willis <i...@paulwillis.com> wrote:

One more thing, given an XML document...

<root><kid stuff="some-other-text">hello world</kid></root>

What XPath would select the attribute 'stuff' and return
'some-other-text'?

Paul_Willis · 22 March 2007 17:02

Gavin Kistner wrote:

att = XPath.first( doc, '//kid/@stuff' )

I don't know what the XPath syntax is to select the value of an
attribute directly. I'd be interested to know if someone else knows it.

Cheers, it was the kid/@stuff I needed...

puts XPath.first( doc, '/root/kid/@stuff' )

#=> some-other-text

Paul

Gavin_Kistner2 · 22 March 2007 17:10

Nice, I didn't realize that REXML::Attribute had such different output
for #inspect versus #to_s. It's nice, then, that you don't need to
call .value in this particular case. Just be aware that without
the .value call you still have an Attribute instance that can just be
treated as a string in some areas:

att = XPath.first( doc, '//kid/@stuff' )

puts att
#=> some-other-text

puts att.value + '-more'
#=> some-other-text-more

puts att + "-more"
#=> tmp.rb:17: undefined method `+' for
#   stuff='some-other-text':REXML::Attribute (NoMethodError)
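A short sketch of two ways around that NoMethodError, assuming the same kind of document as above: convert explicitly with .to_s (or .value), or use string interpolation, which calls to_s for you:

```ruby
require 'rexml/document'

doc = REXML::Document.new('<root><kid stuff="some-other-text">hello world</kid></root>')
att = REXML::XPath.first(doc, '//kid/@stuff')

# Explicit conversion before concatenating.
concatenated = att.to_s + "-more"

# Interpolation converts the Attribute via to_s implicitly.
interpolated = "#{att}-more"

puts concatenated
puts interpolated
```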

On Mar 22, 11:02 am, Paul Willis <i...@paulwillis.com> wrote:

Gavin Kistner wrote:
> att = XPath.first( doc, '//kid/@stuff' )
> I don't know what the XPath syntax is to select the value of an
> attribute directly. I'd be interested to know if someone else knows it.

Cheers, it was the kid/@stuff I needed...

puts XPath.first( doc, '/root/kid/@stuff' )

#=> some-other-text
