How to get REXML to return items in order?

ted2 · 3 September 2005 21:16

Hi,

I'm new to Ruby and can't figure out why REXML isn't returning the elements
in the order they appear in the document. Here's my code and the document.
Any help appreciated.

Thanks,
Ted

#==============================
# ruby
#==============================
require 'rexml/document'

xml = REXML::Document.new(File.open("test.html"))
xml.elements.each("//span[@class='c5']") do |element|
  puts element
end

#==============================
# the "test.html" file
#==============================
<html>
<body>
<a name="1"/>
<table><tr><td><span class="c5"><b>1st Title</b></span></td></tr></table>
<a name="2"/>
<table><tr><td><span class="c5"><b>2nd Title</b></span></td></tr></table>
<a name="3"/>
<table><tr><td><span class="c5"><b>3rd Title</b></span></td></tr>
</table>
</body>
</html>

Gavin_Kistner2 · 3 September 2005 23:40

> I'm new to Ruby and can't figure out why REXML isn't returning the elements
> in the order they appear in the document. Here's my code and the document.

I confirm the problem. Looks like a bug. If I remove some of the anchors, it works.
(Off-topic - no need to use empty named anchors in your page - just use IDs on existing elements instead.)

[Sliver:~/Desktop] gkistner$ cat tmp.rb
code = <<ENDHTML
<html><body>
<a name="1"/>
<table><tr><td><span class="c5"><b>1st Title</b></span></td></tr></table>
<a name="2"/>
<table><tr><td><span class="c5"><b>2nd Title</b></span></td></tr></table>
<a name="3"/>
<table><tr><td><span class="c5"><b>3rd Title</b></span></td></tr></table>
</body></html>
ENDHTML

require 'rexml/document'
xml = REXML::Document.new( code );
xml.elements.each( "//span[@class='c5']" ) do |element|
  puts element
end

[Sliver:~/Desktop] gkistner$ ruby -v tmp.rb
ruby 1.8.2 (2004-12-25) [powerpc-darwin7.7.2]
<span class='c5'><b>3rd Title</b></span>
<span class='c5'><b>1st Title</b></span>
<span class='c5'><b>2nd Title</b></span>

ted2 · 4 September 2005 00:31

Thanks Gavin. Unfortunately I can't remove the anchors. The html is just a
sample of the documents (not my docs) that I'm given to parse. Someone on
IRC mentioned that XPath 1.0 doesn't guarantee the order of elements.
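
For readers stuck on an affected REXML who cannot change the input, one possible workaround (a sketch, not something proposed in this thread) is to skip XPath for this query and walk the tree by hand. A depth-first walk over each node's children visits elements in document order by construction. The helper name `spans_in_document_order` is made up for illustration, and the sample markup is the one from the first post.

```ruby
require 'rexml/document'

# Hypothetical helper: depth-first walk that collects <span> elements
# with the given class attribute, in the order they appear in the source.
def spans_in_document_order(node, css_class, found = [])
  node.children.each do |child|
    # Skip text nodes, comments and other non-element children.
    next unless child.is_a?(REXML::Element)
    if child.name == 'span' && child.attributes['class'] == css_class
      found << child
    end
    # Recurse so nested matches are found as well.
    spans_in_document_order(child, css_class, found)
  end
  found
end

html = <<ENDHTML
<html><body>
<a name="1"/>
<table><tr><td><span class="c5"><b>1st Title</b></span></td></tr></table>
<a name="2"/>
<table><tr><td><span class="c5"><b>2nd Title</b></span></td></tr></table>
<a name="3"/>
<table><tr><td><span class="c5"><b>3rd Title</b></span></td></tr></table>
</body></html>
ENDHTML

doc = REXML::Document.new(html)
spans = spans_in_document_order(doc, 'c5')
spans.each { |span| puts span }
```

This trades the convenience of an XPath expression for a traversal whose order does not depend on the XPath engine, so it only suits simple name-and-attribute matches like this one.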

daz · 4 September 2005 01:41

Gavin Kistner wrote:

>> I'm new to Ruby and can't figure out why REXML isn't returning the
>> elements in the order they appear in the document. Here's my code and
>> the document.
>
> I confirm the problem. Looks like a bug. [...]

.... and it's fixed in CVS for 1.8.3

If you need this now, you could download the later version here:
http://www.ruby-lang.org/cgi-bin/cvsweb.cgi/ruby/lib/rexml/rexml.tar.gz?only_with_tag=ruby_1_8;tarball=1

to e.g. "C:\Ruby\TEMP" then change the lookup path at the top of your script.

$:.unshift('C:/Ruby/TEMP') # for rexml fixes
require 'rexml/document'
xml = REXML::Document.new(DATA)
xml.elements.each("//span[@class='c5']") do |element|
  puts element
end

#-> <span class='c5'><b>1st Title</b></span>
#-> <span class='c5'><b>2nd Title</b></span>
#-> <span class='c5'><b>3rd Title</b></span>

__END__
<html>
<body>
<a name="1"/>
<table><tr><td><span class="c5"><b>1st Title</b></span></td></tr></table>
<a name="2"/>
<table><tr><td><span class="c5"><b>2nd Title</b></span></td></tr></table>
<a name="3"/>
<table><tr><td><span class="c5"><b>3rd Title</b></span></td></tr>
</table>
</body>
</html>

daz
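
To confirm that the copy in the override directory is the one actually being loaded, a quick check along these lines may help. It is only a sketch: `C:/Ruby/TEMP` is the example directory from the post above, and the version string and path printed will depend on the installation.

```ruby
# Example directory from the post above; adjust to wherever the newer
# rexml/ folder was unpacked (assumption).
override_dir = 'C:/Ruby/TEMP'
$:.unshift(override_dir) if File.directory?(override_dir)

require 'rexml/document'

# $LOADED_FEATURES records the files Ruby has required so far, so the
# entry for rexml/document shows which copy was picked up.
loaded_from = $LOADED_FEATURES.grep(%r{rexml/document})

puts "REXML version: #{REXML::VERSION}"
puts "Loaded from:   #{loaded_from.first}"
```

If the printed path does not point into the override directory, the `$:.unshift` line probably ran after REXML had already been required, or the directory does not contain a `rexml/` subfolder.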

David_A_Black3 · 4 September 2005 01:19

Hi --

On Sun, 4 Sep 2005, ted wrote:

> Thanks Gavin. Unfortunately I can't remove the anchors. The html is just a
> sample of the documents (not my docs) that I'm given to parse. Someone on
> IRC mentioned that XPath 1.0 doesn't guarantee the order of elements.

I would be astonished if Sean Russell had combed through the 1.0 spec
to find some loophole that made it plausible to have an iteration not
follow document order. I could be wrong but I think it's more likely
a REXML bug.

David

--
David A. Black
dblack@wobblini.net

ted2 · 4 September 2005 04:11

Thanks daz.

Dan_Kohn · 15 September 2005 07:01

I just wanted to mention that I encountered the same bug and that the
new version of the library fixed it for me. Thank you very much for
the clear instructions. If only for-pay products had support that was
this good.

- dan

--
Dan Kohn <mailto:dan@dankohn.com>
<http://www.dankohn.com/> <tel:+1-415-233-1000>

Dan_Kohn · 15 September 2005 19:46

Daz, there's a bug in the CVS version of REXML. The following code
produces the error below, but works perfectly with the default 1.8.2
REXML (i.e., when I comment out the first line).

ruby rexmlbug.rb

C:/Dan/dev/rexml/xpath_parser.rb:157:in `expr': undefined method `delete_if' for nil:NilClass (NoMethodError)
  from C:/Dan/dev/rexml/xpath_parser.rb:481:in `d_o_s'
  from C:/Dan/dev/rexml/xpath_parser.rb:478:in `each_index'
  from C:/Dan/dev/rexml/xpath_parser.rb:478:in `d_o_s'
  from C:/Dan/dev/rexml/xpath_parser.rb:469:in `descendant_or_self'
  from C:/Dan/dev/rexml/xpath_parser.rb:314:in `expr'
  from C:/Dan/dev/rexml/xpath_parser.rb:125:in `match'
  from C:/Dan/dev/rexml/xpath_parser.rb:56:in `parse'
  from C:/Dan/dev/rexml/xpath.rb:53:in `each'
  from rexmlbug.rb:28

Exit code: 1

$:.unshift('C:/Dan/dev') # for rexml fixes
require "rexml/document"
include REXML
string = <<EOF
        <html>
        <td class="t4"><a href="javascript:lu('OZ')">OZ</a>
        0204 F Class
        <a href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/ICN,itn/air/mp">
        ICN</a> to <a
        href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/LAX,itn/air/mp">
        LAX</a></td>
        <tr>
        <td class="t4"><font color="white">UNITED</font></td>
        <td colspan="4" align="right">
        <strong>48,164</strong></td>
        </tr>
        <tr>
        <td class="t4"><font color="white">Star
        Alliance</font></td>
        <td colspan="4" align="right">
        <strong>49,072</strong></td>
        </tr>
        </html>
EOF

doc = Document.new string.gsub!(/\s+| /," ")
array = Array.new
XPath.each( doc, "//td[@colspan='4']/preceding-sibling::td/child::*") { |cell|
  array << cell.texts.to_s }
puts array

daz · 17 September 2005 01:16

Dan Kohn wrote:

> Daz, there's a bug in the CVS version of REXML.
> The following code produces the error below, but works
> perfectly with the default 1.8.2 REXML [...]

Thanks Dan.

The REXML code in that area looks quite "fluid" and
there are clear warnings to "turn away" (which I heeded ;).

I've filed a bug report which you might want to check over
and then keep a watch on.

http://www.germane-software.com/projects/rexml/ticket/32

Cheers,

daz
