How to get REXML to return items in order?

Hi,

I'm new to Ruby and can't figure out why REXML isn't returning the elements
in the order they appear in the document. Here's my code and the document.
Any help appreciated.

Thanks,
Ted

#==============================
# ruby
#==============================
require 'rexml/document'

xml = REXML::Document.new(File.open("test.html"))
xml.elements.each("//span[@class='c5']") do |element|
    puts element
end

#==============================
# the "test.html" file
#==============================
<html>
<body>
<a name="1"/>
<table><tr><td><span class="c5"><b>1st Title</b></span></td></tr></table>
<a name="2"/>
<table><tr><td><span class="c5"><b>2nd Title</b></span></td></tr></table>
<a name="3"/>
<table><tr><td><span class="c5"><b>3rd Title</b></span></td></tr>
</table>
</body>
</html>

> I'm new to Ruby and can't figure out why REXML isn't returning the elements
> in the order they appear in the document. Here's my code and the document.

I confirm the problem. Looks like a bug. If I remove some of the anchors, it works.
(Off-topic - no need to use empty named anchors in your page - just use IDs on existing elements instead.)

[Sliver:~/Desktop] gkistner$ cat tmp.rb
code = <<ENDHTML
<html><body>
<a name="1"/>
<table><tr><td><span class="c5"><b>1st Title</b></span></td></tr></table>
<a name="2"/>
<table><tr><td><span class="c5"><b>2nd Title</b></span></td></tr></table>
<a name="3"/>
<table><tr><td><span class="c5"><b>3rd Title</b></span></td></tr></table>
</body></html>
ENDHTML

require 'rexml/document'
xml = REXML::Document.new( code )
xml.elements.each( "//span[@class='c5']" ) do |element|
     puts element
end

[Sliver:~/Desktop] gkistner$ ruby -v tmp.rb
ruby 1.8.2 (2004-12-25) [powerpc-darwin7.7.2]
<span class='c5'><b>3rd Title</b></span>
<span class='c5'><b>1st Title</b></span>
<span class='c5'><b>2nd Title</b></span>

On Sep 3, 2005, at 3:16 PM, ted wrote:

Thanks Gavin. Unfortunately I can't remove the anchors. The html is just a
sample of the documents (not my docs) that I'm given to parse. Someone on
IRC mentioned that XPath 1.0 doesn't guarantee the order of elements.

Gavin Kistner wrote:

> I'm new to Ruby and can't figure out why REXML isn't returning the
> elements
> in the order they appear in the document. Here's my code and the
> document.

I confirm the problem. Looks like a bug. [...]

... and it's fixed in CVS for 1.8.3.

If you need the fix now, you can download the later version here:
http://www.ruby-lang.org/cgi-bin/cvsweb.cgi/ruby/lib/rexml/rexml.tar.gz?only_with_tag=ruby_1_8;tarball=1

Extract it to e.g. "C:\Ruby\TEMP", then put that directory at the front of
the load path at the top of your script:

$:.unshift('C:/Ruby/TEMP') # for rexml fixes
require 'rexml/document'
xml = REXML::Document.new(DATA)
xml.elements.each("//span[@class='c5']") do |element|
  puts element
end

#-> <span class='c5'><b>1st Title</b></span>
#-> <span class='c5'><b>2nd Title</b></span>
#-> <span class='c5'><b>3rd Title</b></span>

__END__
<html>
<body>
<a name="1"/>
<table><tr><td><span class="c5"><b>1st Title</b></span></td></tr></table>
<a name="2"/>
<table><tr><td><span class="c5"><b>2nd Title</b></span></td></tr></table>
<a name="3"/>
<table><tr><td><span class="c5"><b>3rd Title</b></span></td></tr>
</table>
</body>
</html>

daz
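
A quick way to confirm that the `$:.unshift` line really made Ruby pick up the newer copy is to print the library version and the file that was loaded. This is only a sketch: `REXML::VERSION` is the constant current REXML releases define (I believe very old releases spelled it `REXML::Version`, so treat the name as an assumption there), and the `loaded_from` variable is just an illustrative name.

```ruby
# Put the directory holding the newer rexml/ folder first, as in the post:
# $:.unshift('C:/Ruby/TEMP')
require 'rexml/document'

# Version string reported by whichever REXML copy was actually loaded.
puts "REXML version: #{REXML::VERSION}"

# $LOADED_FEATURES records every file that `require` has loaded, so the
# entry ending in rexml/document.rb shows which copy won the lookup.
loaded_from = $LOADED_FEATURES.grep(%r{rexml/document\.rb\z})
puts loaded_from
```

If the path printed is still the one inside your Ruby installation, the `unshift` line is probably pointing at the wrong directory (it should be the parent of the `rexml` folder, not the folder itself).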

Hi --

On Sun, 4 Sep 2005, ted wrote:

> Thanks Gavin. Unfortunately I can't remove the anchors. The html is just a
> sample of the documents (not my docs) that I'm given to parse. Someone on
> IRC mentioned that XPath 1.0 doesn't guarantee the order of elements.

I would be astonished if Sean Russell had combed through the 1.0 spec
to find some loophole that made it plausible to have an iteration not
follow document order. I could be wrong but I think it's more likely
a REXML bug.

David

--
David A. Black
dblack@wobblini.net
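
For anyone stuck on a REXML that shows the ordering problem and unable to upgrade, one possible workaround is to not rely on the order of the XPath result at all and walk the tree yourself. The sketch below assumes `each_recursive` is available in your REXML (it is in current releases; I have not checked 1.8.2), and `spans_in_document_order` is a made-up helper name, not part of the library.

```ruby
require 'rexml/document'

html = <<HTML
<html>
<body>
<a name="1"/>
<table><tr><td><span class="c5"><b>1st Title</b></span></td></tr></table>
<a name="2"/>
<table><tr><td><span class="c5"><b>2nd Title</b></span></td></tr></table>
<a name="3"/>
<table><tr><td><span class="c5"><b>3rd Title</b></span></td></tr>
</table>
</body>
</html>
HTML

# Hypothetical helper: collect <span class="..."> elements by walking the
# tree depth-first, so they come back in the order they appear in the file.
def spans_in_document_order(doc, css_class)
  found = []
  doc.root.each_recursive do |el|
    found << el if el.name == 'span' && el.attributes['class'] == css_class
  end
  found
end

doc = REXML::Document.new(html)
titles = spans_in_document_order(doc, 'c5').map { |span| span.elements['b'].text }
puts titles
```

This trades the XPath expression for a manual filter, so it only suits simple matches like this one; for anything more involved, upgrading REXML as daz describes is the better route.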

Thanks daz.

I just wanted to mention that I encountered the same bug and that the
new version of the library fixed it for me. Thank you very much for
the clear instructions. If only for-pay products had support that was
this good...

         - dan

--
Dan Kohn <mailto:dan@dankohn.com>
<http://www.dankohn.com/> <tel:+1-415-233-1000>

Daz, there's a bug in the CVS version of REXML. The following code
produces the error below, but works perfectly with the default 1.8.2
REXML (i.e., when I comment out the first line).

ruby rexmlbug.rb

C:/Dan/dev/rexml/xpath_parser.rb:157:in `expr': undefined method
`delete_if' for nil:NilClass (NoMethodError)
  from C:/Dan/dev/rexml/xpath_parser.rb:481:in `d_o_s'
  from C:/Dan/dev/rexml/xpath_parser.rb:478:in `each_index'
  from C:/Dan/dev/rexml/xpath_parser.rb:478:in `d_o_s'
  from C:/Dan/dev/rexml/xpath_parser.rb:469:in `descendant_or_self'
  from C:/Dan/dev/rexml/xpath_parser.rb:314:in `expr'
  from C:/Dan/dev/rexml/xpath_parser.rb:125:in `match'
  from C:/Dan/dev/rexml/xpath_parser.rb:56:in `parse'
  from C:/Dan/dev/rexml/xpath.rb:53:in `each'
  from rexmlbug.rb:28

Exit code: 1

$:.unshift('C:/Dan/dev') # for rexml fixes
require "rexml/document"
include REXML
string = <<EOF
        <html>
        <td class="t4"><a href="javascript:lu('OZ')">OZ</a>
        0204 F Class
        <a href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/ICN,itn/air/mp">
        ICN</a> to <a
        href="/cgi/get?apt:uMl8TIcSlHI*itn/airports/LAX,itn/air/mp">
        LAX</a></td>
        <tr>
        <td class="t4"><font color="white">UNITED</font></td>
        <td colspan="4" align="right">
        <strong>48,164</strong></td>
        </tr>
        <tr>
        <td class="t4"><font color="white">Star
        Alliance</font></td>
        <td colspan="4" align="right">
        <strong>49,072</strong></td>
        </tr>
        </html>
EOF

doc = Document.new string.gsub!(/\s+|&nbsp;/," ")
array = Array.new
XPath.each( doc, "//td[@colspan='4']/preceding-sibling::td/child::*") { |cell|
  array << cell.texts.to_s }
puts array

Dan Kohn wrote:

> Daz, there's a bug in the CVS version of REXML.
> The following code produces the error below, but works
> perfectly with the default 1.8.2 REXML [...]

Thanks Dan.

The REXML code in that area looks quite "fluid" and
there are clear warnings to "turn away" (which I heeded ;).

I've filed a bug report which you might want to check over
and then keep a watch on.

http://www.germane-software.com/projects/rexml/ticket/32

Cheers,

daz