Stream Parsing with REXML

Hi (again, sort of) :slight_smile:

I am still on my quest to write a program that parses a large XML file.
After having tried to do it in tree mode, I had to realize that the
performance was simply abysmal. So back to the drawing board. But, and
here is the thing...I could find a good straight-forward tutorial on how
to write a stream parser using REXML. The official tutorial is pretty
much mute on that part and the only other example I found (or rather was
pointed to -
http://www.janvereecken.com/2007/4/11/event-driven-xml-parser-in-ruby)
was way too complex for someone like me who is still pretty much a
beginner in ruby.

So, what I am looking for is either a brief description of how to write
an event driven parser or else a link to a good and simple tutorial.

For the former, this is what the parser should do:

Find the element "Gene-ref", allow me to access its children and then
close and repeat for the next "Gene-ref entry. In xml code, that would
look like

<something here>
<Gene-ref>
聽聽<name>...</name>
聽聽<start>...</start>
聽聽<end>...</end>
</Gen-ref>
<Gen-ref>
...

I understand that I need a Listener class, like
classListener
聽聽def tag_start(name, attrs)
聽聽end
聽聽def text(text)
聽聽end
聽聽def tag_end(name)
聽聽end
end

But I havent really worked with classes all that much and maybe someone
could just put down the basics for the script from where I can start
experimenting? Would be very much appreciated. Let's say for each
element "Gene-ref" I want to puts the name, start and end in one line,
or something along those lines.

Cheers,

Marc

路路路

--
Posted via http://www.ruby-forum.com/.

I have found that hpricot works very well with large xml files. It
streams it automatically. I use it to parse a 5M xml file and it does
it rather quickly.

http://code.whytheluckystiff.net/doc/hpricot/
http://code.whytheluckystiff.net/hpricot/

xml = <<ENDXML
oh hai
<root>
  <kid1 foo="bar" jim="jam" />
  <kid2 foo="sausage">
    <grandkid1>I ate &lt;raw&gt; sausages&tm;!</grandkid1>
    <grandkid2>
      <![CDATA[
        I ate <raw> sausages&tm;!
      ]]>
    </grandkid2>
  </kid2>
</root>
whoa!
ENDXML

class Listener
  def initialize
    # Use instance variables to mantain information
    # between different callbacks.
    @nest_level = 0
  end

  def tag_start( name, attr_hash )
    @nest_level += 1
    print " " * @nest_level
    puts "<#{name} ...> #{attr_hash.inspect}"
  end

  def text( str )
    print " " * @nest_level
    p str
  end

  # Treat CDATA sections just like text
  alias_method :cdata, :text

  def tag_end( name )
    print " " * @nest_level
    puts "</#{name}>"
    @nest_level -= 1
  end
end

require 'rexml/document'
require 'stringio' # for this demo
stream_handler = Listener.new
pseudo_file = StringIO.new( xml )
REXML::Document.parse_stream( pseudo_file, stream_handler )

#=> OUTPUT:
#=> "oh hai\n"
#=> <root ...> {}
#=> "\n "
#=> <kid1 ...> {"jim"=>"jam", "foo"=>"bar"}
#=> </kid1>
#=> "\n "
#=> <kid2 ...> {"foo"=>"sausage"}
#=> "\n "
#=> <grandkid1 ...> {}
#=> "I ate <raw> sausages&tm;!"
#=> </grandkid1>
#=> "\n "
#=> <grandkid2 ...> {}
#=> "\n "
#=> "\n I ate <raw> sausages&tm;!\n "
#=> "\n "
#=> </grandkid2>
#=> "\n "
#=> </kid2>
#=> "\n"
#=> </root>
#=> "\nwhoa!\n"

路路路

On Jan 12, 9:58 am, Marc Hoeppner <marc.hoepp...@molbio.su.se> wrote:

Hi (again, sort of) :slight_smile:

I am still on my quest to write a program that parses a large XML file.
After having tried to do it in tree mode, I had to realize that the
performance was simply abysmal. So back to the drawing board. But, and
here is the thing...I could find a good straight-forward tutorial on how
to write a stream parser using REXML. The official tutorial is pretty
much mute on that part and the only other example I found (or rather was
pointed to -http://www.janvereecken.com/2007/4/11/event-driven-xml-parser-in-ruby\)
was way too complex for someone like me who is still pretty much a
beginner in ruby.

So, what I am looking for is either a brief description of how to write
an event driven parser or else a link to a good and simple tutorial.

Here's a more targeted example, since I'm feeling helpful:

class Gene
  attr_accessor :name, :start, :end
  def initialize( name=nil, start=nil, endd=nil )
    @name = name
    self.start = start
    self.end = endd
  end
  def start=( start )
    @start = start.to_i
  end
  def end=( endd )
    @end = endd.to_i
  end
  def duration
    @end - @start
  end
  def to_s
    "<Gene '#{@name}' #{@start}-#{@end} (#{duration})> "
  end
end

class GeneParser
  GENE_TAG_NAME = "Gene-ref"
  def tag_start( name, attributes )
    case name
      when "root"
        # ignore
      when GENE_TAG_NAME
        @current_gene = Gene.new
      else
        @current_property = name
    end
  end

  def text( str )
    if @current_gene && @current_property
      @current_gene.send( @current_property.to_s + "=", str )
    end
  end

  def tag_end( name )
    if name == GENE_TAG_NAME
      puts @current_gene
      @current_gene = nil
    else
      @current_property = nil
    end
  end
end

require 'rexml/document'
REXML::Document.parse_stream( DATA, GeneParser.new )
#=> OUTPUT:
#=> <Gene 'foo' 17-42 (25)>
#=> <Gene 'bar' 43-50 (7)>

__END__
<root>
  <Gene-ref>
    <name>foo</name>
    <start>17</start>
    <end>42</end>
  </Gene-ref>
  <Gene-ref>
    <name>bar</name>
    <start>43</start>
    <end>50</end>
  </Gene-ref>
</root>

路路路

On Jan 12, 9:58 am, Marc Hoeppner <marc.hoepp...@molbio.su.se> wrote:

Find the element "Gene-ref", allow me to access its children and then
close and repeat for the next "Gene-ref entry. In xml code, that would
look like

<something here>
<Gene-ref>
<name>...</name>
<start>...</start>
<end>...</end>
</Gen-ref>
<Gen-ref>
...

Hi Marc and Dusty,

I have found that hpricot works very well with large xml files. It
streams it automatically. I use it to parse a 5M xml file and it does
it rather quickly.

Dusty, you sure hpricot is streaming?

At any rate, it is a lot faster than any pure Ruby XML parser. Marc, if performance is your only problem hpricot is worth looking at. It'll only take a few seconds to know, it's something like 4 lines of Ruby to find out:

require 'rubygems'
require 'hpricot'

doc = open(<your file name here>) do |f|
   Hpricot.XML(f)
end

If that works well enough you're probably set. Hpricot's search and query stuff is pretty nice.

If you still have problems, let us know. If you really need streaming you have a few options.

http://code.whytheluckystiff.net/doc/hpricot/
http://code.whytheluckystiff.net/hpricot/

Cheers,
Bob

路路路

On 12-Jan-08, at 1:55 PM, dusty wrote:

----
Bob Hutchison -- tumblelog at http://www.recursive.ca/so/
Recursive Design Inc. -- weblog at http://www.recursive.ca/hutch
http://www.recursive.ca/ -- works on http://www.raconteur.info/cms-for-static-content/home/

Thanks for the great responses!

Just for clarification though:

tag_start identifies an element like "gene" ala

def tag_start(name, attrib)
聽聽if name=="gene"
聽聽聽do something here
聽聽end
end

So tag_end is then used if I want to puts everything that was done with
the element and its children like storing some values in an array or
something?

/Marc

路路路

--
Posted via http://www.ruby-forum.com/.

Yeah, you could do that. Basically it's completely up to you. The parser just hands off events for parsed XML items (like starting tags, closing tags, text data etc.) in the order found in the document.

I have attached another example of an idiom that I use frequently. This may not be needed in your case but who knows? Basic idea is that the event listener creates nested listeners (one per element) and hands processing off to them while maintaining a stack of listeners. That way you can do different processing based on element name, attributes or whatever. Might be overkill for your simple example but OTOH if you do need to do complex processing steps based on elements this might be exactly what you need. For example, you can store any information you need in a nested listener and do all the processing in end tag.

Kind regards

  robert

rexml-stream.rb (1.52 KB)

路路路

On 13.01.2008 13:37, Marc Hoeppner wrote:

Thanks for the great responses!

Just for clarification though:

tag_start identifies an element like "gene" ala

def tag_start(name, attrib)
  if name=="gene"
   do something here
  end
end

So tag_end is then used if I want to puts everything that was done with the element and its children like storing some values in an array or something?

Hi Marc, Robert:

I have attached another example of an idiom that I use frequently. This may not be needed in your case but who knows? Basic idea is that the event listener creates nested listeners (one per element) and hands processing off to them while maintaining a stack of listeners. That way you can do different processing based on element name, attributes or whatever. Might be overkill for your simple example but OTOH if you do need to do complex processing steps based on elements this might be exactly what you need. For example, you can store any information you need in a nested listener and do all the processing in end tag.

That idiom is really frequent in the stuff that I do with XML. It also happens to be a case where a pull parser can make things clearer. Pull parsers are event based like SAX parsers are, they do not build up trees in memory, and they are very fast. The code below is almost equivalent to yours Robert, as far as I could make it anyway. The parents method is different, and I habitually add a node level corresponding to the document since there are things (comments, processing instructions, etc) that are allowed outside of the root element of an XML file.

I wrote that xampl-pp parser in May 2002. It is on source forge <xampl: XML pull parsers download | SourceForge.net > but not as a gem. Hmm, maybe I should do something about that, I've made some fixes to very obscure problems over the years, but never released anything. Anyway the point isn't xampl-pp so much as what a pull parser looks like. Many people don't know about them and if they do they often don't know why they are useful. I think the libxml2 library does pull parsing (and that'll be very fast) and I think REXML does too (it has a few problems with namespaces last I looked). Anyway...

All the interesting stuff happens in the work method.

#!/bin/env ruby

require "xampl-pp"

class Thing

   attr_accessor :tag_name
   attr_accessor :parent
   attr_accessor :text

   def initialize(parent, name)
     @parent = parent
     @tag_name = name
     @text = ""
   end

   def Thing.start(xml)
     pp = Xampl_PP.new
     pp.input = xml

     Thing.new(nil, 'doc').work(pp)
   end

   def parents
     p = self
     while p
       yield p
       p = p.parent
     end
   end

   def work(pp)
     parents do |par|
       print par.tag_name, " "
     end
     puts

     while not pp.endDocument? do
       case pp.nextEvent
         when Xampl_PP::START_ELEMENT
           Thing.new(self, pp.name).work(pp)
         when Xampl_PP::END_ELEMENT
           print " ", text.inspect, "\n"
           return
         when Xampl_PP::TEXT || Xampl_PP::CDATA_SECTION || Xampl_PP::ENTITY_REF then
           text << pp.text
       end
     end
   end
end

Thing.start(DATA)

__END__
<root>
  <Gene-ref>
    <name>foo</name>
    <start>17</start>
    <end>42</end>
  </Gene-ref>
  <Gene-ref>
    <name>bar</name>
    <start>43</start>
    <end>50</end>
  </Gene-ref>
</root>

路路路

On 13-Jan-08, at 1:04 PM, Robert Klemme wrote:

----
Bob Hutchison -- tumblelog at http://www.recursive.ca/so/
Recursive Design Inc. -- weblog at http://www.recursive.ca/hutch
http://www.recursive.ca/ -- works on http://www.raconteur.info/cms-for-static-content/home/

Hi Marc, Robert:

I have attached another example of an idiom that I use frequently. This may not be needed in your case but who knows? Basic idea is that the event listener creates nested listeners (one per element) and hands processing off to them while maintaining a stack of listeners. That way you can do different processing based on element name, attributes or whatever. Might be overkill for your simple example but OTOH if you do need to do complex processing steps based on elements this might be exactly what you need. For example, you can store any information you need in a nested listener and do all the processing in end tag.

That idiom is really frequent in the stuff that I do with XML. It also happens to be a case where a pull parser can make things clearer. Pull parsers are event based like SAX parsers are, they do not build up trees in memory, and they are very fast. The code below is almost equivalent to yours Robert, as far as I could make it anyway. The parents method is different,

I considered that as well but wanted to be able to traverse parent nodes top down for the example. After all it was just a quick hack. :slight_smile: For more robustness and library fitness I would certainly change a few things.

and I habitually add a node level corresponding to the document since there are things (comments, processing instructions, etc) that are allowed outside of the root element of an XML file.

If you look closely that happens in my version as well (see StreamListener#initialize and #tag_start).

I wrote that xampl-pp parser in May 2002. It is on source forge <xampl: XML pull parsers download | SourceForge.net > but not as a gem. Hmm, maybe I should do something about that, I've made some fixes to very obscure problems over the years, but never released anything. Anyway the point isn't xampl-pp so much as what a pull parser looks like. Many people don't know about them and if they do they often don't know why they are useful. I think the libxml2 library does pull parsing (and that'll be very fast) and I think REXML does too (it has a few problems with namespaces last I looked). Anyway...

All the interesting stuff happens in the work method.

#!/bin/env ruby

require "xampl-pp"

class Thing

   attr_accessor :tag_name
   attr_accessor :parent
   attr_accessor :text

   def initialize(parent, name)
     @parent = parent
     @tag_name = name
     @text = ""
   end

   def Thing.start(xml)
     pp = Xampl_PP.new
     pp.input = xml

     Thing.new(nil, 'doc').work(pp)
   end

   def parents
     p = self
     while p
       yield p
       p = p.parent
     end
   end

   def work(pp)
     parents do |par|
       print par.tag_name, " "
     end
     puts

     while not pp.endDocument? do
       case pp.nextEvent
         when Xampl_PP::START_ELEMENT
           Thing.new(self, pp.name).work(pp)
         when Xampl_PP::END_ELEMENT
           print " ", text.inspect, "\n"
           return
         when Xampl_PP::TEXT || Xampl_PP::CDATA_SECTION || Xampl_PP::ENTITY_REF then
           text << pp.text
       end
     end
   end
end

Thing.start(DATA)

Yep, looks pretty similar. To be honest I never used an XML pull parser myself. Personally I prefer the push parser a bit because it avoids the looping and decision logic based on event and element type (in #work). Are there major advantages of pull parsers over push parsers that I have overlooked so far?

Kind regards

  robert

路路路

On 13.01.2008 20:54, Bob Hutchison wrote:

On 13-Jan-08, at 1:04 PM, Robert Klemme wrote:

Ok, I am still on this and am starting to feel a bit stupid...I guess
there is something wrong with the following example, but maybe I should
step back for a while. Anyhow - what's missing to make it work:

The xml file:

<Gene-ref>
聽聽<name>Gene1</name>
聽聽<start>1</start>
聽聽<end>10</end>
</Gen-ref>
<Gene-ref>
聽聽<name>Gene2</name>
聽聽<start>11</start>
聽聽<end>20</end>
</Gen-ref>
<Gene-ref>
聽聽<name>Gene3</name>
聽聽<start>21</start>
聽聽<end>30</end>
</Gen-ref>
<Gene-ref>
聽聽<name>Gene4</name>
聽聽<start>31</start>
聽聽<end>40</end>
</Gen-ref>
<Gene-ref>
聽聽<name>Gene5</name>
聽聽<start>41</start>
聽聽<end>50</end>
</Gen-ref>

The code:

data = File.new(ARGV.shift)

class Listener

聽聽def tag_start( name, attr_hash )
聽聽聽聽if name == 'Gene-ref'
聽聽聽聽聽@name = name
聽聽end
聽聽end

聽聽def text( str )

聽聽end

聽聽def tag_end( name )
聽聽聽聽if name == 'Gene-ref'
聽聽聽聽聽聽puts @name
聽聽聽聽聽聽@name = nil
聽聽聽聽end
聽聽end
end

require 'rexml/document'
REXML::Document.parse_stream( data, Listener.new )

路路路

--
Posted via http://www.ruby-forum.com/.

Hi Robert,

Yep, looks pretty similar. To be honest I never used an XML pull parser myself. Personally I prefer the push parser a bit because it avoids the looping and decision logic based on event and element type (in #work). Are there major advantages of pull parsers over push parsers that I have overlooked so far?

I use both and I'm comfortable with both. The combination of sax and pull gives you a much more complete toolkit for streaming XML.

The biggest advantage (for me) to a pull parser is when you are parsing an XML file and using those events to drive some kind of algorithm. Either the parser or the algorithm is going to have to be able to suspend processing (and I mean save state by that) and resume when the other needs it to. This can be really hard to do for some algorithms, in the case of some it isn't possible without fundamentally changing the algorithm. If you read about pull parsers they will talk about where the flow of control (the 'outermost' loop) is in the program: in the parser or in the application. With sax the parser controls the flow, with pull parsers the application does. A subset of that situation is that the pull parser is just an object that you can pass around, and that can be *really* handy. This also means you can have more than one pull parser operating at a time, and that's incredibly confusing with sax.

You can fake a pull parser by putting a sax parser in a different thread and streaming the events over some kind of channel with some kind of blocking protocol. Some implementations of pull parsers do exactly that, and that is *slow*, yet if you need it people will put up with it. If you've ever had the temptation to do that then you might want to think pull parser.

It is pretty much trivial to write a sax parser based on a pull parser at very small cost (xampl-pp ships with an example that does that). So one trick that you can play is switching back and forth between sax and pull where appropriate.

Junior programmers have a much easier time with pull parsers. It seems to provide a gentler route to sax based software. XML is a big enough step. There is some advantage to avoiding the need for a beginner to learn both XML and event based programming at the same time.

Historically pull parsers tend to be very fast compared to sax parsers. This is not going to last, there's no technical reason for such a difference. In the Java world the speed difference is eroding for the good of all. Anyway, speed isn't the issue it used to be. If you poke around the web you'll still see a bunch of references to performance advantages, this doesn't apply anymore if you choose your parser correctly.

Cheers,
Bob

路路路

On 14-Jan-08, at 2:50 AM, Robert Klemme wrote:

----
Bob Hutchison -- tumblelog at http://www.recursive.ca/so/
Recursive Design Inc. -- weblog at http://www.recursive.ca/hutch
http://www.recursive.ca/ -- works on http://www.raconteur.info/cms-for-static-content/home/

Ok, I am still on this and am starting to feel a bit stupid...I guess
there is something wrong with the following example, but maybe I should
step back for a while. Anyhow - what's missing to make it work:

The xml file:

<Gene-ref>
  <name>Gene1</name>
  <start>1</start>
  <end>10</end>
</Gen-ref>
<Gene-ref>
  <name>Gene2</name>
  <start>11</start>
  <end>20</end>
</Gen-ref>
<Gene-ref>
  <name>Gene3</name>
  <start>21</start>
  <end>30</end>
</Gen-ref>
<Gene-ref>
  <name>Gene4</name>
  <start>31</start>
  <end>40</end>
</Gen-ref>
<Gene-ref>
  <name>Gene5</name>
  <start>41</start>
  <end>50</end>
</Gen-ref>

The code:

data = File.new(ARGV.shift)

class Listener

  def tag_start( name, attr_hash )
    if name == 'Gene-ref'
     @name = name
  end
  end

  def text( str )

You need to place the @name storing code here. But to make it work
you need to remember the current tag's name because you want to only
store "str" in @name if you are in <name>. This makes it necessary
that you add a stack to your listener class which gets pushed and
popped whenever tag_start and tag_end are invoked.

  end

  def tag_end( name )
    if name == 'Gene-ref'
      puts @name
      @name = nil
    end
  end
end

require 'rexml/document'
REXML::Document.parse_stream( data, Listener.new )

Cheers

robert

路路路

2008/1/14, Marc Hoeppner <marc.hoeppner@molbio.su.se>:

--
use.inject do |as, often| as.you_can - without end

> Yep, looks pretty similar. To be honest I never used an XML pull
> parser myself. Personally I prefer the push parser a bit because it
> avoids the looping and decision logic based on event and element
> type (in #work). Are there major advantages of pull parsers over
> push parsers that I have overlooked so far?

I use both and I'm comfortable with both. The combination of sax and
pull gives you a much more complete toolkit for streaming XML.

I read that in the light of your following remarks. Other than that I
do not see a difference with regard to the XML parsed, i.e. both
approaches are equally powerful.

The biggest advantage (for me) to a pull parser is when you are
parsing an XML file and using those events to drive some kind of
algorithm. Either the parser or the algorithm is going to have to be
able to suspend processing (and I mean save state by that) and resume
when the other needs it to. This can be really hard to do for some
algorithms, in the case of some it isn't possible without
fundamentally changing the algorithm. If you read about pull parsers
they will talk about where the flow of control (the 'outermost' loop)
is in the program: in the parser or in the application. With sax the
parser controls the flow, with pull parsers the application does.

I see, that's definitively a situation I was not aware of. I guess I
just haven't been in a situation where I needed to employ such
algorithms. And yes, the instance in control (parser vs. application
code) seems to be the biggest difference. I can see how this makes a
difference for some algorithms.

A
subset of that situation is that the pull parser is just an object
that you can pass around, and that can be *really* handy. This also
means you can have more than one pull parser operating at a time, and
that's incredibly confusing with sax.

:slight_smile:

You can fake a pull parser by putting a sax parser in a different
thread and streaming the events over some kind of channel with some
kind of blocking protocol. Some implementations of pull parsers do
exactly that, and that is *slow*, yet if you need it people will put
up with it. If you've ever had the temptation to do that then you
might want to think pull parser.

It is pretty much trivial to write a sax parser based on a pull parser
at very small cost (xampl-pp ships with an example that does that). So
one trick that you can play is switching back and forth between sax
and pull where appropriate.

Yep. Basically a push parser can be implemented as thin wrapper
around a pull parser.

Junior programmers have a much easier time with pull parsers. It seems
to provide a gentler route to sax based software. XML is a big enough
step. There is some advantage to avoiding the need for a beginner to
learn both XML and event based programming at the same time.

I haven't thought of that either but sounds like a valid point.

Historically pull parsers tend to be very fast compared to sax
parsers. This is not going to last, there's no technical reason for
such a difference. In the Java world the speed difference is eroding
for the good of all. Anyway, speed isn't the issue it used to be. If
you poke around the web you'll still see a bunch of references to
performance advantages, this doesn't apply anymore if you choose your
parser correctly.

Bob, thanks for taking the time and put together this elaborate explanation!

Kind regards

robert

路路路

2008/1/14, Bob Hutchison <hutch@recursive.ca>:

On 14-Jan-08, at 2:50 AM, Robert Klemme wrote:

--
use.inject do |as, often| as.you_can - without end