REXML ... performance & memory usage

Wow ... I am trying to use REXML to parse through an 8.8 MB XML file ... it currently eats almost 800 MB of RAM before it seems to do anything ...

Does anybody have any tips on getting REXML to run faster and/or smaller ???

I know it's slow just because it's pure ruby ... and there's a lot going on ... but ... I can sit here for many minutes just waiting for ANY console output showing that it's actually gotten to the first root.elements.each( xpath_expr ) iteration ...

Hints/Tips are/would be VERY much appreciated.

Thanks in advance.

jd

Jeff Wood wrote:

Does anybody have any tips on getting REXML to run faster and/or smaller ???

If having a pure ruby parser is not a requirement and you're on *nix, then you can get great performance out of:

http://libxml.rubyforge.org/

It uses libxml2 for the parsing, and as such is quite speedy.

Tom

Jeff Wood wrote:

Wow ... I am trying to use REXML to parse through an 8.8 MB XML file ...
it currently eats almost 800 MB of RAM before it seems to do anything ...

At that file size, I'd start thinking about swallowing the bitter pill
and using stream / pull parsing instead of tree parsing. Even with a C
parser, building a DOM is not going to do much good for performance if
you need to do processing at that scale more than occasionally. But
then again, there's the premature optimization quote that says to wait
with that just yet.

David Vallner
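
For anyone wondering what stream parsing looks like with REXML itself, here is a minimal sketch. It assumes a made-up document of <page> elements with <title> and <id> children; the TitleListener class, the stream_titles helper and the sample data are hypothetical names for illustration only. REXML calls the listener as it reads, so no tree is built.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Hypothetical listener: collects the text of every <title> element
# without building a document tree.
class TitleListener
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  # Called for every opening tag.
  def tag_start(name, _attrs)
    return unless name == 'title'

    @in_title = true
    @titles << String.new
  end

  # Called for every run of character data.
  def text(data)
    @titles.last << data if @in_title
  end

  # Called for every closing tag.
  def tag_end(name)
    @in_title = false if name == 'title'
  end
end

# Made-up sample data standing in for the real 8.8 MB file.
SAMPLE_XML = <<~XML
  <wiki>
    <page><title>Ruby</title><id>1</id></page>
    <page><title>REXML</title><id>2</id></page>
  </wiki>
XML

# Accepts any IO (for a real file: File.open(path) { |f| stream_titles(f) }).
def stream_titles(io)
  listener = TitleListener.new
  REXML::Document.parse_stream(io, listener)
  listener.titles
end

p stream_titles(StringIO.new(SAMPLE_XML))
```

The price is that you track state yourself (the @in_title flag here), which is the "bitter pill" part.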

magic/xml has an extremely convenient stream parsing interface.
It's based on REXML so it's pretty slow, but it handles XML files
hundreds of MB in size using just a few MB of memory.

The idea is simple - you give it a block, and the block
keeps getting incomplete subtrees. It can either decide
to complete the current subtree (all children read to memory),
or to get inside it.

It's something like:

XML.parse_as_twigs(STDIN) {|node|
  next unless node.name == :page
  node.complete! # Read all children of <page>...</page> node
  t = node[:@title] # :@title is a child
  i = node[:@id] # :@id is another child
  print "#{i}: #{t}\n"
}

A short tutorial at http://zabor.org/taw/magic_xml/tutorial.html

I think subtree-based parsers are a great tradeoff between
convenience of read-everything parsers and low memory use
of stream-based parsers. Deciding inside a block seems
much more natural than predefining matched tags (like
in Perl's XML::Twig).

Enjoy :)

--
Tomasz Wegrzanowski [ http://t-a-w.blogspot.com/ ]
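
The subtree idea can also be approximated with plain REXML. The following is only a sketch under stated assumptions: it is not magic/xml's API, it uses REXML's pull parser, and each_subtree plus the sample data are hypothetical names. Unlike the block-decides approach described above, it always completes every subtree whose tag matches, so it is closer to the predefined-tag style.

```ruby
require 'rexml/document'
require 'rexml/parsers/pullparser'

# Hypothetical helper: yields one complete REXML::Element for each
# <tag_name> subtree. Only the subtree being assembled is kept in memory.
def each_subtree(source, tag_name)
  parser = REXML::Parsers::PullParser.new(source)
  current = nil # element being built, or nil while outside a subtree

  while parser.has_next?
    event = parser.pull
    if event.start_element?
      name = event[0]
      attrs = event[1]
      if current
        # Nested element inside the subtree: descend into it.
        current = current.add_element(name, attrs)
      elsif name == tag_name
        # Start of a new subtree.
        current = REXML::Element.new(name)
        current.add_attributes(attrs)
      end
    elsif event.text? && current
      # event[1] is the unnormalized text of a text event.
      current.add_text(event[1])
    elsif event.end_element? && current
      if current.parent
        current = current.parent
      else
        yield current # subtree is complete
        current = nil
      end
    end
  end
end

# Made-up sample data.
SAMPLE_XML = <<~XML
  <wiki>
    <page><title>Ruby</title><id>1</id></page>
    <page><title>REXML</title><id>2</id></page>
  </wiki>
XML

each_subtree(SAMPLE_XML, 'page') do |page|
  puts "#{page.elements['id'].text}: #{page.elements['title'].text}"
end
```

Inside the block you get an ordinary REXML::Element, so the usual element and XPath accessors work on it.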

Tom Werner wrote:

Jeff Wood wrote:

Does anybody have any tips on getting REXML to run faster and/or smaller ???

If having a pure ruby parser is not a requirement and you're on *nix, then you can get great performance out of:

http://libxml.rubyforge.org/

It uses libxml2 for the parsing, and as such is quite speedy.

Tom

I had to make two fixes to the source to get things to compile

ruby_xml_parser.c & ruby_xml_document.c both needed an #include <stdarg.h> added ... the compiler wasn't happy about trying to deal with the va_list data type without it.

But, it's compiling now ... just thought I'd pass the information along for ya.

After modifying my script to use the libxml binding ... it's sitting @ about 220M used instead of 800+M ... ( better ) ... and does only take 10-20 seconds to start iterating over data ...

So, thank you for the pointer ...

jd

David Vallner wrote:

Jeff Wood wrote:

Wow ... I am trying to use REXML to parse through an 8.8 MB XML file ...
it currently eats almost 800 MB of RAM before it seems to do anything ...

At that file size, I'd start thinking about swallowing the bitter pill
and using stream / pull parsing instead of tree parsing. Even with a C
parser, building a DOM is not going to do much good for performance if
you need to do processing at that scale more than occasionally. But
then again, there's the premature optimization quote that says to wait
with that just yet.

I would not necessarily call that premature optimization. If these kinds of files are to be parsed frequently and if only a portion of them needs extracting then I would also go down the stream parser road.

From my experience stream parsers are also appropriate if you have to transform the XML tree of a document into some other object structure. IMHO the coding effort for transforming a DOM into another object tree vs. doing the same with the stream approach is quite equivalent. And runtime wise you save yourself one whole tree traversal by going stream.

Kind regards

  robert
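
To illustrate turning XML straight into another object structure while streaming, here is a rough sketch. The Page struct, the PageBuilder listener, the build_pages helper and the sample data are all assumptions made up for the example. The objects are filled in as the parser reads, so there is no intermediate DOM to walk afterwards.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical target structure.
Page = Struct.new(:id, :title)

# Hypothetical listener that builds Page objects directly from the stream.
class PageBuilder
  include REXML::StreamListener

  attr_reader :pages

  def initialize
    @pages = []
    @page = nil            # Page currently being filled in
    @buffer = String.new   # text seen since the last tag
  end

  def tag_start(name, _attrs)
    @page = Page.new if name == 'page'
    @buffer = String.new
  end

  def text(data)
    @buffer << data
  end

  def tag_end(name)
    case name
    when 'id'    then @page.id = Integer(@buffer) if @page
    when 'title' then @page.title = @buffer if @page
    when 'page'
      @pages << @page
      @page = nil
    end
    @buffer = String.new
  end
end

# Made-up sample data.
SAMPLE_XML = <<~XML
  <wiki>
    <page><title>Ruby</title><id>1</id></page>
    <page><title>REXML</title><id>2</id></page>
  </wiki>
XML

def build_pages(source)
  builder = PageBuilder.new
  REXML::Document.parse_stream(source, builder)
  builder.pages
end

build_pages(SAMPLE_XML).each { |page| puts "#{page.id}: #{page.title}" }
```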

Tomasz Wegrzanowski wrote:

magic/xml has an extremely convenient stream parsing interface.
It's based on REXML so it's pretty slow, but it handles XML files
hundreds of MB in size using just a few MB of memory.

Thanks for the tip, I'll have to take a look...

jd

Back in the world of j... there are these libs (nux and dom4j and
probably more). They let you stream parse and register callbacks to
xpath expressions. Whenever a registered xpath is encountered it
invokes the callback for that xpath using a dom object (not w3c
DOM...) for the complete sub tree. This is very convenient and raises
the abstraction a bit (the xpath part) from what seems to be your
approach. They don't allow full xpath but only those parts that make
sense in this context.

Anyways, look into it, it's very nice.

/Marcus

ps. I think XML processing tools suck quite a bit in Ruby (I love
Ruby...). You cannot do high-performance processing in a
cross-platform way (as far as I know): it's libxml on *nix or MSXML on
Windows (since REXML sucks performance-wise). It's kind of sad. Is it
impossible to make libxml/libxslt work on Windows?
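
A very reduced Ruby version of the "register callbacks on paths while streaming" idea could look like the sketch below. It is not the nux or dom4j API; PathDispatcher, its on method, the collect_text helper and the sample data are hypothetical. As a simplification it only understands plain absolute paths such as /wiki/page/title and hands the callback the element's own text rather than a complete subtree object.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical dispatcher: calls registered blocks with an element's
# direct text when an element at a matching absolute path closes.
# Supports plain paths only: no predicates, axes or wildcards.
class PathDispatcher
  include REXML::StreamListener

  def initialize
    @handlers = Hash.new { |hash, path| hash[path] = [] }
    @stack = [] # names of the currently open elements
    @texts = [] # text collected for each open element
  end

  # Registers a block for a path; returns self so calls can be chained.
  def on(path, &block)
    @handlers[path] << block
    self
  end

  def tag_start(name, _attrs)
    @stack.push(name)
    @texts.push(String.new)
  end

  def text(data)
    @texts.last << data unless @texts.empty?
  end

  def tag_end(_name)
    path = "/#{@stack.join('/')}"
    content = @texts.pop
    @stack.pop
    # fetch avoids creating entries for paths nobody registered.
    @handlers.fetch(path, []).each { |handler| handler.call(content) }
  end
end

# Made-up sample data.
SAMPLE_XML = <<~XML
  <wiki>
    <page><title>Ruby</title><id>1</id></page>
    <page><title>REXML</title><id>2</id></page>
  </wiki>
XML

def collect_text(xml, path)
  found = []
  dispatcher = PathDispatcher.new.on(path) { |text| found << text }
  REXML::Document.parse_stream(xml, dispatcher)
  found
end

p collect_text(SAMPLE_XML, '/wiki/page/title')
```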

···

On 11/9/06, Tomasz Wegrzanowski <tomasz.wegrzanowski@gmail.com> wrote:

I think subtree-based parsers are a great tradeoff between
convenience of read-everything parsers and low memory use
of stream-based parsers. Deciding inside a block seems
much more natural than predefining matched tags (like
in Perl's XML::Twig).

I can vouch for that. I changed a bit of slow code from using REXML to
libxml, with fairly minor alterations. The work didn't take long, and
it made a huge difference:

REXML: 0.539 seconds
libxml: 0.012 seconds

Paul.
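
For reference, timings like these can be taken with Ruby's standard Benchmark module. The sketch below only measures REXML building a tree from a synthetic document; sample_xml and time_tree_parse are hypothetical helpers, and the libxml side is left out because it needs the C extension installed. The same Benchmark.realtime wrapper would go around whatever call the other parser uses.

```ruby
require 'benchmark'
require 'rexml/document'

# Hypothetical helper: builds a synthetic document with count <page> elements.
def sample_xml(count)
  pages = (1..count).map do |i|
    "<page><id>#{i}</id><title>Page #{i}</title></page>"
  end
  "<wiki>#{pages.join}</wiki>"
end

# Returns the wall-clock seconds REXML needs to build the full tree.
def time_tree_parse(xml)
  Benchmark.realtime { REXML::Document.new(xml) }
end

xml = sample_xml(1_000)
puts format('REXML tree parse: %.3f seconds', time_tree_parse(xml))
```

Numbers depend heavily on the machine and the document, so compare parsers on the same input in the same run.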

···

On 03/11/06, Tom Werner <pubsub@rubyisawesome.com> wrote:

If having a pure ruby parser is not a requirement and you're on *nix,
then you can get great performance out of:

http://libxml.rubyforge.org/

It uses libxml2 for the parsing, and as such is quite speedy.

Jeff Wood wrote:

After modifying my script to use the libxml binding ... it's sitting @ about 220M used instead of 800+M ... ( better ) ... and does only take 10-20 seconds to start iterating over data ...

WOW.

You might try optimizing your XPath query. I'm no expert at this (or even knowledgeable), but I did find in the past that changing the XPath sometimes made a drastic difference in performance.

Devin
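
As a small illustration of what changing the XPath can mean, the sketch below runs two expressions that select the same nodes from a made-up document. The titles_via helper and the sample data are assumptions for the example. A leading // asks for matches anywhere in the document, while a fully spelled-out path names each step; whether that is actually faster in a given REXML version is something to measure, not something this sketch proves.

```ruby
require 'rexml/document'

# Made-up sample data.
SAMPLE_XML = <<~XML
  <wiki>
    <page><title>Ruby</title><id>1</id></page>
    <page><title>REXML</title><id>2</id></page>
  </wiki>
XML

DOC = REXML::Document.new(SAMPLE_XML)

# Hypothetical helper: returns the text of every node matching xpath.
def titles_via(doc, xpath)
  REXML::XPath.match(doc, xpath).map(&:text)
end

# Descendant search: matches <title> at any depth.
anywhere = titles_via(DOC, '//title')

# Anchored path: only <title> children of /wiki/page.
anchored = titles_via(DOC, '/wiki/page/title')

puts "same result: #{anywhere == anchored}"
```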

Jeff,

I recently ported the (freeware) Chilkat XML parser to Ruby, but it only runs
on Windows. I'm curious to see how it performs in comparison. Do you have
a simple example w/ data that I can use to convert to Chilkat XML? I'll be happy
to write the code...

Best Regards,
Matt Fausey

It's a good job I try to keep up with happenings on ruby-talk :) Thanks
for posting about this - it's fixed in CVS now.

Also, given your input data, you might be interested to know that I'm
currently working on a developmental branch for libxml-ruby 0.4, which
includes a new, faster SAX callback interface (among many other
changes). The branch name is DEV_0_4, and it's getting to be quite
stable now.

Also, we have a mailing list:

  http://rubyforge.org/mail/?group_id=494

Thanks again,

···

On Sat, 2006-11-04 at 09:33 +0900, Jeff Wood wrote:

I had to make two fixes to the source to get things to compile

ruby_xml_parser.c & ruby_xml_document.c both needed an #include
<stdarg.h> added ... the compiler wasn't happy about trying to deal
with the va_list data type without it.

--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk

Marcus Bristav wrote:

Back in the world of j...

*groan*
*facedesk*
*moan*

Right, can ANYONE explain this braindead fad to me?

Hint: No matter what some of the more loudmouthed bloggers would like to
insinuate in the massive ongoing circlejerk of FUD (from both the Ruby
and the Java side of things):

A) There is no conspiracy of panicking Java (yes, that IS the word)
developers desperately trying to eradicate Ruby in fear for their jobs

B) Having more advanced development tools doesn't increase your penis
size nor girth

C) Being able to code without advanced development tools doesn't
increase your penis size nor girth

D) Blog commenters that swoon over keypress count comparisons aren't
visionaries that have Seen The Truth, they're hapless muppets without
much attention span and too much time on their hands. People who do
actual work can tell that it's completely irrelevant to actual practice
and so much waste of webspace and bandwidth

E) Ruby won't kill Java, Java won't kill Ruby, C# won't kill Java, Ruby
won't kill Python, Ajax won't kill the desktop, ActiveRecord won't kill
Hibernate, Rails won't kill Rife, Rife won't kill Rails...

F) No matter how long, or with which fervency you'll compare apples to
oranges, they won't taste equally good to all people

</rant>

Now, is there any chance the general audience of this mailing list will
ever be able to mention other programming languages for the sake of
comparison without in some way indicating revilement of such or
reluctance to do so?

David Vallner

PS: I wonder how many people will see this considering points B and C
are likely to send spam filters into a hissy fit.

Dammit! Another six-hundred quid down the drain...

···

On Thu, 2006-11-09 at 20:58 +0900, David Vallner wrote:

B) Having more advanced development tools doesn't increase your penis
size nor girth

--
Ross Bamford - rosco@roscopeco.REMOVE.co.uk