Fast XML parser, other than libxml

Hello all,

I am looking for a fast XML parser, other than libxml (REXML is not fast enough, and Hpricot won't do this time - I need 'real' XPaths etc).

Some time ago I read about xaggly, but now the site seems to be dead.

Any other suggestions?

Cheers,
Peter

···

_
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby

libxml is a mature C library and quite fast, but is (by default)
DOM-based (as is REXML). If you're really looking for speed, you'll go
with a streaming approach (SAX or otherwise, potentially from libxml).
What sort of "real" XPaths do you need? XPath 1.0? 2.0?
Deep-lookahead/behind? Do you have huge source documents?

Keith

···

On 4/3/07, Peter Szinek <peter@rubyrailways.com> wrote:

I am looking for a fast XML parser, other than libxml (REXML is not fast
enough, and Hpricot won't do this time - I need 'real' XPaths etc).

http://code.google.com/p/roxi/

I don't know how fast it is compared to libxml, though.

···

Peter Szinek <peter@rubyrailways.com> wrote:

Any other suggestions?

--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand
something when his salary depends on not
understanding it" - Upton Sinclair

Keith Fahlgren wrote:

libxml is a mature C library and quite fast, but is (by default)
DOM-based (as is REXML).

Sorry, I did not express myself clearly. I definitely need a DOM-based approach, but REXML is a lot slower than libxml, and libxml can be a PITA to install on some platforms/distros (e.g. it took quite some time on my ubuntu box, because neither gem install nor apt-get wanted to install the newest version which I needed).

The catch is that I would like to use this in my web scraping framework, scRUBYt! - and of course a dependency on libxml would mean that everybody who wants to install scRUBYt! would have to install libxml too. I got tons of support requests from Ubuntu users who had problems installing mechanize (it depends on libssl-ruby there), so I guess this number would be much higher in the case of libxml, which has much more funky dependencies.

If there is no better possibility, I will go with libxml despite this (it is my only concern; otherwise libxml is fine) - but it would be better to have something install-friendly...

What sort of "real" XPaths do you need? XPath 1.0? 2.0?

Real in the sense that it is not Hpricot XPath, which ATM can not even do

/my/stuff/is/@cool

not to talk about more complex expressions.

I guess XPath 1.0 would be completely enough (maybe even Hpricot's, with a few additions) - I really don't need anything complicated.
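
For comparison, here is a minimal sketch of that kind of attribute-selecting path evaluated with REXML's bundled XPath engine. The sample document is invented for illustration; only the path itself comes from the message above.

```ruby
require 'rexml/document'

# Made-up sample document matching the shape of the path /my/stuff/is/@cool.
xml = <<~XML
  <my>
    <stuff>
      <is cool="very"/>
      <is cool="somewhat"/>
    </stuff>
  </my>
XML

doc = REXML::Document.new(xml)

# A path ending in an attribute step yields REXML::Attribute nodes;
# #value returns the attribute's string value.
cool_values = REXML::XPath.match(doc, '/my/stuff/is/@cool').map(&:value)
puts cool_values.inspect
```

This shows only that the expression is supported, not how fast REXML evaluates it.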

Deep-lookahead/behind? Do you have huge source documents?

Well, I am actually first building this document from what I have scraped, so I have pretty much control over it (if it is too big, I just say stop and put the other records into a new doc etc.), so this is not really the problem.

I really just need a fast XML parser which is easy to install, that's all. scRUBYt! is a high-level framework, aimed also at non-programmers, so I can not expect that all my potential users are handy with debian's package policy and the joys of libxml installing on win32 :)

Cheers,
Peter

···


Maybe then you'll simply have to decide whether ease of use or performance is more important to you.

Kind regards

  robert

···

On 04.04.2007 10:53, Peter Szinek wrote:

I really just need a fast XML parser which is easy to install, that's all. scRUBYt! is a high-level framework, aimed also at non-programmers, so I can not expect that all my potential users are handy with debian's package policy and the joys of libxml installing on win32 :)

> libxml is a mature C library and quite fast, but is (by default)
> DOM-based (as is REXML).

Sorry, I did not express myself clearly. I definitely need a DOM-based
approach, but REXML is a lot slower than libxml, and libxml can be a
PITA to install on some platforms/distros (e.g. it took quite some time
on my ubuntu box, because neither gem install nor apt-get wanted to
install the newest version which I needed).

Yeah, you're right about libxml being a pain to install. If you hadn't
cared about installability, I was going to suggest JRuby + (some Java
parser)....

I guess XPath 1.0 would be completely enough (maybe even Hpricot's, with
a few additions) - I really don't need anything complicated.

Yeah, sorry that I don't know of any others.

JEG II wrote:

Sounds like it is time for FasterXML. :)

Know of any good starting points? All the XPath 1.0 work I do is off
of libxml and all of the XPath 2.0 is off of Saxon (Java), so I'm not
sure what should be copied.

Keith

···

On 4/4/07, Peter Szinek <peter@rubyrailways.com> wrote:

Robert Klemme wrote:

···

On 04.04.2007 10:53, Peter Szinek wrote:

I really just need a fast XML parser which is easy to install, that's all. scRUBYt! is a high-level framework, aimed also at non-programmers, so I can not expect that all my potential users are handy with debian's package policy and the joys of libxml installing on win32 :)

Maybe then you'll simply have to decide whether ease of use or performance is more important to you.

Should I interpret this as 'decide between REXML and libxml'?
There are really no other alternatives?

Cheers,
Peter

Not really. I was mostly just making a joke about FasterCSV's name and how it was born.

I do think it's possible to get better performance than REXML offers without resorting to C, though C would be faster still, naturally. I do have some ideas about this, but I haven't actually spent the time to see if I could get a prototype running to prove them.

James Edward Gray II

···

On Apr 4, 2007, at 8:34 AM, Keith Fahlgren wrote:

JEG II wrote:

Sounds like it is time for FasterXML. :)

Know of any good starting points? All the XPath 1.0 work I do is off
of libxml and all of the XPath 2.0 is off of Saxon (Java), so I'm not
sure what should be copied.

AFAIK REXML is the only pure Ruby XML parser - and it comes with the standard distribution. All others will likely have issues similar to libxml's, I guess.

  robert

···

On 04.04.2007 12:00, Peter Szinek wrote:

Robert Klemme wrote:

On 04.04.2007 10:53, Peter Szinek wrote:

I really just need a fast XML parser which is easy to install, that's all. scRUBYt! is a high-level framework, aimed also at non-programmers, so I can not expect that all my potential users are handy with debian's package policy and the joys of libxml installing on win32 :)

Maybe then you'll simply have to decide whether ease of use or performance is more important to you.

Should I interpret this as 'decide between REXML and libxml'?
There are really no other alternatives?

You may find Tim Bray's recent in-depth experiments in attempting to
write a fast pure-Ruby XML parser instructive and informative:

http://www.tbray.org/ongoing/When/200x/2006/11/09/Optimizing-Ruby
http://www.tbray.org/ongoing/When/200x/2006/11/15/RS-Redux

···

On Apr 4, 12:00 pm, Peter Szinek <p...@rubyrailways.com> wrote:

Should I interpret this as 'decide between REXML and libxml'?
There are really no other alternatives?

--
Arto Bendiken | http://bendiken.net/

Sounds like it is time for FasterXML. :)

James Edward Gray II

···

On Apr 4, 2007, at 6:15 AM, Robert Klemme wrote:

On 04.04.2007 12:00, Peter Szinek wrote:

Robert Klemme wrote:

On 04.04.2007 10:53, Peter Szinek wrote:

I really just need a fast XML parser which is easy to install, that's all. scRUBYt! is a high-level framework, aimed also at non-programmers, so I can not expect that all my potential users are handy with debian's package policy and the joys of libxml installing on win32 :)

Maybe then you'll simply have to decide whether ease of use or performance is more important to you.

Should I interpret this as 'decide between REXML and libxml'?
There are really no other alternatives?

AFAIK REXML is the only pure Ruby XML parser - and it comes with the standard distribution.

And:

http://www.tbray.org/ongoing/When/200x/2006/11/23/RX-plus-YARV

The series is an interesting read. Tim's pretty focused on the character based parsing and in my experience that's always death in Ruby. It's the primary reason the standard CSV library is so slow, for example.

He says it's because Ruby's regex engine isn't really up to the task of handling non-UTF-8 input. I'm pretty sure I understand why that is, but he also basically admits that at least the lexing stage of XML reading is just looking for < and &. I guess the problem becomes that a UTF-16 document actually encodes that as two bytes? Well, surely the regex could be adapted to handle that. In fact, the key expressions could be swapped out for encoding-aware replacements. Then we can keep playing to Ruby's strengths, I hope. Or is it true that there are some encodings we can't effectively build expressions for?
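
To make the "just looking for < and &" idea concrete, here is a minimal sketch of a regex-driven lexing step using Ruby's standard StringScanner. It is not Tim Bray's RX code; the method name and token format are hypothetical, and it assumes the input is already UTF-8 (or ASCII-compatible).

```ruby
require 'strscan'

# Hypothetical helper: split input into runs of ordinary text and the
# markup-significant characters '<' and '&'. The regex engine skips over
# whole text runs in one call instead of Ruby code visiting each character.
def lex_markup(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    if (text = scanner.scan(/[^<&]+/))
      tokens << [:text, text]
    else
      # The next character must be '<' or '&'; consume exactly one.
      tokens << [:markup, scanner.getch]
    end
  end
  tokens
end

tokens = lex_markup('a &amp; b <c/>')
tokens.each { |kind, value| puts "#{kind}: #{value.inspect}" }
```

A real lexer would go on to interpret what follows each marker (tag name, entity reference, and so on); the sketch stops at locating the markers.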

Sorry for thinking out loud here. I'm just trying to better understand Tim's logic. It's interesting stuff.

I'll go try to read his code now and see what else I can learn...

James Edward Gray II

···

On Apr 4, 2007, at 5:05 PM, Arto Bendiken wrote:

On Apr 4, 12:00 pm, Peter Szinek <p...@rubyrailways.com> wrote:

Should I interpret this as 'decide between REXML and libxml'?
There are really no other alternatives?

You may find Tim Bray's recent in-depth experiments in attempting to
write a fast pure-Ruby XML parser instructive and informative:

ongoing by Tim Bray · An RX for Ruby Performance
ongoing by Tim Bray · RX Redux

James Edward Gray II wrote:

Robert Klemme wrote:

I really just need a fast XML parser which is easy to install, that's all. scRUBYt! is a high-level framework, aimed also at non-programmers, so I can not expect that all my potential users are handy with debian's package policy and the joys of libxml installing on win32 :)

Maybe then you'll simply have to decide whether ease of use or performance is more important to you.

Should I interpret this as 'decide between REXML and libxml'?
There are really no other alternatives?

AFAIK REXML is the only pure Ruby XML parser - and it comes with the standard distribution.

Sounds like it is time for FasterXML. :)

One pointer: REXML comes with quite a fast pull parser, and it should be possible to base some lightweight XML document lib on that. (The documentation says that the API should not be considered stable, but I'm sure that could be resolved with the REXML author.)

As a proof of concept, see the attached code. We use it in our company to load and process XML files generated by our tools and OpenOffice Calc.
I just tested it on a 1 MB XML file from an .ods file, which it loaded successfully in under 2 seconds.

Writing a fast XPath implementation to match this might be quite a challenge, though. ;)
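
The attachment itself is not reproduced in this archive. As a rough illustration of the approach described above, here is a minimal sketch that builds a small tree on top of REXML's pull parser. The Node struct and build_tree method are hypothetical names, not part of REXML and not the attached xmlsimple2.rb.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical lightweight node type; not part of REXML.
Node = Struct.new(:name, :attributes, :children, :text)

# Build a tree of Node objects by pulling events one at a time.
def build_tree(xml)
  parser = REXML::Parsers::PullParser.new(xml)
  root = nil
  stack = []
  while parser.has_next?
    event = parser.pull
    if event.start_element?
      # For start_element events, event[0] is the tag name and
      # event[1] is a hash of attribute names to values.
      node = Node.new(event[0], event[1], [], String.new)
      stack.last.children << node unless stack.empty?
      root ||= node
      stack.push(node)
    elsif event.text?
      # event[0] is the raw text (entity references are not expanded here).
      stack.last.text << event[0] unless stack.empty?
    elsif event.end_element?
      stack.pop
    end
  end
  root
end

tree = build_tree('<a x="1"><b>hi</b><b>there</b></a>')
puts tree.name
puts tree.children.map(&:text).inspect
```

Comments, processing instructions and CDATA sections are simply ignored in this sketch; a usable library would need to decide how to handle them.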

Dennis

xmlsimple2.rb (1.56 KB)

···

On Apr 4, 2007, at 6:15 AM, Robert Klemme wrote:

On 04.04.2007 12:00, Peter Szinek wrote:

On 04.04.2007 10:53, Peter Szinek wrote:

Is the inverse the reason that FasterCSV is so fast (because it uses
regular expressions)?

Thanks,
Keith

···

On 4/4/07, James Edward Gray II <james@grayproductions.net> wrote:

The series is an interesting read. Tim's pretty focused on the
character based parsing and in my experience that's always death in
Ruby. It's the primary reason the standard CSV library is so slow,
for example.

For the lazy: http://www.tbray.org/code/rx-yarv.tgz

···

On 4/4/07, James Edward Gray II <james@grayproductions.net> wrote:

I'll go try to read his code now and see what else I can learn...

James Edward Gray II wrote:

The series is an interesting read. Tim's pretty focused on the character based parsing and in my experience that's always death in Ruby. It's the primary reason the standard CSV library is so slow, for example.

Pardon my naïveté, but why isn't the answer to make libxml, wrapped in a Rubyonic high-level API, part of stdlib? Stable, fast, and it wouldn't be the first Ruby extension in stdlib.

Devin
(Not that I'm criticizing anybody for not doing this. I'm certainly not stepping up.)

Or, another thought, introduce an Iconv filter that normalizes the input to UTF-8. This probably degrades the performance against non-UTF-8 documents, but Tim's code had trouble in that area too.
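
A minimal sketch of that normalization idea follows. It swaps in String#encode for Iconv, since Iconv is no longer part of Ruby's standard library. The method name is hypothetical, and the sketch assumes the source encoding has already been determined (for example from a byte order mark or the XML declaration).

```ruby
# Hypothetical helper: reinterpret raw bytes in their declared encoding,
# then transcode to UTF-8 so a single set of patterns can be used.
def normalize_to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

# In UTF-16LE, '<' occupies two bytes, which is why byte-oriented
# patterns written for UTF-8 do not match the raw input directly.
lt_bytes = '<'.encode(Encoding::UTF_16LE).bytes

# Simulate reading a UTF-16LE document as raw bytes.
raw  = "<a>caf\u00e9 &amp; bar</a>".encode(Encoding::UTF_16LE).b
utf8 = normalize_to_utf8(raw, Encoding::UTF_16LE)

# After normalization, the ordinary pattern finds the markers.
markers = utf8.scan(/[<&]/)
puts markers.inspect
```

The transcoding pass costs one extra copy of the document, which is the performance trade-off mentioned above.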

Is it legal for a well behaved XML processor to expose character data to an application in an encoding other than the actual document encoding? I didn't see anything in the specification to suggest it wasn't.

James Edward Gray II

···

On Apr 4, 2007, at 7:45 PM, James Edward Gray II wrote:

On Apr 4, 2007, at 5:05 PM, Arto Bendiken wrote:

On Apr 4, 12:00 pm, Peter Szinek <p...@rubyrailways.com> wrote:

Should I interpret this as 'decide between REXML and libxml'?
There are really no other alternatives?

You may find Tim Bray's recent in-depth experiments in attempting to
write a fast pure-Ruby XML parser instructive and informative:

ongoing by Tim Bray · An RX for Ruby Performance
ongoing by Tim Bray · RX Redux

And:

ongoing by Tim Bray · RX + YARV

The series is an interesting read. Tim's pretty focused on the character based parsing and in my experience that's always death in Ruby. It's the primary reason the standard CSV library is so slow, for example.

He says it's because Ruby's regex engine isn't really up to the task of handling non-UTF-8 input. I'm pretty sure I understand why that is, but he also basically admits that at least the lexing stage of XML reading is just looking for < and &. I guess the problem becomes that a UTF-16 document actually encodes that as two bytes? Well, surely the regex could be adapted to handle that.

That is one of the two key reasons, yes:

1. This first one is summarized by this comment from Aristotle Pagaltzis in Tim's RX article series: "The fastest way to do something in Perl is frequently the one that implements the most costly step in the fewest ops. You can substitute Ruby, Python or the like for Perl; the basic statement holds in any case. For string processing, it generally means doing as much work as possible with pattern matching. The more time you spend inside the VM’s implementation of its opcodes rather than inside the opcode loader/dispatcher, the faster the code will go." For a comparison, have a peek at CSV::parse_body(). It's CSV's primary parser and it has a lot of steps.

2. Method calls are expensive in Ruby. You can see that CSV is calling things all over the place. For example, if you call CSV::parse() the primary call chain is something like:

CSV::parse()
CSV::Reader::create()
CSV::IOReader::new() # or StringReader
CSV::Reader#each()
CSV::IOReader#get_row()
CSV::parse_row()
CSV::parse_body()

The same call chain for FasterCSV is:

FasterCSV::parse()
FasterCSV::new()
FasterCSV::each()
FasterCSV::shift()

The object construction doesn't much matter, because it's one-time cost stuff. But look at each() down in both examples. CSV is iterating over a three method call chain. FasterCSV is just iterating over one. That adds up.

There are many other little tricks to speed up FasterCSV. But those two easily bring us 90% of the distance.
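
The two points above can be illustrated with a small, deliberately simplified benchmark. This is not code from CSV or FasterCSV; both methods are hypothetical and ignore quoting entirely. It only contrasts many Ruby-level calls per character with one pattern-driven call per line.

```ruby
require 'benchmark'

# Character-at-a-time: one block invocation plus method calls per character.
def split_by_char(line)
  fields = [String.new]
  line.each_char do |ch|
    if ch == ','
      fields << String.new
    else
      fields.last << ch
    end
  end
  fields
end

# Pattern-driven: a single call per line; the scanning loop runs inside
# the interpreter's own implementation rather than in Ruby code.
def split_by_pattern(line)
  line.split(/,/, -1)
end

line = (1..50).map { |i| "field#{i}" }.join(',')

Benchmark.bm(10) do |bm|
  bm.report('per char') { 2_000.times { split_by_char(line) } }
  bm.report('pattern')  { 2_000.times { split_by_pattern(line) } }
end
```

Exact timings depend on the Ruby version and machine, so none are quoted here.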

Just to be clear, I'm not trying to attack the standard CSV library. It's pretty proven and has more users than FasterCSV does. ;) All of those calls make its interface more flexible and some prefer its design.

I'm just trying to share what I learned in my process of speeding it up.

James Edward Gray II

···

On Apr 4, 2007, at 7:53 PM, Keith Fahlgren wrote:

On 4/4/07, James Edward Gray II <james@grayproductions.net> wrote:

The series is an interesting read. Tim's pretty focused on the
character based parsing and in my experience that's always death in
Ruby. It's the primary reason the standard CSV library is so slow,
for example.

Is the inverse the reason that FasterCSV is so fast (because it uses
regular expressions)?

Apple is taking this road. They needed XML for some of their projects and REXML didn't meet their needs. They plan to bundle libxml, with Ruby bindings, in Leopard to get around this.

I think Matz tries to weigh changes to the standard library very carefully, as they can break a lot of code. It's a tough balance to strike, for sure.

James Edward Gray II

···

On Apr 4, 2007, at 8:50 PM, Devin Mullins wrote:

James Edward Gray II wrote:

The series is an interesting read. Tim's pretty focused on the character based parsing and in my experience that's always death in Ruby. It's the primary reason the standard CSV library is so slow, for example.

Pardon my naïveté, but why isn't the answer to make libxml, wrapped in a Rubyonic high-level API, part of stdlib? Stable, fast, and it wouldn't be the first Ruby extension in stdlib.

Is it legal for a well behaved XML processor to expose character data to an application in an encoding other than the actual document encoding? I didn't see anything in the specification to suggest it wasn't.

This is routine; character encoding shouldn't be the application's problem.

···

On 5-Apr-07, at 10:40 AM, James Edward Gray II wrote:

James Edward Gray II

----
Bob Hutchison -- tumblelog at <http://www.recursive.ca/so/>
Recursive Design Inc. -- <http://www.recursive.ca/>
xampl for Ruby -- <http://rubyforge.org/projects/xampl/>