Keith Fahlgren wrote:
libxml is a mature C library and quite fast, but is (by default)
DOM-based (as is REXML).
Sorry, I did not express myself clearly. I definitely need a DOM-based approach, but REXML is a lot slower than libxml, and libxml can be a PITA to install on some platforms/distros (e.g. it took quite some time on my Ubuntu box, because neither gem install nor apt-get would install the newest version, which I needed).
The catch is that I would like to use this in my web scraping framework, scRUBYt! - and of course a dependency on libxml would mean that everybody who wants to install scRUBYt! would have to install libxml too. I got tons of support requests from Ubuntu users who had problems installing mechanize (it depends on libssl-ruby there), so I guess this number would be much higher in the case of libxml, which has much more funky dependencies.
If there is no better possibility, I will go with libxml despite this (it is my only concern; otherwise libxml is fine) - but it would be better to have something install-friendly...
What sort of "real" XPaths do you need? XPath 1.0? 2.0?
Real in the sense that it is not Hpricot XPath, which ATM cannot even do
/my/stuff/is/@cool
not to mention more complex expressions.
I guess XPath 1.0 would be completely enough (maybe even Hpricot's, with a few additions) - I really don't need anything complicated.
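To make the attribute case concrete, here is a minimal sketch of that kind of expression evaluated with REXML, chosen only because it ships with Ruby and is already mentioned above. The sample document and the helper name `cool_attribute` are made up for illustration.

```ruby
require "rexml/document"

# Hypothetical helper: returns the value of /my/stuff/is/@cool, or nil if absent.
def cool_attribute(xml)
  doc = REXML::Document.new(xml)
  # For an attribute step, XPath.first yields an REXML::Attribute (or nil).
  attr = REXML::XPath.first(doc, "/my/stuff/is/@cool")
  attr && attr.value
end

# Made-up sample whose element names mirror the path above.
sample = '<my><stuff><is cool="yes">text</is></stuff></my>'
puts cool_attribute(sample)
```

This is plain XPath 1.0, so any parser with a conforming XPath engine should handle it the same way.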
Deep-lookahead/behind? Do you have huge source documents?
Well, I am actually building this document myself from what I have scraped, so I have a fair amount of control over it (if it is too big, I just stop and put the remaining records into a new doc, etc.), so this is not really the problem.
I really just need a fast XML parser which is easy to install, that's all. scRUBYt! is a high-level framework, aimed also at non-programmers, so I cannot expect all my potential users to be handy with Debian's package policy and the joys of installing libxml on win32.
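Roughly, the "stop and start a new doc" idea could look like the sketch below. It is not scRUBYt! code: the helper `build_documents`, the element names `records`/`record`, and the hash-shaped input are all assumptions for illustration, and REXML is used only because it needs no extra install.

```ruby
require "rexml/document"

# Hypothetical helper: builds one REXML document per batch of scraped records,
# so no single document holds more than max_per_doc records.
def build_documents(records, max_per_doc)
  records.each_slice(max_per_doc).map do |batch|
    doc = REXML::Document.new
    root = doc.add_element("records")
    batch.each do |fields|
      record = root.add_element("record")
      # Each key/value pair becomes a child element with text content.
      fields.each { |name, value| record.add_element(name.to_s).text = value.to_s }
    end
    doc
  end
end

# Made-up records standing in for scraped data.
docs = build_documents([{ title: "a" }, { title: "b" }, { title: "c" }], 2)
puts docs.size
```

With the size capped like this, lookahead depth is bounded by whatever limit is chosen per document.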
Cheers,
Peter
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby