REXML in C


(Radu M. Obadã) #1

Hi,
In the README file, Sean says that REXML is ‘reasonably fast’, so I
guess it’s not as fast as it could be. So I was wondering: since C is so
portable, and creating Ruby extensions in C is most of the time pretty
trivial, why isn’t REXML implemented (at least parts of it) that way? I
guess one of the reasons Sean would invoke would be that we’d lose
portability: is that so? It seems that many standard Ruby extensions are
built that way, so why isn’t REXML, too? At least, we could fork a new
project and try to implement speed-critical parts of the code in C; I assume
that would bring some serious performance improvements. Anyway, in a
month or so I’ll have a lot of spare time, so with Sean’s help, I’ll be
more than happy to try a C implementation of this great parser, while
keeping the excellent API the same. What do you say?
Best wishes,
Radu



(Tobi Reif) #2

Radu M. Obadã wrote:

why isn’t REXML implemented (at least parts of it) that way?

I enjoy poking around the Ruby code, tweaking stuff while trying to find
fixes, and to be able to read the Ruby code while subclassing REXML
classes; eg reading what super calls etc.

An alternative C implementation, as a 100% drop-in replacement, would have
to keep up with Ruby REXML, feature- and bug-wise, or there will be
incompatibilities.

Tobi

···


http://www.pinkjuice.com/


#3

“Radu M. Obadã” whizkid@xnet.ro writes:

Hi,
In the README file, Sean says that REXML is ‘reasonably fast’, so I
guess it’s not as fast as it could be. So I was wondering: since C is so
portable, and creating Ruby extensions in C is most of the time pretty
trivial, why isn’t REXML implemented (at least parts of it) that way? I
guess one of the reasons Sean would invoke would be that we’d lose
portability: is that so? It seems that many standard Ruby extensions are
built that way, so why isn’t REXML, too? At least, we could fork a new
project and try to implement speed-critical parts of the code in C; I assume
that would bring some serious performance improvements. Anyway, in a
month or so I’ll have a lot of spare time, so with Sean’s help, I’ll be
more than happy to try a C implementation of this great parser, while
keeping the excellent API the same. What do you say?
Best wishes,
Radu

I have to deal with really large (300,000 line) XML files, and REXML,
well, just doesn’t work for me. I really hate that since I really
like REXML. My idea, which I sort-of-started-but-haven’t-
had-time-to-really-work-on-yet, was to back the REXML API with
libxml2. It has a fast parser and a fast XPath implementation, so I
think it could be very nice. I’m very interested in helping any
project that makes REXML faster.

Steve


(SER) #4

Sean Russell wrote:

Incidentally, if REXML’s SAX parser doesn’t support namespaces – and it
may not; I threw the SAX support together pretty quickly – then it will
shortly. Like, by the 2.4.0 release of REXML, which should be within a
week.

Just checked; REXML’s SAX parser does generate namespace events after all.
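
For anyone who wants to check this themselves, here is a minimal sketch using the `REXML::Parsers::SAX2Parser` that ships with current REXML. The behaviour of the 2.x releases discussed in this thread may differ, and the sample document and the `urn:example:inventory` URI are invented for illustration.

```ruby
require 'rexml/parsers/sax2parser'

# Hypothetical sample document; the namespace URI is made up.
xml = '<inv:stock xmlns:inv="urn:example:inventory"><inv:item id="1"/></inv:stock>'

parser = REXML::Parsers::SAX2Parser.new(xml)

mappings = []
elements = []

# Called for each namespace declaration found on a start tag.
parser.listen(:start_prefix_mapping) { |prefix, uri| mappings << [prefix, uri] }

# Called for every start tag with the resolved namespace URI,
# the local name, the qualified name and the attribute hash.
parser.listen(:start_element) do |uri, localname, qname, _attributes|
  elements << [uri, localname, qname]
end

parser.parse

mappings.each { |prefix, uri| puts "prefix #{prefix} -> #{uri}" }
elements.each { |uri, local, qname| puts "#{qname}: local=#{local} uri=#{uri}" }
```

Listening with blocks keeps the example short; a listener object could be registered instead if several events need to share state.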

···

… POLITICALLY CORRECT: A phrase used to limit people’s freedom of
<|> thought cf. Moral Majority.
/|\
/|


(Avi Bryant) #5

Sean Russell ser@germane-software.com wrote in message news:3d1fe947@news.mhogaming.com

That said, I’d really like a Ruby-to-native compiler, because interpreted
languages are great for RAD, but tend to suck (relatively) on execution.
Even Java, which is pretty fast, still gets nailed on the speed thing. Try
Xalan vs. xsltproc sometime. Having native binaries is really handy.

Sigh. There is a common misconception that compiling to native code
will result in some kind of magical speed increase. It doesn’t work
that way; it’s not the interpreter that makes Ruby (or Java) slower
than C, it’s the semantics. That is, it’s precisely the things you
like about Ruby - dynamic method dispatch, dynamic typing, blocks,
runtime code definition, garbage collection, bounds checking, etc, etc -
that make it slower than well tuned C code. Translating Ruby into C
or ASM isn’t going to help; conversely, a C program that used the same
techniques as Ruby programs would run just as slow. There is a
fundamental trade-off there that you’re not going to sidestep simply
by throwing GCC into the mix. Yes, the Ruby interpreter could be made
faster, but there’s no reason to think a native code compiler is the
best way to do that; personally, I’d much rather see a bytecode VM.
In fact, on modern machines a JITted bytecode interpreter can
outperform compiled code (by modifying code on the fly to take
advantage of branch prediction). Thus, a good Java or Smalltalk VM
can have faster method dispatch than C++.
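
A small Ruby illustration of the point about semantics (the `Greeter` class is hypothetical, used only for this sketch): because a method can be replaced while the program is running, a call site cannot simply be hard-wired to one target ahead of time, whether the code is interpreted or compiled.

```ruby
# Hypothetical class used only to illustrate runtime method redefinition.
class Greeter
  def greet
    "hello"
  end
end

g = Greeter.new
before = g.greet

# Reopen the class at runtime and replace the method. Nothing guarantees
# that the definition visible at compile time is the one in effect when
# the call actually happens, so the lookup has to stay dynamic.
Greeter.class_eval do
  def greet
    "goodbye"
  end
end

after = g.greet
puts before
puts after
```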

To illustrate my point, here are some benchmarks on the GCJ project,
which provides exactly what you’re proposing - a GCC frontend - for
Java instead of Ruby. Not surprisingly, the best Java VMs almost
always perform better than the native code:
http://www.shudo.net/jit/perf/

Compared to Ruby, Java is a fairly static language; I’m pretty sure
you wouldn’t get results even that good with a hypothetical GCR
compiler. If you want to see the kind of speed Ruby can (somewhat)
realistically hope to obtain, the current state of the art for an
equivalently dynamic language is probably Cincom’s VisualWorks
Smalltalk - which, by the way, doesn’t use native code. You could
also look at David Simmons’ SmallScript, but it’s currently
Windows-only.

Cheers,
Avi


(Aidan) #6

slumos@unlv.edu wrote:

extensions are built that way, so why isn’t REXML, too? At least, we
could fork a new project and try to implement speed critical parts of
code in C, I assume that would bring some serious performance
improvements.

I have to deal with really large (300,000 line) XML files, and REXML,
well, just doesn’t work for me.

Before anyone gets too carried away with the idea …
we shouldn’t overlook the fact that Ruby already has C based XML parsers -
Yoshida Masato’s expat based “XMLParser” package, and libgdome-ruby
for example. Before I rushed off to -translate- a rapidly changing
REXML package into C, I would check if these existing parsers showed
the kind of performance advantages you are both hoping for.

Compiling to C is not always “go faster juice”, as anyone who has used
Java compilers can testify. There are some benchmarks on the XMLRuby
site, if I remember, which showed that REXML is sometimes faster than
naive implementations.

Aidan


(Bryan Murphy) #7

I have to deal with really large (300,000 line) XML files, and REXML,
well, just doesn’t work for me. I really hate that since I really
like REXML. My idea, which I sort-of-started-but-haven’t-
had-time-to-really-work-on-yet, was to back the REXML API with
libxml2. It has a fast parser and a fast XPath implementation, so I
think it could be very nice. I’m very interested in helping any
project that makes REXML faster.

Steve

In all honesty, I really don’t think you should be processing 300,000
line XML files with any DOM-like XML interface. There is a LOT of
overhead and wasted memory in the DOM tree, and for an XML file that big
it must be huge (not to even consider the time it takes to parse the XML
into the DOM tree). Every time you access that information you’re
scanning the contents of the ENTIRE XML document in memory. With a
document that size this is bad bad bad!

What I think you really should be using is some sort of streaming parser
or pull based parser when you are dealing with documents of this
magnitude. You can build parsers that are considerably faster (orders
of magnitude) and load data into much more compact (and applicable) data
structures in memory.

Yes, I know that writing a state based streaming parser is a bit harder
than doing the same with REXML, but when you are dealing with this
magnitude of data the tradeoffs are worth it in the long run imho (and
building a good state based parser is a fun learning experience if
you’ve never done it before)!
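
To make the suggestion concrete, here is a minimal sketch of a state-based pull parser, assuming the `REXML::Parsers::PullParser` found in current REXML releases. The `<orders>` document and its element names are invented; the point is only that two small state variables replace the whole tree, and the result ends up in a compact hash.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical input; a large file could be passed as File.new(path) instead.
xml = <<~XML
  <orders>
    <order id="1"><total>10.50</total></order>
    <order id="2"><total>4.25</total></order>
  </orders>
XML

parser = REXML::Parsers::PullParser.new(xml)

totals = {}        # compact result: order id => total
current_id = nil   # state: id of the <order> we are inside, if any
in_total = false   # state: whether we are inside a <total> element

while parser.has_next?
  event = parser.pull
  if event.start_element?
    case event[0]
    when 'order' then current_id = event[1]['id']
    when 'total' then in_total = true
    end
  elsif event.text?
    # Whitespace between elements arrives as text too; ignore it
    # unless we are inside <total>.
    totals[current_id] = event[0].to_f if in_total
  elsif event.end_element?
    case event[0]
    when 'total' then in_total = false
    when 'order' then current_id = nil
    end
  end
end

p totals
```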

Bryan


(Sean Chittenden) #8

I have to deal with really large (300,000 line) XML files, and
REXML, well, just doesn’t work for me. I really hate that since I
really like REXML. My idea, which I sort-of-started-but-haven’t-
had-time-to-really-work-on-yet, was to back the REXML API with
libxml2. It has a fast parser and a fast XPath implementation, so I
think it could be very nice. I’m very interested in helping any
project that makes REXML faster.

How about projects that will have a REXML like interface but are
faster? I was hoping to be further along before I tossed out an FYI
about it, but I’m writing a wrapper around libxml2. It’s crude at the
moment, assumes a working knowledge of the libxml API, and doesn’t
have any niceties, but it will eventually support most/all of the
Ruby-ish API fun that REXML has made famous. At the same time though,
the big chore is writing code to wrap around the libxml API so
functionality is plentiful. Schemas, name spaces, DOM, SAX, HTTP/FTP
client, etc. It’s all there, just needs a ruby API.

http://www.rubynet.org/modules/xml/ruby-libxml/

If anyone’s interested in helping with libxml stuff I’ll setup a
mailing list and will give them CVS access to help hack. If anyone’s
interested in this but would need guidance, I’ll help whoever has an
interest. -sc

···


Sean Chittenden


(Dave Thomas) #9

avi@beta4.com (Avi Bryant) writes:

You could also look at David Simmons’ SmallScript, but it’s
currently Windows-only.

I wonder what it would take to add Ruby to the list of languages David
is planning to integrate into AOS?

"Additional secondary languages that are supported or for which
their is planned work include: Python, PHP, Basic, JScript,
Scheme, C++, C, and assembly." (http://www.smallscript.net/#AOS)

(Tobi Reif) #10

Bryan Murphy wrote:

I have to deal with really large (300,000 line) XML files, and REXML,
well, just doesn’t work for me. I really hate that since I really
like REXML. […]
Steve

In all honesty, I really don’t think you should be processing 300,000
line XML files with any DOM-like XML interface.

True.

But REXML != tree API only; it also offers a stream API:
http://www.germane-software.com/software/rexml_doc/
http://www.germane-software.com/software/rexml_doc/classes/REXML/StreamListener.html
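
A minimal sketch of that stream API as it exists in current REXML (the `TagCounter` class and the sample document are made up for illustration): the listener receives callbacks as the parser walks the input, and no tree is built.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener that counts elements by name without building a tree.
class TagCounter
  include REXML::StreamListener

  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Called by the parser for every start tag.
  def tag_start(name, _attrs)
    @counts[name] += 1
  end
end

xml = '<log><entry/><entry/><note>done</note></log>'
counter = TagCounter.new
REXML::Document.parse_stream(xml, counter)
p counter.counts
```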

Tobi

···


http://www.pinkjuice.com/


#11

[ugh, sent the last message before it was finished. sorry.]

Bryan Murphy bryan@terralab.com writes:

In all honesty, I really don’t think you should be processing 300,000
line XML files with any DOM-like XML interface. There is a LOT of
overhead and wasted memory in the DOM tree, and for an XML file that big
it must be huge (not to even consider the time it takes to parse the XML
into the DOM tree). Every time you access that information you’re
scanning the contents of the ENTIRE XML document in memory. With a
document that size this is bad bad bad!

It seems like the claim that I should have to give up the nice
interface just because the problem gets large is fundamentally flawed
somehow. The acceptance of such rules of thumb is surely one of the
reasons why the XML world sucks so much.

The files are only 13MB or so, which isn’t even large by new PC
standards.

What I think you really should be using is some sort of streaming parser
or pull based parser when you are dealing with documents of this
magnitude. You can build parsers that are considerably faster (orders
of magnitude) and load data into much more compact (and applicable) data
structures in memory.

Yes, I know that writing a state based streaming parser is a bit harder
than doing the same with REXML, but when you are dealing with this
magnitude of data the tradeoffs are worth it in the long run imho (and
building a good state based parser is a fun learning experience if
you’ve never done it before)!

Bryan

Easy now. What I said was that I thought a REXML API backed by
libxml2 might be nice. It’s not as if my project is waiting for it,
or I would have written it by now. Since the project couldn’t wait, I
just implemented the parts where an XPath interface was convenient
using Perl and libxml2 instead. I tried REXMLBuilder (uses
XMLParser), and the parsing is fast, but XPath is still too slow.

To answer another post, I’ll probably look at the Ruby libgdome
interface next, but as to whether I’m expecting to be happy with it,
I’ll just quote Sean Russell:

“The extant XML APIs, in general, suck. They take a markup language
which was specifically designed to be very simple, elegant, and
powerful, and wrap an obnoxious, bloated, and large API around it.”

Steve


(Rich Kilmer) #12

Well, we talked to him about this exact thing at last year’s Ruby
conference, and his initial need was a Ruby lexer/parser written in
Ruby…Robert Feldt was working toward this (and I was helping him by
writing unit tests) but that project seems to have stalled some time
ago.

David really seemed to like Ruby, and I think it would be a major win to
have the wonderful syntax and semantics of Ruby on a powerful VM like
AOS.

-Rich

···

-----Original Message-----
From: dave@thomases.com [mailto:dave@thomases.com] On Behalf Of Dave
Thomas
Sent: Tuesday, July 02, 2002 7:48 PM
To: ruby-talk ML
Subject: Re: REXML in C

avi@beta4.com (Avi Bryant) writes:

You could also look at David Simmons’ SmallScript, but it’s
currently Windows-only.

I wonder what it would take to add Ruby to the list of languages David
is planning to integrate into AOS?

"Additional secondary languages that are supported or for which
their is planned work include: Python, PHP, Basic, JScript,
Scheme, C++, C, and assembly." (http://www.smallscript.net/#AOS)

(Bryan Murphy) #13

In all honesty, I really don’t think you should be processing 300,000
line XML files with any DOM-like XML interface.

True.

But REXML != tree API only; it also offers a stream API:
http://www.germane-software.com/software/rexml_doc/
http://www.germane-software.com/software/rexml_doc/classes/REXML/StreamListener.html

Tobi

That my dear sir, is a very good point! However REXML’s stream API
doesn’t handle Namespaces (at least not as far as I can tell). A Ruby
implementation of SAX2 is available here:

http://www.rubycolor.org/arc/rbsax-0.7.0pre0.tar.gz

And I highly recommend it. It even has very nice and convenient
interfaces for connecting to the other Ruby XML libraries! I didn’t
create it, so I can’t say anything about its development status, but in
my work with it I’ve experienced no major problems.

Bryan


(James) #14

It seems like the claim that I should have to give up the nice
interface just because the problem gets large is fundamentally flawed
somehow. The acceptance of such rules of thumb is surely one of the
reasons why the XML world sucks so much.

The XML world may or may not suck, but this rule of thumb has nothing to do
with XML per se. It’s a general observation that loading a large object
into memory before performing any work on it may not be an optimum choice.
For example, if you wanted to do a simple search and replace on a 13MB text
file, would it be reasonable to load it into an in-memory structure and
then call gsub? Or might it be better to stream the file and work on small
subsets as they pass?
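
The two options in that question can be sketched side by side. This is only an illustration: the file contents are invented and far smaller than 13MB, and the line-by-line version assumes the pattern never spans a line break.

```ruby
require 'tempfile'

# Build a small stand-in for the large file (contents are invented).
input = Tempfile.new('big')
input.write("foo bar\nbaz foo\n" * 1000)
input.close

# Whole-file approach: the entire contents are held in one String.
whole = File.read(input.path).gsub('foo', 'qux')

# Streaming approach: only one line is held in memory at a time.
output = Tempfile.new('out')
File.foreach(input.path) do |line|
  output.write(line.gsub('foo', 'qux'))
end
output.close

streamed = File.read(output.path)
puts whole == streamed
```

Both produce the same result here; the difference is in how much of the file has to be resident at once.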

The files are only 13MB or so, which isn’t even large by new PC
standards.

If you consider that an acceptable size, then there’s really no problem.
Keep in mind that the file sizes may change, and not all APIs are designed
to scale for all sizes. An in-memory structure API requires a trade-off
between immediate access to all parts of an object versus the cost of
holding that object in memory. The value of that tradeoff will vary as the
problem size varies.

James


(Clifford Heath) #15

slumos@unlv.edu wrote:

"The extant XML APIs, in general, suck.

I decided that also, when we built our C++ infrastructure for XML, and
designed our interface (XmlBind) instead. You parse the file against
a document type handler, which gets to build whatever object you want
for an element, as long as it supports the XmlNodeI interface, or
XmlDocumentI for the document as a whole. You can decide to create no
object for an element, and the subtree vanishes. You can play various
other games, but basically it’s a customisable DOM API with good
support for either loading objects or scanning elements, with any mix
equally well supported. Heaps more useful than either SAX or DOM.
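
XmlBind itself is not shown in this thread, so the following is only a rough Ruby sketch of the idea as described, built on REXML's stream listener. The `Binder` and `Node` names, the `bind` method and the `add_child` protocol are all hypothetical stand-ins, not the XmlNodeI or XmlDocumentI interfaces. A factory decides which object to build for an element, and returning nil makes that element's subtree vanish.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical generic node; any object responding to add_child would do.
Node = Struct.new(:name, :attrs, :children) do
  def add_child(child)
    children << child
  end
end

# Hypothetical binder: maps element names to factories. A factory that
# returns nil drops the element together with everything below it.
class Binder
  include REXML::StreamListener

  attr_reader :root

  def initialize
    @factories = {}
    @stack = []
    @skip_depth = 0
  end

  def bind(name, &factory)
    @factories[name] = factory
  end

  def tag_start(name, attrs)
    if @skip_depth > 0
      @skip_depth += 1   # still inside a pruned subtree
      return
    end
    # Elements without a registered factory become generic nodes.
    factory = @factories.fetch(name) { ->(n, a) { Node.new(n, a, []) } }
    object = factory.call(name, attrs)
    if object.nil?
      @skip_depth = 1    # prune this element and its descendants
      return
    end
    @stack.last.add_child(object) if @stack.last
    @root ||= object
    @stack.push(object)
  end

  def tag_end(_name)
    if @skip_depth > 0
      @skip_depth -= 1
    else
      @stack.pop
    end
  end
end

xml = '<doc><keep id="1"><child/></keep><debug><noise/></debug></doc>'
binder = Binder.new
binder.bind('debug') { |_name, _attrs| nil }   # drop <debug> subtrees
REXML::Document.parse_stream(xml, binder)

p binder.root.children.map(&:name)
```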

Just thought that might inspire someone to do the same for Ruby…
I’m not entirely happy with the lack of special support for namespaces,
but they could be addressed in the same sort of way.

···


Clifford Heath


(Tobi Reif) #16

slumos@unlv.edu wrote:

It seems like the claim that I should have to give up the nice
interface just because the problem gets large is fundamentally flawed
somehow. The acceptance of such rules of thumb is surely one of the
reasons why the XML world sucks so much.

Err; if you want to be able to randomly manipulate a tree, it will have
to be stored in memory. If that’s getting slow, then get a faster
machine or more memory, or go with streaming or pull. I can’t see how
this requirement to balance tradeoffs is in any way specific to XML.

The files are only 13MB or so, which isn’t even large by new PC
standards.

Then your hardware sure is capable of handling it. If not, find a good
compromise between hardware, flexibility, convenience, and speed. Again;
this is to be faced with every other notational system.

Tobi

···


http://www.pinkjuice.com/


(Tobi Reif) #17

Clifford Heath wrote:

I decided that also, when we built our C++ infrastructure for XML, and
designed our interface (XmlBind) instead. […] Heaps more useful than either SAX or DOM.

Perhaps you want to release it?

Tobi

···


http://www.pinkjuice.com/


#18

Tobias Reif tobiasreif@pinkjuice.com writes:

It seems like the claim that I should have to give up the nice
interface just because the problem gets large is fundamentally flawed
somehow. The acceptance of such rules of thumb is surely one of the
reasons why the XML world sucks so much.

Err; if you want to be able to randomly manipulate a tree, it will have
to be stored in memory. If that’s getting slow, then get a faster
machine or more memory, or go with streaming or pull. I can’t see how
this requirement to balance tradeoffs is in any way specific to XML.

What seems specific to XML is that any time you mention you have a
large problem you get an immediate “oh, you should use a stream parser
and go away”. Doesn’t that seem a little too HIN to anyone else? As
in I could have 10GB machines with XML parsers in silicon and somebody
would still say that? (Hey, for all anyone knows, I do!)

I tend to think that it’s what I have to do with the data that
decides which API I want to use.

The files are only 13MB or so, which isn’t even large by new PC
standards.

Then your hardware sure is capable of handling it. If not, find a good
compromise between hardware, flexibility, convenience, and speed. Again;
this is to be faced with every other notational system.

Tobi

Let me see if I can make my position clear:

  1. REXML is good.
  2. REXML is not “fast enough”[1] for really big files.
  3. Therefore I wasn’t able to use REXML recently (as in I already
    used something else instead) and that sucks, because
  4. I want to use REXML or something like it for XML, period (see 1).
  5. Therefore I need a faster REXML.
  6. I already know that libxml2 is “fast enough”[1] because I ended up
    using Perl/libxml2 in the end. (This also sucks via corollary 1:
    Ruby is good.)
  7. For my purposes, a fast parser isn’t enough, I need fast XPath
    (that’s why I couldn’t just use REXMLBuilder).
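
Point 7 separates parse time from XPath time. A minimal sketch of how the two could be measured independently with REXML and the standard `Benchmark` module follows; the generated `<items>` document is invented and much smaller than the files discussed, so the timings only show the method, not the scale of the problem.

```ruby
require 'rexml/document'
require 'benchmark'

# Invented test data: 2,000 <item> elements, a tenth of them marked "open".
items = (1..2000).map do |i|
  status = (i % 10).zero? ? 'open' : 'closed'
  %(<item id="#{i}" status="#{status}"/>)
end
xml = "<items>#{items.join}</items>"

# Time the tree construction on its own.
doc = nil
parse_time = Benchmark.realtime { doc = REXML::Document.new(xml) }

# Time the XPath query against the already-built tree.
open_items = nil
xpath_time = Benchmark.realtime do
  open_items = REXML::XPath.match(doc, "/items/item[@status='open']")
end

puts "parse: #{parse_time.round(3)}s, xpath: #{xpath_time.round(3)}s"
puts "matched #{open_items.size} items"
```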

Maybe whatever I said didn’t come across the way I meant it to. My
impression (or imagination) was:

Radu: I’m thinking about rewriting parts of REXML in C to make it
faster, anyone like that idea?

Me: Hey, that’s great because I wasn’t able to use REXML recently for
this project which requires these really big files. And
actually I’ve been thinking about backing REXML with libxml2 to
make it faster too. And by the way, I’m really interested in
helping out.

Bryan: (meaning well) You shouldn’t be trying to do that anyway,
doctrine says you should use a stream parser.

Me (to self): oh yes, another stream parser, then instead of adding
rules to my system by adding a method and using introspection to
automatically find it, as I was planning, I can add a rule by
spreading the code out through a half dozen cases in a switch
instead. :)
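
The "add a method and let introspection find it" style mentioned above can be sketched in a few lines of Ruby. The `RuleSet` class and the `rule_` naming convention are hypothetical, not taken from the actual project.

```ruby
# Hypothetical rule engine: any instance method whose name starts with
# "rule_" is treated as a rule and discovered by introspection.
class RuleSet
  def rule_positive(n)
    "positive" if n > 0
  end

  def rule_even(n)
    "even" if n.even?
  end

  # Run every rule method against the input and keep the non-nil results.
  def apply(n)
    rules = self.class.instance_methods.grep(/\Arule_/).sort
    rules.map { |name| public_send(name, n) }.compact
  end
end

p RuleSet.new.apply(4)
```

Adding a rule then means adding one more `rule_` method; nothing in `apply` has to change.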

Steve

···



[1] Since I’m mostly a research programmer, the ability to reimplement
major parts of a system quickly is much more important than whether
the result takes 5 hours instead of 1 hour to run. As long as it
doesn’t take 1 day instead of 1 hour.


(Clifford Heath) #19

Tobias Reif wrote:

Perhaps you want to release it?

I’d love to and asked permission 3 months ago (along with a whole
lot of other C++ infrastructure I built), but no-one has had time
to consider my request. I have the nod, but not the paper :-).
I guess I could safely build a version in Ruby on top of REXML
if I could find the time. It’s a pretty small layer on top of
expat as it stands.

···


Clifford Heath


(Tobi Reif) #20

slumos@unlv.edu wrote:

What seems specific to XML is that any time you mention you have a
large problem you get an immediate “oh, you should use a stream parser
and go away”.

Just think about the complex yourself.

I won’t engage in a long OT thread with endless posts.

You can use XML, and learn about its advantages and limitations, and
have realistic expectations; Fast XPath on 13 megs may not be one of
them; not even a “REXML in C” will do that.

It works great for me.

Or, if you simply don’t like XML, choose to not use it. It definitely is
not capable of magic.

Tobi

Doesn’t that seem a little too HIN to anyone else?

did you mean

Health Information Network,
Health-Info-Net,
High Intensity, or
Hull Identification Number?

···


http://www.pinkjuice.com/