XML Parsing the Ruby way

I’ve been looking at REXML, and I really like the architecture: a very
Ruby way to do standard XML parsing – it’s DOM- and SAX-alike – but
it’s still hard to use, as you have to be aware that you’re dealing with
XML. Since XML is really just a serialization format for tree-structured
(and partly graph-structured) data, why is there no library that treats
it as such?

I’m thinking that there has to be a much more Ruby way to deal with
tree-structured data – after all, Ruby objects could be seen as a tree
(or at least a graph, a tree being a subset of that) when in core, and
XML even has limited graph-representation support with IDREFs.

What I’d like to see is a library (I /am/ working on code for this) that
would allow one to tell the parser what classes represent what tags in
what namespaces, and how to map them, so that all one ever needs to do
is pretend the XML file is a bunch of Ruby objects.

Here’s something more code-like to illustrate my idea:

John Doe Mixer-blender 15.32

and

class Invoice
  attr_accessor :customer
  attr_accessor :items
end

class Item
  attr_accessor :description
  attr_accessor :price
end

# Ruby's numeric base class is Numeric; there is no built-in Number.
class Price < Numeric
  attr_accessor :currency
  attr_accessor :value
end

class Customer
  attr_accessor :name
end

and some sort of map of namespace-tag-class triples should instantiate
Invoice, instantiate Customer, instantiate a string as name within
customer, instantiate an Array of items, connect Customer to Invoice and
connect the Array to Invoice.

Any ideas on how to simply express such a map?
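
One possible shape for such a map, purely as a sketch: a Hash keyed by [namespace, tag] pairs, walked with REXML. The namespace URI, the TAG_MAP constant, the build method and the XML layout below are all assumptions made for illustration; the tag names are guessed from the class definitions above.

```ruby
require 'rexml/document'

# Plain classes; nothing here knows about XML.
class Customer; attr_accessor :name; end
class Item;     attr_accessor :description, :price; end
class Invoice;  attr_accessor :customer, :items; end

# Hypothetical map of [namespace URI, tag name] => class.
# The namespace URI is invented for this example.
NS = 'http://example.org/invoice'
TAG_MAP = {
  [NS, 'invoice']  => Invoice,
  [NS, 'customer'] => Customer,
  [NS, 'item']     => Item
}

# Mapped tags become instances of their class. Unmapped leaf tags
# become Strings and unmapped containers become Arrays.
def build(element)
  klass = TAG_MAP[[element.namespace, element.name]]
  children = element.elements.to_a
  if klass.nil?
    return element.text.to_s.strip if children.empty?
    return children.map { |child| build(child) }
  end
  object = klass.new
  # Each child is connected through the setter named after its tag.
  children.each do |child|
    object.public_send("#{child.name}=", build(child))
  end
  object
end

# Assumed document layout, not the original sample.
sample = <<~DOC
  <invoice xmlns="http://example.org/invoice">
    <customer><name>John Doe</name></customer>
    <items>
      <item><description>Mixer-blender</description><price>15.32</price></item>
    </items>
  </invoice>
DOC

invoice = build(REXML::Document.new(sample).root)
puts invoice.customer.name
puts invoice.items.first.description
```

Here a child tag with no matching setter raises NoMethodError, which is the strict end of the spectrum; a real library would want that to be configurable.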

I’m planning to implement the XML-specific stuff as a module that would
be mixed into objects as they are instantiated from the XML file, if it
had not been mixed in already. Basically, the class structure becomes a
schema. (It would be possible to tell the parser to be lax and either
ignore or connect nodes that were undefined in the map, or to throw an
exception.)
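
As a rough sketch of that idea, assuming a hypothetical XMLBacked module and a hypothetical Binder class (neither exists in any library I know of), the mix-in and the strict/lax switch could look something like this:

```ruby
# Hypothetical mix-in that carries the XML-specific details of an object.
module XMLBacked
  attr_accessor :xml_namespace, :xml_tag
end

class UnmappedTagError < StandardError; end

class Customer
  attr_accessor :name
end

# Hypothetical binder: looks up the class for a tag and mixes the
# XML-specific module into the new instance only.
class Binder
  def initialize(map, mode: :strict)
    @map = map
    @mode = mode
  end

  def instantiate(namespace, tag)
    klass = @map[[namespace, tag]]
    if klass.nil?
      raise UnmappedTagError, "no class mapped for #{tag}" if @mode == :strict
      return nil # lax mode: ignore tags that are not in the map
    end
    object = klass.new
    object.extend(XMLBacked) unless object.is_a?(XMLBacked)
    object.xml_namespace = namespace
    object.xml_tag = tag
    object
  end
end

map = { [nil, 'customer'] => Customer }
strict = Binder.new(map)
lax = Binder.new(map, mode: :lax)

customer = strict.instantiate(nil, 'customer')
puts customer.is_a?(XMLBacked)
puts Customer.include?(XMLBacked)
puts lax.instantiate(nil, 'bogus').inspect
```

Extending the instance rather than the class keeps Customer itself free of XML details, so objects created elsewhere in the program are untouched.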

I think it would also be possible to make a DOM module that would be a
mix-in so that standards-based parsing on arbitrary classes would be
possible.

I’d also love to plot this out so that the whole tree doesn’t have to be
available at once – DOM doesn’t actually require the whole thing to be
in core for most operations, so I know it’s possible. I’d like the
parser to be able to be SAX-alike, just spitting out objects serially,
and if they get garbage-collected, fine [if they have ID attributes,
though, I think I’d either mark how to get them back from the file in
another parse, or store the instantiated object in a hash]. If they
don’t get garbage-collected, then the API would automagically be more
DOM-alike, since all the objects would be in core.
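
A minimal sketch of the serial, SAX-alike mode, built on REXML's StreamListener. The ItemStream class, the by_id registry and the sample data (including the second item) are invented for illustration:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

class Item
  attr_accessor :id, :description, :price
end

# Hypothetical listener: hands each finished Item to a block and keeps
# only the ones carrying an id attribute in a hash.
class ItemStream
  include REXML::StreamListener

  attr_reader :by_id

  def initialize(&on_item)
    @on_item = on_item
    @by_id = {}
    @current = nil
    @field = nil
  end

  def tag_start(name, attrs)
    if name == 'item'
      @current = Item.new
      @current.id = attrs['id']
    elsif @current
      @field = name
    end
  end

  def text(data)
    return unless @current && @field
    setter = "#{@field}="
    @current.public_send(setter, data) if @current.respond_to?(setter)
  end

  def tag_end(name)
    if name == 'item'
      @by_id[@current.id] = @current if @current.id
      @on_item.call(@current)
      @current = nil
    end
    @field = nil
  end
end

xml = '<items>' \
      '<item id="a1"><description>Mixer-blender</description><price>15.32</price></item>' \
      '<item><description>Whisk</description><price>3.50</price></item>' \
      '</items>'

seen = []
stream = ItemStream.new { |item| seen << item.description }
REXML::Document.parse_stream(xml, stream)
puts seen.inspect
puts stream.by_id.keys.inspect
```

Items without an id are referenced only by whatever the block does with them, so they can be garbage-collected as soon as the caller lets go.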

This is probably already more words than the code to implement it would
be, but I think what I have in mind is clear… I’d love to hear what
other Rubyists’ take on the problem space is.

Ari

P.S. If I’m just a really lazy bum who should accept that dealing with
XML is hard, tell me so. I probably won’t believe you, though.

The difficulty in mapping objects to XML lies partly in XML’s
complexity: most languages do not let you assign your objects an
arbitrary ID, and there’s no sane way to distinguish attributes from
other sub-nodes.

The difficulty is particularly great in a non-validating parser: It
doesn’t know enough about the schema or even the DTD to map things
automagically at all, or even semi-automagically.

What I’d like to see happen is have a schema layer appear on top of
REXML or other parsers that hides most or all of the XML-ness of the
data from other objects. This helps from other sides as well, since
uninvolved objects should know nothing of the storage format or method
for encapsulation reasons.


On another note, I’d be perfectly happy if the library provided a few
“stub” classes for tags that are not otherwise mapped. If the classes
being mapped to accept any object as a sub-type, and the schema allows
it, it would be fine to see generic “XMLTag” classes or something
similar instantiated along with the mapped classes. It would certainly
make handling all the cases much easier.
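
A tiny sketch of that fallback, assuming a hypothetical XMLTag class and a hypothetical TAG_CLASSES table:

```ruby
# Hypothetical generic stand-in for any tag without a mapped class.
class XMLTag
  attr_reader :name, :attributes, :children

  def initialize(name, attributes = {})
    @name = name
    @attributes = attributes
    @children = []
  end
end

# A mapped class, standing in for something richer.
class Paragraph
  attr_accessor :text
end

TAG_CLASSES = { 'p' => Paragraph }

# Mapped tags get their own class; everything else falls back to XMLTag.
def instantiate(tag, attributes = {})
  klass = TAG_CLASSES.fetch(tag, XMLTag)
  klass == XMLTag ? XMLTag.new(tag, attributes) : klass.new
end

known = instantiate('p')
unknown = instantiate('blink', 'rate' => 'fast')
puts known.class
puts unknown.class
puts unknown.name
```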

Also, a library of classes that match standard DTD elements would be
useful: I can see an “HTML” module forming that handles the xhtml
namespace, with powerful classes for HTML DTD elements like paragraphs
and rich text, images and control structures. I could see similar for
Docbook, and perhaps the common elements from both descend from a more
generic “document” module of classes.

I’ve heard rumour of Java class libraries for manipulating XML-based
standards this way, with a class per tag (or so) – the problem is that
the library is static, so every time the standard changed, the library
had to be extended. Since Ruby is so much more dynamic, this should be
almost no problem – at the worst, just fall back to a “Tag” class and
implement a tiny set of features. At best, nobody notices and life goes
on.


On the note of a standard library, mapping various parts of RDF into the
library would be useful: RDF:Bag and RDF:List map to hashes and/or
arrays very nicely. RDF:about as an attribute could be added to classes
dynamically to give them a static ID of sorts. That avoids the
maintenance nightmare of document-specific ID attributes, which have to
be kept unique within one domain but can very easily become non-unique
once you consider more than one document.

I imagine a library like this could be extremely powerful when combined
with an object prevalence system; an entire website or even a more
specific service (XML-RPC or SOAP based, or entirely new) could be
stored in-core, but the library could export an XML-ish “view” of the
object-space “model”, and all that would have to be written is a set of
"controllers".

Writing an RSS feed would be terribly easy, assuming the site data was
already parsed into Ruby objects somehow: all one would have to do is
walk one’s existing data structures and feed appropriate data into an
RSS object. If the data was loaded from XML+RDF files, an Array might
already map to an RDF:List, so generating the RSS/RDF file gets even
that much easier /and/ more powerful.
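
To give a feel for how little code that walk needs, here is a sketch that builds an RSS 2.0-style skeleton with REXML. The Post struct, the rss_for method and the sample data are hypothetical, and the result is not a complete valid feed (no channel link or description):

```ruby
require 'rexml/document'

# Hypothetical site data, already living as Ruby objects.
Post = Struct.new(:title, :link)

# Walk the existing objects and feed their data into an XML tree.
def rss_for(site_title, posts)
  doc = REXML::Document.new
  channel = doc.add_element('rss', 'version' => '2.0').add_element('channel')
  channel.add_element('title').text = site_title
  posts.each do |post|
    item = channel.add_element('item')
    item.add_element('title').text = post.title
    item.add_element('link').text = post.link
  end
  doc
end

posts = [Post.new('First post', 'http://example.org/1')]
feed = rss_for('Example site', posts)
puts feed.to_s
```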

> I’ve been looking at REXML, and I really like the architecture: A very
> ruby way to do standard XML parsing – it’s DOM and SAX-alike – but
> it’s still hard to use as you have to be aware that you’re dealing with
> XML. Since all XML happens to be is a serialization format for a tree-
> and semi-graph structured data structure, why is there no library that
> treats it as such?
>
> I’m thinking that there has to be a much more Ruby way to deal with
> tree-structured data – after all, Ruby objects could be seen as a tree
> (or at least a graph, a tree being a subset of that) when in core, and
> XML even has limited graph-representation support with IDREFs

There was a discussion on xml-dev on the tree-ness of XML:

I tend to side with the “it’s only a tree if you want to see it that
way” view.

An XML infoset or an XML DOM implementation may be a tree, but an XML document
is a big string. Or an array of characters. XML itself is about markup syntax
and Unicode. You can view an XML instance as a bunch of lists, hashes, and
scalar values, and while mapping it to a tree is handy, mapping it to a sequence
of events is another option. Why not map the document to lists of lists, and do
something LISP-y?

> What I’d like to see is a library (I /am/ working on code for this) that
> would allow one to tell the parser what classes represent what tags in
> what namespaces, and how to map them, so that all one ever needs to do
> is pretend the XML file is a bunch of Ruby objects.
>
> and some sort of map of namespace-tag-class triples should instantiate
> Invoice, instantiate Customer, instantiate a string as name within
> customer, instantiate an Array of items, connect Customer to Invoice and
> connect the Array to Invoice.
>
> Any ideas on how to simply express such a map?

Not offhand, but I would have to believe the Java stuff on XML data-mapping
would have something useful.

By the way, what does this sort of mapping buy you that REXML + XPath
lacks? I’m guessing you don’t care for this sort of syntax:

puts doc["/invoice/customer/name"]

Are you looking to manipulate XML-derived data/objects without
prior knowledge of names and structures? Or perhaps something
along the lines of E4X (ECMAScript for XML)?
http://dev2dev.bea.com/articles/JSchneider_XML.jsp

e.g.

puts invoice.customer.name

> I’m planning to implement the XML-specific stuff as a module that would
> be mixed into instantiated classes when read from the XML file if it had
> not been already. Basically, the class structure becomes a schema (It
> would be possible to tell the parser to be lax and either ignore or
> connect nodes that were undefined in the map, or to throw an exception)
>
> I think it would also be possible to make a DOM module that would be a
> mix-in so that standards-based parsing on arbitrary classes would be
> possible.
>
> I’d also love to plot this out so that the whole tree doesn’t have to be
> available at once – DOM doesn’t actually require the whole thing to be
> in core for most operations, so I know it’s possible. I’d like the
> parser to be able to be SAX-alike, just spitting out objects serially,
> and if they get garbage-collected, fine [if they have ID attributes,
> though, I think I’d either mark how to get them back from the file in
> another parse, or store the instantiated object in a hash]. If they
> don’t get garbage-collected, then the API would automagically be more
> DOM-alike, since all the objects would be in core.

Have you looked at XML pull parsers?

> This is probably already more words than the code to implement it would
> be, but I think what I have in mind is clear… I’d love to hear what
> other rubyists take on the problem-space is.
>
> Ari
>
> P.S. If I’m just a really lazy bum who should accept that dealing with
> XML is hard, tell me so. I probably won’t believe you, though.

XML is easy. Certain programming languages and preconceptions make it hard.

James

Aredridel wrote:

> What I’d like to see is a library (I /am/ working on code for this) that
> would allow one to tell the parser what classes represent what tags in
> what namespaces, and how to map them, so that all one ever needs to do
> is pretend the XML file is a bunch of Ruby objects.

I’ve done exactly this for C++ in my XmlBind library. It relies
on you implementing the XmlDocumentTypeHandler interface for each
document type (and namespace, in a version we aren’t using yet).
This interface gets asked to make objects for a passed element
name, and the returned objects must support the XmlNode interface.
Parent nodes get passed completed child nodes, which they are
responsible for maintaining. That leaves a fair bit of flexibility
in how you create the nodes, and it works nicely.
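
A rough Ruby rendering of that design might look like the sketch below. It is not XmlBind: the Node, InvoiceTypeHandler and BindingListener classes and the make_node and add_child method names are hypothetical stand-ins for the interfaces described, driven here by REXML's stream parser.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical node protocol: a node accepts completed children.
class Node
  attr_reader :tag, :children, :text

  def initialize(tag)
    @tag = tag
    @children = []
    @text = String.new
  end

  def add_child(node)
    @children << node
  end
end

# Hypothetical document-type handler: asked to make an object for each
# element name. A real handler would return a different class per tag.
class InvoiceTypeHandler
  def make_node(name, _attrs)
    Node.new(name)
  end
end

# Drives the handler from a stream of parse events.
class BindingListener
  include REXML::StreamListener

  attr_reader :root

  def initialize(handler)
    @handler = handler
    @stack = []
  end

  def tag_start(name, attrs)
    @stack.push(@handler.make_node(name, attrs))
  end

  def text(data)
    @stack.last.text << data unless @stack.empty?
  end

  # A node is complete at its end tag, so only now is it given to its parent.
  def tag_end(_name)
    node = @stack.pop
    if @stack.empty?
      @root = node
    else
      @stack.last.add_child(node)
    end
  end
end

listener = BindingListener.new(InvoiceTypeHandler.new)
xml = '<invoice><customer><name>John Doe</name></customer></invoice>'
REXML::Document.parse_stream(xml, listener)
puts listener.root.children.first.children.first.text
```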

After I did it I found a research paper that talked about someone
doing something pretty similar, which also used the term “bind” -
that might help you find it.

Unfortunately I can’t publish my code, it doesn’t belong to me.
But I’m sure you can work out the details, and will be glad to know
it can be done :-).

Clifford Heath, ManageSoft.

> There was a discussion on xml-dev on the tree-ness of XML:

> I tend to side with the “it’s only a tree if you want to see it that
> way” view.

I agree, though I choose to see it as a graph (with a tree being a
commonly found subset of that). I see a collection of documents as a
larger graph, too, with URI references joining them.

You can see it as a string, though for most manipulation, that makes it
harder to work with, not easier. Same for a list of lists (which is
ugly if you care about the attribute-versus-node distinction, because we
get a greater-than-1-dimensional tree, though not by much, or we get
ugly NQXML-alike document.child.node.child.note.child.attr[n] syntax,
instead of invoice.customer.name or invoice.items[n].quantity).

> An XML infoset or an XML DOM implementation may be a tree, but an XML document
> is a big string. Or an array of characters. XML itself is about markup syntax
> and Unicode. You can view an XML instance as a bunch of lists, hashes, and
> scalar values, and while mapping it to a tree is handy, mapping it to a sequence
> of events is another option. Why not map the document to lists of lists, and do
> something LISP-y?

Because that’s moving backward, not forward. A list of lists could be
called a tree, but then, handling even the simplest IDREF->ID case would
be a monstrous task – something I’d like to see written once in the
general case, then just used.

An XML document is more than just a string – it’s highly structured.
XML itself is about a meta-data-model, not just its serialization. I’m
interested in mapping the XML data-model into Ruby’s internal data
model: classes. I don’t want another generic XML parser; I want a
system for mapping DTDs/schemas to Ruby.

>> and some sort of map of namespace-tag-class triples should instantiate
>> Invoice, instantiate Customer, instantiate a string as name within
>> customer, instantiate an Array of items, connect Customer to Invoice and
>> connect the Array to Invoice.
>>
>> Any ideas on how to simply express such a map?
>
> Not offhand, but I would have to believe the Java stuff on XML data-mapping
> would have something useful.

Not really – it was statically defined for a single DTD. The map was
hard-coded into the second-tier parser (that took SAX output and
converted that into other classes). I want something dynamic, so one
can supply a DTD, a mapping, a class library and a document, and get
instances of Ruby objects from the class library.

> By the way, what does this sort of mapping buy you that REXML + XPath
> lacks? I’m guessing you don’t care for this sort of syntax:
>
> puts doc["/invoice/customer/name"]

It’s not bad, but the objects involved are decided by the parser, not
the programmer. The result of that is a String, but the result of
/invoice/customer is not a kind_of Customer, nor is /invoice/items an
Array.
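
To make the distinction concrete, here is a small sketch using REXML's XPath support. The XML layout is an assumption for illustration; the point is only which classes come back.

```ruby
require 'rexml/document'

class Customer
  attr_accessor :name
end

# Assumed document layout.
doc = REXML::Document.new(
  '<invoice><customer><name>John Doe</name></customer></invoice>'
)

name_node = REXML::XPath.first(doc, '/invoice/customer/name')
customer_node = REXML::XPath.first(doc, '/invoice/customer')

puts name_node.text
puts customer_node.class
puts customer_node.is_a?(Customer)

# The step a binding library would take over: turning the generic
# element into an instance of the programmer's own class.
customer = Customer.new
customer.name = name_node.text
puts customer.name
```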

> Are you looking to manipulate XML-derived data/objects without
> prior knowledge of names and structures? Or perhaps something
> along the lines of E4X (ECMAScript for XML)?
> http://dev2dev.bea.com/articles/JSchneider_XML.jsp
>
> e.g.
>
> puts invoice.customer.name

That would be great, though having it be schema-aware and very rubyish
is what I want. In this case, I’d want customer to be an instance of a
Customer class – I don’t want a generic “node” class. I’d love to have
the result that one works with be completely XML-independent – that way
it could be serialized with YAML or Marshal just as easily.
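
As a small illustration of that independence, plain objects with no XML knowledge already round-trip through Marshal and dump to YAML using only the standard library. The sample data is made up.

```ruby
require 'yaml'

class Customer
  attr_accessor :name
end

class Invoice
  attr_accessor :customer, :items
end

invoice = Invoice.new
invoice.customer = Customer.new
invoice.customer.name = 'John Doe'
invoice.items = []

# Marshal: binary round trip that yields a deep copy.
copy = Marshal.load(Marshal.dump(invoice))
puts copy.customer.name

# YAML: text form of the same object graph.
yaml_text = YAML.dump(invoice)
puts yaml_text
```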

I’m tackling several problems with this idea – one is an abstract
interface to serialization, and inside of that, I’d like to have the
XML-based serialization be schema-aware, or at least DTD-aware, and as
transparent as possible.

> Have you looked at XML pull parsers?

Yes – I like what I see, but I want more random-access. That could be
in a future implementation, though, since it can be faked by just
pulling the whole thing into core.

>> P.S. If I’m just a really lazy bum who should accept that dealing with
>> XML is hard, tell me so. I probably won’t believe you, though.
>
> XML is easy. Certain programming languages and preconceptions make it hard.

That’s what I think, too. XML may be a clumsy beast in some ways, but
that part’s already been coded for (Thanks, REXML!). The data model is
actually somewhat elegant, once you take it all into account.

Ari


On Sat, 2003-04-26 at 21:28, james_b@neurogami.com wrote: