SAX-based XPath

(Gary Shea) #1

I'm posting this in hope of getting some API suggestions.

I'm building a native stream-based Ruby XPath processor (or whatever it
would be called) in order to parse some gigabyte-scale XML files at
work. It will accept multiple XPath expressions and output events (SAX
for now) matching the union of the XPath expressions.

It currently only works with absolute, non-wildcarded, predicate-less
default-axis XPath expressions:

filter = XmlFilter::XPathFilter.new
filter.listener = XmlFilter::RecordingListener.new
filter.xpath = '/a/b/c'

parser = REXML::Parsers::SAX2Parser.new(File.open('some_file_path.xml'))
parser.listen(filter)
parser.parse

This interface needs to be extended a little to work with multiple XPath
expressions, maybe:

filter.xpath = ['/a/b/c', '/d/e/f']
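
For illustration only, here is a minimal sketch of how a setter like this could accept either a single expression or an array, and how absolute, wildcard-free paths can be matched by tracking the stack of open elements. The class name PathTracker and its methods are hypothetical and are not part of XmlFilter; the callbacks are driven by hand here rather than by a real SAX parser.

```ruby
# Hypothetical sketch, not the XmlFilter API: tracks the path of open
# elements and reports which absolute, wildcard-free expressions equal it.
class PathTracker
  attr_reader :xpaths

  def initialize
    @xpaths = []
    @stack = []
  end

  # Accepts a single String or an Array of Strings.
  def xpath=(value)
    @xpaths = Array(value)
  end

  # Call when an element opens; returns the expressions matching the new path.
  def start_element(name)
    @stack.push(name)
    current = '/' + @stack.join('/')
    @xpaths.select { |xp| xp == current }
  end

  # Call when an element closes.
  def end_element(_name)
    @stack.pop
  end
end

tracker = PathTracker.new
tracker.xpath = ['/a/b/c', '/d/e/f']
# Simulate the start events for <a><b><c>.
hits = %w[a b c].map { |name| tracker.start_element(name) }
```

Using Array() in the setter means the existing single-string form (filter.xpath = '/a/b/c') would keep working alongside the array form.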

Any suggestions for a more Ruby-esque way to do it?

    Gary

(Robert) #2

Gary Shea wrote:

I'm posting this in hope of getting some API suggestions.

I'm building a native stream-based Ruby XPath processor (or whatever
it would be called) in order to parse some gigabyte-scale XML files at
work. It will accept multiple XPath expressions and output events
(SAX for now) matching the union of the XPath expressions.

It currently only works with absolute, non-wildcarded, predicate-less
default-axis XPath expressions:

filter = XmlFilter::XPathFilter.new
filter.listener = XmlFilter::RecordingListener.new
filter.xpath = '/a/b/c'

parser = REXML::Parsers::SAX2Parser.new(File.open('some_file_path.xml'))
parser.listen(filter)
parser.parse

This interface needs to be extended a little to work with multiple
XPath expressions, maybe:

filter.xpath = ['/a/b/c', '/d/e/f']

Any suggestions for a more Ruby-esque way to do it?

It seems you could simplify the interface a bit (or add a method) along
the lines of REXML so you can do

File.open('some_file_path.xml') do |io|
  XmlFilter::XPathFilter.parse(io, '/a/b/c', '/d/e/f') do |event, filter|
    # process event
  end
end

I'm unsure about the "filter" block parameter, but it might be useful to
know which filter criterion matched. What do you think?
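
To make the idea concrete, here is a rough sketch of a block-style entry point that takes the expressions as a splat and yields each matching event together with the expression it matched. SketchFilter is a made-up name, and the events array is only a stand-in for a real SAX stream (an assumption, so the example runs without a parser).

```ruby
# Hypothetical sketch of a block-style entry point; not the XmlFilter API.
# `events` stands in for a SAX stream: pairs like [:start, 'a'] or [:end, 'a'].
module SketchFilter
  def self.parse(events, *xpaths)
    stack = []
    events.each do |type, name|
      stack.push(name) if type == :start
      path = '/' + stack.join('/')
      # Yield the event together with the expression that matched it.
      xpaths.each { |xp| yield([type, name], xp) if xp == path }
      stack.pop if type == :end
    end
  end
end

events = [[:start, 'a'], [:start, 'b'], [:start, 'c'],
          [:end, 'c'], [:end, 'b'], [:end, 'a']]
seen = []
SketchFilter.parse(events, '/a/b/c', '/d/e/f') do |event, xpath|
  seen << [event, xpath]
end
```

Yielding the matched expression as the second block parameter lets the caller dispatch on it when several expressions are active at once.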

Kind regards

    robert