BA wrote:
> Yes, I want to extract the PDAT element, however, I want to use the B110 tag to find this element. The XML *is* predictable, however, there are variations in the placement of the elements (there could be several different address fields and/or many paragraphs that need to be parsed/searched). The files are *extremely* large (some could be as large as 1-2GB).
Any time you're faced with a huge XML file and you only want small pieces of it, you should think about using a stream parser, a la SAX2 in the Java world. The idea behind a stream parser is that the parser runs through the entire file exactly once, and never has to seek forwards or backwards; it hands the data to you in whatever size chunks are most convenient for it. The parser therefore does as little work as possible and never holds the whole document in memory, so as long as your own code is efficient, the end result should be about as efficient as you can get.
The downside is that you have to do a bit more work. For example, depending on how the parser buffers things internally, it might send you a piece of text inside an XML element in two or more pieces, and expect you to glue them together. It's also up to you to deal with any position-based restrictions on which elements you're interested in.
I assumed you wanted the text inside any PDAT element that was *somewhere* inside a B110 element, so I simply track which elements are currently "open" and by how many levels. Here's the code.
---
require 'rexml/document'
require 'rexml/streamlistener'
require 'rexml/parsers/streamparser'

class MyListener
  # Supplies no-op defaults for the callbacks we don't define (comments,
  # the XML declaration, etc.), so real-world files don't raise NoMethodError
  include REXML::StreamListener

  def initialize
    # Hash to record which elements we are inside at any given moment, and
    # how many of them we are inside
    @inside = Hash.new
    @textbuffer = ''
  end

  def tag_start(name, attrs)
    if @inside[name]
      @inside[name] += 1
    else
      @inside[name] = 1
    end
  end

  def text(text)
    # The parser may deliver one run of text in several pieces; glue them
    if @inside['B110'] and @inside['PDAT']
      @textbuffer += text
    end
  end

  def tag_end(name)
    if name == 'PDAT'
      # Output the text if we just closed a PDAT inside a B110
      if @inside['B110'] and @inside['PDAT']
        puts @textbuffer
      end
      # Clear the buffer any time we close a PDAT
      @textbuffer = ''
    end
    # Decrement count, set to nil if zero
    # so @inside['foo'] works as a boolean
    if @inside[name] == 1
      @inside[name] = nil
    else
      @inside[name] -= 1
    end
  end
end

listener = MyListener.new
source = File.new("mydoc.xml")
REXML::Document.parse_stream(source, listener)
---
Here's a sample file:
---
<FOO>
<B110><B110>
<SOMETHINGELSE>
<PDAT>This is the text you want</PDAT>This isn't.
</SOMETHINGELSE>
</B110>
<PDAT>This is sneaky good text</PDAT>
</B110>
<PDAT>This is bad text</PDAT>
</FOO>
---
Output:
---
This is the text you want
This is sneaky good text
---
Note that this uses the native REXML stream parser API, not the SAX2 clone, because the SAX2 clone is slower according to the documentation.
Disclaimers:
The above code is only lightly tested. Although a stream parser should theoretically be the fastest option, I haven't actually benchmarked it against (say) the pull parser. (Which is also documented as having an unstable API, so personally I'd avoid it anyway.)
The above code will break if you have a PDAT somewhere inside a PDAT. I'm assuming that's not allowed. If it is, you'll have to make your text buffer be a stack of strings rather than a simple string, append to @textbuffer[@inside['PDAT']], and take the performance hit.
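For what it's worth, here is a rough sketch of that stack-of-strings idea. It is not the listener above, just one way it might be adapted: the class name NestedPdatListener and the helper nested_pdat_texts are made up for illustration, and it collects results in an array instead of printing them. I'm also assuming each PDAT should get only the text that sits directly at its own nesting level.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical variant of the listener that tolerates PDAT inside PDAT.
class NestedPdatListener
  include REXML::StreamListener  # no-op defaults for callbacks we ignore

  attr_reader :results

  def initialize
    @b110_depth = 0   # how many B110 elements are currently open
    @buffers = []     # one text buffer per currently open PDAT
    @results = []     # finished PDAT texts, in the order the PDATs closed
  end

  def tag_start(name, _attrs)
    @b110_depth += 1 if name == 'B110'
    @buffers.push(+'') if name == 'PDAT'
  end

  def text(text)
    # Only the innermost open PDAT receives the text
    @buffers.last << text if @b110_depth > 0 && !@buffers.empty?
  end

  def tag_end(name)
    case name
    when 'PDAT'
      finished = @buffers.pop
      @results << finished if @b110_depth > 0
    when 'B110'
      @b110_depth -= 1
    end
  end
end

# Hypothetical helper: parse a String or IO and return the collected texts.
def nested_pdat_texts(source)
  listener = NestedPdatListener.new
  REXML::Document.parse_stream(source, listener)
  listener.results
end

xml = '<B110><PDAT>outer<PDAT>inner</PDAT>tail</PDAT></B110>'
puts nested_pdat_texts(xml).inspect   # prints ["inner", "outertail"]
```

The inner PDAT closes first, so its text comes out first. The extra array push/pop per PDAT is the performance hit mentioned above.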
Also, if you need to do more elaborate selection of which elements to process, you'll obviously need to change how the current position is tracked. For example, implementing "process PDAT elements only if they are not buried more than 2 other elements deep inside a B110 element" is left as an exercise for the reader.
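If you want a head start on that exercise, one possible approach is to keep the full path of open element names rather than per-name counts. The sketch below is untested against real data and rests on assumptions of mine: "2 other elements deep" is read as "at most 2 elements between the nearest enclosing B110 and the PDAT", PDATs don't nest, and the names ShallowPdatListener, MAX_BETWEEN and shallow_pdat_texts are invented for the example.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: keep a PDAT only if it is shallow enough in a B110.
class ShallowPdatListener
  include REXML::StreamListener  # no-op defaults for callbacks we ignore

  MAX_BETWEEN = 2  # assumed limit on elements between the B110 and the PDAT

  attr_reader :results

  def initialize
    @path = []      # names of all currently open elements, outermost first
    @buffer = nil   # non-nil only while inside a PDAT we want to keep
    @results = []
  end

  def tag_start(name, _attrs)
    if name == 'PDAT'
      b110 = @path.rindex('B110')                   # nearest enclosing B110, or nil
      between = b110 && (@path.size - b110 - 1)     # elements separating the two
      @buffer = +'' if between && between <= MAX_BETWEEN
    end
    @path.push(name)
  end

  def text(text)
    @buffer << text if @buffer
  end

  def tag_end(name)
    @path.pop
    if name == 'PDAT' && @buffer
      @results << @buffer
      @buffer = nil
    end
  end
end

# Hypothetical helper: parse a String or IO and return the kept PDAT texts.
def shallow_pdat_texts(source)
  listener = ShallowPdatListener.new
  REXML::Document.parse_stream(source, listener)
  listener.results
end

xml = '<B110><A><B><PDAT>ok</PDAT><C><PDAT>deep</PDAT></C></B></A></B110>'
puts shallow_pdat_texts(xml).inspect   # prints ["ok"]
```

The path array costs a push and a pop per element, which is more bookkeeping than the counting hash, but it makes position rules like this one easy to express.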
> (started doing this by parsing the file line by line, however,
> ran into malformed XML where I decided that I needed to use the database functionality of XML.
If by "malformed XML" you mean syntactically invalid XML, such as unescaped < > characters, then you may be hosed, as REXML's parsers will likely choke on it.
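If you want to find out up front whether a file is going to make the parser choke, something like the following might do as a first check. It's only a sketch: NullListener and xml_parse_error are names I made up, and I've only reasoned through the mismatched-tag case shown here, where REXML raises REXML::ParseException. Other kinds of breakage may surface differently.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener that ignores every event; we only care about errors.
class NullListener
  include REXML::StreamListener
end

# Hypothetical helper: returns nil if the source parsed cleanly,
# or the parser's error message if REXML gave up on it.
def xml_parse_error(source)
  REXML::Document.parse_stream(source, NullListener.new)
  nil
rescue REXML::ParseException => e
  e.message
end

good = '<a><b></b></a>'
bad  = '<a><b></a>'      # closing tag does not match the open element

puts xml_parse_error(good).nil?   # prints true
puts xml_parse_error(bad).nil?    # prints false
```

Since this is still a single streaming pass, it should be usable on the big files too, at the cost of reading each file twice if you run it before the real extraction.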
mathew
--
<URL:http://www.pobox.com/~meta/>