Rexml

pdg · 6 November 2006 19:40

Hi All,

As a first exercise with Ruby, I am going through the Pickaxe book and
creating a jukebox. I haven't even tried to create an array of songs
yet, because I got distracted and wanted to work this out. I am trying
to feed in the data from my iTunes XML file. I can get it to work if I
delete most of the XML file, but when it's 5-6 gig, REXML just seems to
die. I have vaguely heard that stream parsing may be the answer, but I
am totally unaware of how to use it.

Here is the code in my XML reading program so far (sample.rb basically
just creates song items):

require 'rexml/document'
require_relative 'sample'  # sample.rb defines Song

# The block form closes the file once parsing is done.
xml = File.open("iTunes.xml") { |f| REXML::Document.new(f) }
name = "name"
artist = "artist"
time = 60
cnt = 0
xml.elements.each("//key") do |k|
  case k.text
  when "Name"
    name = k.next_sibling.text
    cnt += 1
  when "Artist"
    artist = k.next_sibling.text
  when "Total Time"
    # Total Time is stored in milliseconds; convert to seconds.
    time = k.next_sibling.text.to_i / 1000.0
    song = Song.new(name, artist, time)
    puts song.to_s
  end
end
puts cnt

David_Vallner · 6 November 2006 23:03

pdg wrote:

Hi All,

As a first exercise with Ruby, I am going through the Pickaxe book and
creating a jukebox. I haven't even tried to create an array of songs
yet, because I got distracted and wanted to work this out. I am trying
toi feed in the data from my iTunes xml file to it to get the data, I
can get it to work if I delete most of the xml file, but when it's 5-6
gig,

OMFG. That's a -huge- XML file. Probably all of my MP3s together would
fit into there with base64-encoded contents

rexml just seems to die. I have vaguely heard that stream parsing
may be the answer, but am totally unaware of how to use it.

Well, time to learn. I have probably never even seen a computer that could
handle an XML file that size using straightforward DOM parsing - which
normally "blows up" the original XML document's size in bytes fivefold
or more. And REXML definitely doesn't have performance of any kind
amongst its qualities. (And for completeness' sake, I never 'clicked'
with the API either, but I'm a minority there.)

You want a Ruby binding to a stream or pull parser - to my best
knowledge, REXML is neither. That means libxml2, expat, or Xerces.
Compiling Required - I think the one-click installer comes with one of
these, buggered if I know which.

After that, Google is your friend. Look at the documentation to
whichever parser you decided to use and use that - personally, I don't
do much / no non-tree XML parsing at all, so I'm mainly guessing around
on this. The main difference is that while with REXML, you can
arbitrarily look around the XML document, with stream and pull parsing,
you can only process the document in order, and have to keep the state
of that processing (e.g. which track you're currently "working on") in
your Ruby code.

David Vallner

Mark_T · 7 November 2006 01:28

Best to lean towards a database approach when you get to large files.
Neat thing working with XML & REX.
Then you can go to SleepyCat DBxml.
Though the routines are different, that's fer sure.
Someone has a neat Ruby lib for it out there.
Away from my machines for details.

Markt

···

On 11/7/06, pdg <pgattphoto@gmail.com> wrote:


Aaron_Patterson2 · 6 November 2006 23:11

pdg wrote:
> Hi All,
>
> As a first exercise with Ruby, I am going through the Pickaxe book and
> creating a jukebox. I haven't even tried to create an array of songs
> yet, because I got distracted and wanted to work this out. I am trying
> toi feed in the data from my iTunes xml file to it to get the data, I
> can get it to work if I delete most of the xml file, but when it's 5-6
> gig,

OMFG. That's a -huge- XML file. Probably all of my MP3s together would
fit into there with base64-encoded contents

> rexml just seems to die. I have vaguely heard that stream parsing
> may be the answer, but am totally unaware of how to use it.
>

Well, time to learn. I probably never even saw a computer that could
handle a XML file that size using straightforward DOM parsing - which
normally "blows up" the original XML document's size in bytes five times
and more. And REXML definitely doesn't have performance of any kind
amongst its qualities. (And for completeness' sake, I never 'clicked'
with the API either, but I'm a minority there.)

You want a Ruby binding to a stream or pull parser - to my best
knowledge, REXML is neither. That means libxml2, expat, or Xerces.
Compiling Required - I think the one-click installer comes with one of
these, buggered if I know which.

Ruby comes with a pull parser in the standard lib:
http://ruby-doc.org/stdlib/libdoc/rexml/rdoc/classes/REXML/Parsers/PullParser.html

I would give it a try on a document that large.
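
For what it's worth, a minimal sketch of the pull loop might look like the following. It assumes the layout from the original post (a <key> element followed by its value element); the inline sample data and the track_names helper are made up for illustration and are not part of REXML.

```ruby
require 'rexml/parsers/pullparser'

# Made-up stand-in for a real iTunes library file.
SAMPLE_XML = <<~XML
  <plist>
    <dict>
      <key>Name</key><string>Song A</string>
      <key>Artist</key><string>Band A</string>
      <key>Name</key><string>Song B</string>
    </dict>
  </plist>
XML

# Hypothetical helper: returns the value following every <key>Name</key>.
def track_names(source)
  parser = REXML::Parsers::PullParser.new(source)
  names = []
  current = nil    # element we are currently inside, if any
  last_key = nil   # text of the most recent <key>

  while parser.has_next?
    event = parser.pull
    if event.start_element?
      current = event[0]          # tag name
    elsif event.end_element?
      current = nil
    elsif event.text?
      text = event[1]             # text with entities resolved
      if current == 'key'
        last_key = text
      elsif current == 'string' && last_key == 'Name'
        names << text
      end
    end
  end
  names
end

puts track_names(SAMPLE_XML).inspect
```

With a real library you would pass File.open("iTunes.xml") instead of the sample string. Note that the state (current, last_key) lives in your own code, since the parser only hands you one event at a time.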

···

On Tue, Nov 07, 2006 at 08:03:40AM +0900, David Vallner wrote:


--
Aaron Patterson
http://tenderlovemaking.com/

Jeff_Wood1 · 6 November 2006 23:12

David Vallner wrote:

You want a Ruby binding to a stream or pull parser - to my best
knowledge, REXML is neither. That means libxml2, expat, or Xerces.
Compiling Required - I think the one-click installer comes with one of
these, buggered if I know which.

Actually, I recently had to rewrite an XML parser to go stream (SAX) style ... REXML made the task VERY easy ...

Yes, it's not the fastest thing there is, but it was "fast enough" ...

Definitely try writing it with REXML before taking the route of anything heavier.

jd

Chilkat_Software · 6 November 2006 23:20

Is that a mistake? Out of curiosity I took a look on my wife's computer
(she's the iPod user) and her XML file was only 231KB. The structure
of it conforms to the code you shared, so I know it's the right file...

Did you mean to say MB instead of GB?

-Matt

···

At 05:03 PM 11/6/2006, you wrote:


James_Edward_Gray_II · 6 November 2006 23:27

I probably never even saw a computer that could
handle a XML file that size using straightforward DOM parsing

This is off-topic but I have a theory that it's possible using a variant of the Flyweight pattern with index offsets into the document and reparsing individual tags on demand. (I would use weak referencing to cache them after a parse.)

I've been meaning to code up a proof of concept here and just haven't had time yet...

You want a Ruby binding to a stream or pull parser - to my best
knowledge, REXML is neither.

REXML includes a stream parser.
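
A rough sketch of what using it could look like, assuming the same <key>/value layout as the original post. The TrackCounter class and the sample data are invented for illustration; only REXML::StreamListener and REXML::Document.parse_stream come from REXML itself.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: counts tracks by counting <key>Name</key> entries.
class TrackCounter
  include REXML::StreamListener
  attr_reader :count

  def initialize
    @count = 0
    @in_key = false
  end

  # Called for every opening tag.
  def tag_start(name, _attrs)
    @in_key = (name == 'key')
  end

  # Called for every closing tag.
  def tag_end(_name)
    @in_key = false
  end

  # Called for every run of text between tags.
  def text(data)
    @count += 1 if @in_key && data == 'Name'
  end
end

# Made-up sample; with a real library you would pass File.open("iTunes.xml").
sample = '<dict><key>Name</key><string>A</string>' \
         '<key>Artist</key><string>X</string>' \
         '<key>Name</key><string>B</string></dict>'

counter = TrackCounter.new
REXML::Document.parse_stream(sample, counter)
puts counter.count
```

The difference from the pull style is who drives: here the parser calls your methods as it reads, so no tree is ever built in memory.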

James Edward Gray II

···

On Nov 6, 2006, at 5:03 PM, David Vallner wrote:

David_Vallner · 7 November 2006 01:49

Mark T wrote:

Best to lean towards a database approach when you get to large files.
Neat thing working with XML & REX.
Then you can go to SleepyCat DBxml.
Though the routines are different, that's fer sure.
Someone has a neat Ruby lib for it out there.
Away from my machines for details.

Markt

He's not the one creating the file. So unless you can persuade Apple to
use an XML DB to store iTunes playlists...

(PS: The whole concept of XML DBs is an abomination. The XML Infoset
concept looks like a bloated cloudfest compared to relational data
storage...)

David Vallner

Skotty · 6 November 2006 23:16

I wish I had the foggiest idea of what you guys were talking about.
(Roobist here)
I'm still working on Y's book.

···

On Tue, 2006-11-07 at 08:12 +0900, Jeff Wood wrote:


--
You have a new sung; unsung.
I sing a song falling upon deaf ears,
unsung.

skt
(shyguyfrenzy@gmail.com)

Mark_Van_Holstyn1 · 6 November 2006 23:28

If you want speed, look at libxml-ruby. It is many many times faster than
REXML, and it supports SAX parsing as well.

Mark

···

On 11/6/06, Chilkat Software <support@chilkatsoft.com> wrote:


--
Mark Van Holstyn
mvanholstyn@gmail.com
http://lotswholetime.com

David_Vallner · 7 November 2006 00:08

James Edward Gray II wrote:

REXML includes a stream parser.

So it does, my bad.

David Vallner

pdg · 7 November 2006 02:10

Assuming I go with the Ruby pull parser, how do I use it in my code?
I see the code sample from the link, but I have no idea how to throw
that into my code and make it work. Any suggestions?

Thanks for the discussion so far.

PS: idiot (slaps head). Yes it was 5-6meg not gig!

David_Vallner · 7 November 2006 00:00

skt wrote:

I wish I had the foggiest idea of what you guys were talking about.
(Roobist here)
I'm still working on Y's book.

Wait... You chimed in on an unrelated thread with a "I don't understand
any of this, FYI" comment?!

The mind, it boggles.

For the record, this isn't a general chat channel. As such, derailing
threads is to be done more subtly

David Vallner

David_Vallner · 7 November 2006 02:24

pdg wrote:

Assuming I go with the Ruby pull parser, how do I use this in my code.
I see from the link the code sample, but I have no idea how to throw
that into my code and make it work. Any suggestions.

Generally, you should have some layer between the XML input and the
code that processes the records themselves, e.g. a trivial Song class,
or at least a hash. Personally, I'd make an XMLSongList class that's
enumerable (implements #each), and rework the REXML code that works for
small files into one that yields a Song object for each of the records
in succession by querying the tree accordingly.

It then shouldn't be too hard to rework that so that while #each is
running, it opens a pull parser and, for each yield, builds up a Song
object by going through the record in the order in which the elements
appear in the XML file, instead of an arbitrary one. Once you isolate
the code that manipulates the XML to the smallest significant unit (a
song record in this case, I presume), it shouldn't be conceptually that
difficult to rework from a tree parser to a pull parser. The code will
probably get a little messier and more verbose, but the main shift in
thinking is in not asking the XML for what your object needs, but
feeding an object what the XML has.
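
To make that a bit more concrete, here is a minimal sketch of such a class. Assumptions: Song is a stand-in Struct (the real one in sample.rb isn't shown), the XML uses the <key>/value layout from the original post, and a record counts as a song when its closing </dict> is reached with a Name collected. A real library may also contain playlist records with a Name key, which this sketch does not filter out.

```ruby
require 'rexml/parsers/pullparser'

# Stand-in for the Song class from sample.rb (its real definition isn't shown).
Song = Struct.new(:name, :artist, :duration)

# Hypothetical enumerable wrapper around a pull parser.
class XMLSongList
  include Enumerable

  def initialize(xml)
    @xml = xml   # XML text; a fresh parser is created on every #each
  end

  def each
    return enum_for(:each) unless block_given?

    parser = REXML::Parsers::PullParser.new(@xml)
    fields = {}      # key => value pairs of the record being built
    current = nil    # name of the element we are inside
    last_key = nil   # most recent <key> text

    while parser.has_next?
      event = parser.pull
      if event.start_element?
        current = event[0]
      elsif event.text?
        if current == 'key'
          last_key = event[1]
        elsif %w[string integer].include?(current)
          fields[last_key] = event[1]
        end
      elsif event.end_element?
        current = nil
        # A closing </dict> with a Name collected ends one song record.
        if event[0] == 'dict' && fields.key?('Name')
          yield Song.new(fields['Name'], fields['Artist'],
                         fields['Total Time'].to_i / 1000.0)
          fields = {}
        end
      end
    end
  end
end

# Made-up sample data in the assumed layout.
xml = <<~XML
  <plist><dict><key>Tracks</key><dict>
    <key>1</key><dict>
      <key>Name</key><string>Song A</string>
      <key>Artist</key><string>Band A</string>
      <key>Total Time</key><integer>215000</integer>
    </dict>
    <key>2</key><dict>
      <key>Name</key><string>Song B</string>
      <key>Total Time</key><integer>60000</integer>
    </dict>
  </dict></dict></plist>
XML

XMLSongList.new(xml).each { |song| puts song.to_a.inspect }
```

Because the class includes Enumerable, the rest of the program can use map, select and friends without knowing whether a tree parser or a pull parser sits underneath.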

PS: idiot (slaps head). Yes it was 5-6meg not gig!

6MB is still Huge (tm) for an XML file.

Chilkat_Software · 7 November 2006 03:15

I created a sample program to parse the iTunes XML using
Chilkat XML here:
http://www.example-code.com/ruby/ruby-parse-itunes-xml.asp

Unfortunately, it only runs on Windows. (Sorry!) It is freeware however.

Here's the example source. I suspect you won't have memory problems with it.
If you try it, please let me know how fast it runs and whether it uses
too much memory...

require 'chilkat'

# The Chilkat XML parser for Ruby is freeware.

xml = Chilkat::CkXml.new()
xml.LoadXmlFile("c:/temp/itunes.xml")

# Search for this node: <key>Tracks</key>
tracksKey = xml.SearchForContent(xml,"key","Tracks")

# Assuming it's found, the <dict> node is the next sibling
dict = tracksKey.NextSibling()

# Loop over the <dict> child nodes...
n = dict.NumChildrenHavingTag("dict")
for i in 0..(n-1)
  trackRec = dict.GetNthChildWithTag("dict",i)
  print "Name: " + trackRec.GetChildExact("key","Name").NextSibling().content + "\n"
  print "Artist: " + trackRec.GetChildExact("key","Artist").NextSibling().content + "\n"
  print "Time: " + trackRec.GetChildExact("key","Total Time").NextSibling().content + "\n"
end

-Matt

···

At 08:10 PM 11/6/2006, you wrote:


pdg · 7 November 2006 20:45

Hi David (or others)

I am still not sure I get it, could you explain a little more?

Thanks,
Paul.

David Vallner wrote:

···


Chilkat_Software · 7 November 2006 21:33

I tested the Chilkat XML parser (an in-memory DOM) on a 21MB XML file
that looks like this:

<phonebook>
<address><company>yuy25uiFfaku</company><street>A7ZbA3jP48rp</street><city>fSgWAhn3i3lD</city><state>p3rfNqf6kzUq</state><postal_code>lqVZ0b4daYWQ</postal_code><country>VjfXvb0AdxSt</country><extra>TEST</extra></address>
<address><company>Ki78Ypx8FlbZ</company><street>340PK6u2DsZQ</street><city>EqbFawBo0mCi</city><state>fTZK5YT0Tur8</state><postal_code>EXP29c5Hi2Hj</postal_code><country>sfGB4EzWR3Ft</country><extra>TEST</extra></address>
...
</phonebook>

(the data is random garbage...)

The XML is parsed in 11.5 seconds on a 1.8 GHz Pentium 4. Peak memory usage is 180MB.
I don't think the parser would break a sweat on the 6MB file...

I uploaded the XML test data to: http://www.example-code.com/downloads/bigXml.zip
The code for parsing the iTunes XML is easy: http://www.example-code.com/ruby/ruby-parse-itunes-xml.asp

-Matt

···

At 02:45 PM 11/7/2006, you wrote:


pdg · 8 November 2006 02:25

Hi thanks for the link, it seems to be working much better, but...

It's getting to about the 1000th file and doing its job, but then
returning the following error:

undefined method 'NextSibling' for nil:NilClass (NoMethodError) from
rtunes.rb (which is basically just your sample code).

Does this mean I have a broken xml file? Or is something else the
matter?

Chilkat Software wrote:

···


Chilkat_Software · 8 November 2006 03:25

If you want, send me your test code + file and I'll have a look...

(tomorrow morning though)

Don't worry about a large attachment, just send it zipped...

Best Regards,
Matt

···

At 08:25 PM 11/7/2006, you wrote:

Hi thanks for the link, it seems to be working much better, but...

It's getting to about the 1000th file and doing its job, but then
returning the following error:

undefined method 'NextSibling' for nil:NilClass (noMethodError) from
rtunes rb (which is basically just your sample code).

Does this mean I have a broken xml file? Or is something else the
matter?

Chilkat Software wrote:
> I tested the Chilkat XML parser (an in-memory DOM) on a 21MB XML file
> that looks like this:
>
> <phonebook>
> <address><company>yuy25uiFfaku</company><street>A7ZbA3jP48rp</street><city>fSgWAhn3i3lD</city><state>p3rfNqf6kzUq</state><postal_code>lqVZ0b4daYWQ</postal_code><country>VjfXvb0AdxSt</country><extra>TEST</extra></address>
> <address><company>Ki78Ypx8FlbZ</company><street>340PK6u2DsZQ</street><city>EqbFawBo0mCi</city><state>fTZK5YT0Tur8</state><postal_code>EXP29c5Hi2Hj</postal_code><country>sfGB4EzWR3Ft</country><extra>TEST</extra></address>
> ..
> </phonebook>
>
> (the data is random garbage...)
>
> The XML is parsed in 11.5 seconds on a 18.Ghz Pentium 4. Peak memory
> usage is 180MB.
> I don't think the parser would break a sweat on the 6MB file...
>
> I uploaded the XML test data to:
> http://www.example-code.com/downloads/bigXml.zip
> The code for parsing the iTunes XML is easy:
> http://www.example-code.com/ruby/ruby-parse-itunes-xml.asp
>
> -Matt
>
> At 02:45 PM 11/7/2006, you wrote:
>
> >Hi David (or others)
> >
> >I am still not sure I get it, could you explain a little more?
> >
> >Thanks,
> >Paul.
> >
> >David Vallner wrote:
> > > pdg wrote:
> > > > Assuming I go with the Ruby pull parser, how do I use this in my code.
> > > > I see from the link the code sample, but I have no idea how to throw
> > > > that into my code and make it work. Any suggestions.
> > > >
> > >
> > > Generally, you should have some layer between XML input, and processing
> > > the records themselves. E.g. a trivial Song class, or at least a hash.
> > > Personally, I'd make a XMLSongList class that's enumerable (implements
> > > #each), and rework the REXML code that works for small files into one
> > > that yields a Song object for each of the records in succession by
> > > querying the tree accordingly.
> > >
> > > That shouldn't then be too hard to rework so that while #each is
> > > running, it opens a pull parser, and for each yield, builds up a Song
> > > object going through the record in the order how the elements appear in
> > > the XML file, instead of a random one. Once you isolate the code that
> > > manipulates the XML to the smallest significant unit (a song record in
> > > this case, I presume), it shouldn't be conceptually that difficult to
> > > rework from a tree parser to a pull parser. The code probably will get a
> > > little messier and verbose, but the main shift of thinking is in not
> > > asking the XML for what your object needs, but feeding an object what
> > > the XML has.
> > >
> > > > PS: idiot (slaps head). Yes it was 5-6meg not gig!
> > > >
> > >
> > > 6MB is still Huge (tm) for a XML file.
> > >
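
For what it's worth, David's suggestion above (an enumerable XMLSongList that feeds each record to a Song as the pull parser walks the file) might be sketched roughly like this. It is only an illustration under assumptions that are not from the thread: the Song struct is a stand-in for the class in sample.rb, and the code assumes the usual iTunes plist layout in which each track is a <dict> nested three levels deep (plist dict, then the "Tracks" dict, then one dict per track). It uses REXML's own pull parser, REXML::Parsers::PullParser.

```ruby
require 'rexml/parsers/pullparser'

# Stand-in for the Song class from sample.rb (an assumption, not the original).
Song = Struct.new(:name, :artist, :duration)

# Hypothetical class, named after the suggestion in the thread.
class XMLSongList
  include Enumerable

  TRACK_DEPTH = 3 # assumed nesting level of per-track <dict> elements

  # source may be an open File (or other IO) or a String of XML.
  def initialize(source)
    @source = source
  end

  # Yields one Song per track record, reading the XML as a stream of events.
  def each
    return enum_for(:each) unless block_given?

    parser = REXML::Parsers::PullParser.new(@source)
    depth  = 0    # how many <dict> elements are currently open
    record = nil  # fields of the track being read
    key    = nil  # most recent <key> text inside the track
    text   = nil  # text collected for the element currently open

    while parser.has_next?
      event = parser.pull
      if event.start_element?
        if event[0] == 'dict'
          depth += 1
          record = {} if depth == TRACK_DEPTH
        end
        text = +''
      elsif event.text?
        text << event[1] if text # event[1] is the text with entities decoded
      elsif event.end_element?
        case event[0]
        when 'key'
          key = text if depth == TRACK_DEPTH
        when 'dict'
          if depth == TRACK_DEPTH && record && record.key?('Name')
            yield Song.new(record['Name'], record['Artist'],
                           record['Total Time'].to_i / 1000.0)
          end
          record = nil if depth == TRACK_DEPTH
          depth -= 1
        else
          # Any other closing element is treated as the value for the last key.
          if depth == TRACK_DEPTH && key
            record[key] = text
            key = nil
          end
        end
        text = nil
      end
    end
  end
end

# Usage (file name taken from the original post):
#   File.open('iTunes.xml') do |f|
#     XMLSongList.new(f).each { |song| puts song.name }
#   end
```

Because only one record is held at a time, memory use should stay roughly constant however large the library is. Records without an Artist key come through with a nil artist, so the caller decides what to do with them.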

pdg · 8 November 2006 03:30

Figured it out.

It was running into problems with videos and podcasts, as they don't
have an artist tag.
I got round it by checking each item for a child tag titled "Movie"
or "Podcast", and not running the print statements for those items.
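
A rough illustration of that kind of check, using REXML rather than the parser from the linked sample. The helper names (track_fields, music_track?, music_tracks) are made up for this sketch, and the sample layout is simplified to a root <dict> that directly contains one <dict> per item, which is an assumption and not the full iTunes structure.

```ruby
require 'rexml/document'

# Hypothetical helper: collect one item <dict> into a Hash of key => value text.
def track_fields(dict)
  fields = {}
  dict.elements.each('key') do |k|
    value = k.next_element # nil when nothing follows the key
    fields[k.text] = value ? value.text.to_s : nil
  end
  fields
end

# An item counts as music if it is not flagged as a movie or podcast
# and actually has an artist.
def music_track?(fields)
  !fields.key?('Movie') && !fields.key?('Podcast') && !fields['Artist'].nil?
end

# Returns the field hashes of the music items in a simplified library string.
def music_tracks(xml)
  doc = REXML::Document.new(xml)
  doc.root.elements.to_a('dict')
     .map { |d| track_fields(d) }
     .select { |f| music_track?(f) }
end
```

Checking for a nil value before calling a method on it is also what avoids the "undefined method ... for nil:NilClass" error on items that lack the expected tag.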

pdg wrote:

···

Hi thanks for the link, it seems to be working much better, but...

It's getting to about the 1000th file and doing its job, but then
returning the following error:

undefined method 'NextSibling' for nil:NilClass (NoMethodError) from
rtunes.rb (which is basically just your sample code).

Does this mean I have a broken xml file? Or is something else the
matter?

Chilkat Software wrote:
> I tested the Chilkat XML parser (an in-memory DOM) on a 21MB XML file
> that looks like this:
>
> <phonebook>
> <address><company>yuy25uiFfaku</company><street>A7ZbA3jP48rp</street><city>fSgWAhn3i3lD</city><state>p3rfNqf6kzUq</state><postal_code>lqVZ0b4daYWQ</postal_code><country>VjfXvb0AdxSt</country><extra>TEST</extra></address>
> <address><company>Ki78Ypx8FlbZ</company><street>340PK6u2DsZQ</street><city>EqbFawBo0mCi</city><state>fTZK5YT0Tur8</state><postal_code>EXP29c5Hi2Hj</postal_code><country>sfGB4EzWR3Ft</country><extra>TEST</extra></address>
> ..
> </phonebook>
>
> (the data is random garbage...)
>
> The XML is parsed in 11.5 seconds on a 1.8 GHz Pentium 4. Peak memory
> usage is 180MB.
> I don't think the parser would break a sweat on the 6MB file...
>
> I uploaded the XML test data to:
> http://www.example-code.com/downloads/bigXml.zip
> The code for parsing the iTunes XML is easy:
> http://www.example-code.com/ruby/ruby-parse-itunes-xml.asp
>
> -Matt
> > >
> > >
> > > --------------enigF9D6B5236ACE2603700BC85A
> > > Content-Type: application/pgp-signature
> > > Content-Disposition: inline;
> > > filename="signature.asc"
> > > Content-Description: OpenPGP digital signature
> > > X-Google-AttachSize: 188
> >
> >
> >
> >
> >
> >--
> >No virus found in this incoming message.
> >Checked by AVG Free Edition.
> >Version: 7.1.409 / Virus Database: 268.13.31/522 - Release Date: 11/7/2006
>
>
> --
> No virus found in this outgoing message.
> Checked by AVG Free Edition.
> Version: 7.1.409 / Virus Database: 268.13.31/522 - Release Date: 11/7/2006
