Ruby 1.8.1 REXML performance


(Steven Jenkins) #1

I have a script that uses REXML to stream parse an XML file and load a
MySQL database from it. The file is about 65000 lines, 2.2 MB. With Ruby
1.8.0, it completes in about 25 seconds. I just installed Ruby 1.8.1,
and the performance is at least a factor of 20 worse.

This is on a 1 GHz P3 running Gentoo Linux. I don’t have head-to-head
comparison numbers, but I’m observing the same effect on a 3.2 GHz P4
running RedHat 9. The script runs for 16 minutes when it should have
completed in a few seconds.

I’ve commented out the database calls, so I’m just measuring the
Ruby/REXML performance.

Any ideas?


(Steven Jenkins) #2

Here’s a little more data. I’ve stripped the application down to a
bare-bones stream parser:

···

#!/usr/bin/env ruby

require "rexml/document"
require “rexml/streamlistener”

class Handler
include REXML::StreamListener

 def initialize
 end

 def tag_start(name, attrs)
 end

 def tag_end(name)
 end

 def text(t)
 end

 def cdata(d)
 end

end

REXML::Document.parse_stream($stdin, Handler.new)

I’m running the tests on two different machines because it’s a pain to
switch minor versions on the same machine. But I have run tests on the
same machine to verify the results.

To parse a 65,000 line XML file:

machine1 machine2
1 GHz 3.2 GHz
Ruby 1.8.0 Ruby 1.8.1
17 seconds 515 seconds

Steve


(Steven Jenkins) #3

Quick update–I’ll post numbers later. It appears that the REXML stream
parser in Ruby 1.8.1 is worse than O(n) in the number of input lines (or
tags). Ruby 1.8.0 appears to be pretty close to O(n) (as it should).

I hope soon I won’t be the only person in this thread :-).

Steve


(Steven Jenkins) #4

For the record, here’s a workaround:

http://butter.homeunix.net/~kapheine/source.rb

Here’s the bug report:

http://www.ruby-lang.org/cgi-bin/ruby-bugs/incoming?id=1271;expression=rexml;page=1;user=guest

Thanks to all who helped on IRC.

Steve


(Daniel Berger) #5

Steven Jenkins steven.jenkins@ieee.org wrote in message news:401C268F.80805@ieee.org

Quick update–I’ll post numbers later. It appears that the REXML stream
parser in Ruby 1.8.1 is worse than O(n) in the number of input lines (or
tags). Ruby 1.8.0 appears to be pretty close to O(n) (as it should).

I hope soon I won’t be the only person in this thread :-).

Steve

Steve,

One thing you can do is run your code through the profiler (ruby -r
profile yourcode.rb) and show us the results. This could help us
identify a potential bottleneck in REXML and/or Ruby itself.

Regards,

Dan


(SER) #6

Steven Jenkins steven.jenkins@ieee.org wrote in message news:401D5E09.9030401@ieee.org

For the record, here’s a workaround:

http://butter.homeunix.net/~kapheine/source.rb

Here’s the bug report:

http://www.ruby-lang.org/cgi-bin/ruby-bugs/incoming?id=1271;expression=rexml;page=1;user=guest

Steve,

Sorry for the delay. I’ve been really distracted for the past few
months with this and that. I’ll check your workaround, and your bug
report. I’ve got a ton of work to do with REXML; one reason I’ve been
taking so long to fix things is that the performance issue with 1.8.x
is really bothering me; the memory usage of REXML is much worse under
1.8 than it is under 1.6, and I’ve been blocked trying to figure that
out.

Anyway, sorry – to everybody – for being such a laggard. Things
should be better now.

— SER


(Steven Jenkins) #7

Daniel Berger wrote:

One thing you can do is run your code through the profiler (ruby -r
profile yourcode.rb) and show us the results.

I’ve been running lots of tests. Unfortunately, the profiler adds a
large O(n) component to the runtime, so the nonlinear behavior doesn’t
emerge until larger n. (n is the number of tags in the input XML file.)

I have a lot of data now and I’ll have to organize it a little. The
method call counts look fine–they are all linear in n (with a small
constant offset.) One thing that does jump out, however, is that as n
goes from 1000 to 50000, the time per call for String#strip goes
increases by factor of about 10. For n = 1000, String#strip accounted
for about 6% of the total runtime. For n = 50000, it’s about 30%.
(Without the profiler, the percentages would be even higher.) That must
mean String#strip is being passed longer arguments for larger n.

String#strip is called exactly the same number of times as:

REXML::Parsers::BaseParser#pull
REXML::Parsers::BaseParser#has_next?
REXML::Parsers::BaseParser#empty?
REXML::IOSource#empty?
REXML::Source#empty?

IOSource#empty? and Source#empty both call String#strip directly, so I
suspect that’s where the problem is.

Steve


(Steven Jenkins) #8

Sean Russell wrote:

Sorry for the delay. I’ve been really distracted for the past few
months with this and that.

I know how that goes. Fortunately, Zachary Landau already had a
workaround; I just had to discover it.

Good luck with the 1.8 work. I use REXML in a couple of simple but
important applications. I’m happy with it overall. Performance, as you
note, could be better, but it’s not a show-stopper.

Steve


(Steven Jenkins) #9

I’ve repeated all the tests with Ruby 1.8.0 and it sure looks like
String#strip is the problem.

To summarize, I have a bare REXML stream parser that reads its input
file and does nothing. I’m handing it an input file that looks like this

<?xml version="1.0" ?> 0 1 ...

I’m varying the number of elements in the file from 100 to
50000 and calling that n. In Ruby 1.8.0, the runtime is roughly
proportional to n, as it should be. In 1.8.1, the total runtime is
greater (which is bad but not pathological) and worse than linear in n
(which is pathological).

Here’s the profile data for strip over all values of n:

            time/call     % total runtime
           min    max     min      max

Ruby 1.8.0 .01 .02 .82 2.48
Ruby 1.8.1 .10 1.21 5.08 29.04

In both cases, the number of calls is proportional to n, although strip
is called 2.5 times more often in 1.8.0 than 1.8.1. In 1.8.1 (only),
there is a strong increasing trend in the time per call as n increases.

I don’t suspect the problem is in strip itself. I suspect the problem is
that REXML creates a string that grows as more input is read and hands
it repeatedly to strip. I looked briefly at the REXML code and didn’t
see anything obvious. (Other than the fact that calling strip on a
buffer just to see whether it has any tokens in it is inefficient. That
just makes it slow, however, not nonlinear–unless the buffer is growing.)

I hope this is enough for someone to diagnose and fix the problem. I’ll
send the data to anyone who wants it.

Steve


(SER) #10

Steven Jenkins steven.jenkins@ieee.org wrote in message news:402B8EFE.6000202@ieee.org

Good luck with the 1.8 work. I use REXML in a couple of simple but
important applications. I’m happy with it overall. Performance, as you
note, could be better, but it’s not a show-stopper.

Yeah, well… I think REXML is at the point where I’m squeezing
optimizations out of it. Many of the optimizations I’ve tried have
made things worse, and I end up backing them out. I’m not out of
ideas yet, but I doubt that we’ll ever see an order of magnitude
improvement, until I write an optional back-end that makes use of a
native parsing library.

One thing you can do, though, is drop out of the advanced API and use
the light parser. The light parser API is pretty limited, at the
moment; it’ll be a lot more useful when I get it hooked into XPath,
but it should nearly double the speed of parsing. It does this by
eschewing the heavy, traditional, REXML tree objects. The price is
that you don’t get all of the neat helper functions; at that point,
you’re dealing with pretty raw data. The API isn’t well documented
yet, but it works.

— SER


(Lothar Scholz) #11

Hello Steven,

Sunday, February 1, 2004, 6:17:27 PM, you wrote:

I've repeated all the tests with Ruby 1.8.0 and it sure looks like
String#strip is the problem.

To summarize, I have a bare REXML stream parser that reads its input
file and does nothing. I'm handing it an input file that looks like this

<?xml version="1.0" ?>
<multistatus>
<response>
<dsref>0</dsref>
</response>
<response>
<dsref>1</dsref>
</response>
...
</multistatus>

I'm varying the number of <response> elements in the file from 100 to
50000 and calling that n. In Ruby 1.8.0, the runtime is roughly
proportional to n, as it should be. In 1.8.1, the total runtime is
greater (which is bad but not pathological) and worse than linear in n
(which is pathological).

Here's the profile data for strip over all values of n:

                time/call % total runtime
               min max min max

Ruby 1.8.0 .01 .02 .82 2.48
Ruby 1.8.1 .10 1.21 5.08 29.04

In both cases, the number of calls is proportional to n, although strip
is called 2.5 times more often in 1.8.0 than 1.8.1. In 1.8.1 (only),
there is a strong increasing trend in the time per call as n increases.

I don't suspect the problem is in strip itself. I suspect the problem is
that REXML creates a string that grows as more input is read and hands
it repeatedly to strip. I looked briefly at the REXML code and didn't
see anything obvious. (Other than the fact that calling strip on a
buffer just to see whether it has any tokens in it is inefficient. That
just makes it slow, however, not nonlinear--unless the buffer is growing.)

I hope this is enough for someone to diagnose and fix the problem. I'll
send the data to anyone who wants it.

You are the only one talking with you in this thread. Do you really
think that someone else will do your job ? Do it for free ? Welcome to the
world of OpenSource Software (call it non active supported software),
write a personal email to the REXML author hope he respond, otherwise
fix it and hope he adds it to the code, otherwise publish your source
patch and hope more people use it then the original branch, otherwise repeat
the same thing every few month as long as you need this library.

Yes, no doubt, i'm not an open source fan.

···

--
Best regards,
Lothar mailto:mailinglists@scriptolutions.com


(Martin S. Weber) #12

(…)
You are the only one talking with you in this thread. Do you really
think that someone else will do your job ?

His job ? it’s part of the base system after all.

Do it for free ?

So we should ask for money if you come here and ask for help ?

(… brainless babble stripped …)

Yes, no doubt, i’m not an open source fan.

Go use Java on solaris then. (http://www.archub.org/javamemo.txt).
And pay Matz for ruby.

Best regards,

-Martin

···

On Mon, Feb 02, 2004 at 04:01:09AM +0900, Lothar Scholz wrote:


(Paul Vudmaska) #13

Lothar Scholz wrote:

Hello Steven,

Sunday, February 1, 2004, 6:17:27 PM, you wrote:

I’ve repeated all the tests with Ruby 1.8.0 and it sure looks like
String#strip is the problem.

To summarize, I have a bare REXML stream parser that reads its input
file and does nothing. I’m handing it an input file that looks like this

<?xml version="1.0" ?> 0 1 ...

I’m varying the number of elements in the file from 100 to
50000 and calling that n. In Ruby 1.8.0, the runtime is roughly
proportional to n, as it should be. In 1.8.1, the total runtime is
greater (which is bad but not pathological) and worse than linear in n
(which is pathological).

Here’s the profile data for strip over all values of n:

            time/call     % total runtime
           min    max     min      max

Ruby 1.8.0 .01 .02 .82 2.48
Ruby 1.8.1 .10 1.21 5.08 29.04

In both cases, the number of calls is proportional to n, although strip
is called 2.5 times more often in 1.8.0 than 1.8.1. In 1.8.1 (only),
there is a strong increasing trend in the time per call as n increases.

I don’t suspect the problem is in strip itself. I suspect the problem is
that REXML creates a string that grows as more input is read and hands
it repeatedly to strip. I looked briefly at the REXML code and didn’t
see anything obvious. (Other than the fact that calling strip on a
buffer just to see whether it has any tokens in it is inefficient. That
just makes it slow, however, not nonlinear–unless the buffer is growing.)

I hope this is enough for someone to diagnose and fix the problem. I’ll
send the data to anyone who wants it.

You are the only one talking with you in this thread. Do you really
think that someone else will do your job ? Do it for free ? Welcome to the
world of OpenSource Software (call it non active supported software),
write a personal email to the REXML author hope he respond, otherwise
fix it and hope he adds it to the code, otherwise publish your source
patch and hope more people use it then the original branch, otherwise repeat
the same thing every few month as long as you need this library.

Yes, no doubt, i’m not an open source fan.

Tho i think this thread would be more relevant on the rexml ml, I have
been reading the thread and it does interest me (I’ve not been
responding because i prob could not help). I think you’ve been dilegent
and helpful in pointing out either a bug or performance hit. I did not
think you were asking for anything for free or whatever - besides, to
me, that’s what these ml’s are for, in any case. Good luck in
finding/helping find the culprit for the performance problem.


(Steven Jenkins) #14

Lothar Scholz wrote:

You are the only one talking with you in this thread.

Other people are reading it. You read it. It gets archived.

Do you really think that someone else will do your job ? Do it for free ?

It’s not my job. I observed something interesting and I reported it.

Welcome to the
world of OpenSource Software (call it non active supported software),
write a personal email to the REXML author hope he respond, otherwise
fix it and hope he adds it to the code, otherwise publish your source
patch and hope more people use it then the original branch, otherwise repeat
the same thing every few month as long as you need this library.

Thanks for the advice. I reported this problem Saturday morning US
Pacific time. It is now Sunday morning. In the intervening day others
and I have (1) confirmed my observations, (2) carried on an off-list
discussion of the problem, and (3) applied a patch developed by kapheine
that completely solves the problem.

And we all did it for free. It was even fun.

Yes, no doubt, i’m not an open source fan.

So what?

Steve


(Rich) #15

Open source == Learn it yourself by trying and get it right if you can,
with as much help as others are willing to give you in your pursuit.

That is one reason why I’ve loved Ruby so much. The community is a
community, not just a collection of individuals with similar traits.

I don’t think I have a clue about programming yet, but there are those who
are willing to help me get a clue.

I will forever be grateful to those who help - no matter how large or small
their help might end up being.

You are the only one talking with you in this thread. Do you really
think that someone else will do your job ? Do it for free ? Welcome to the
world of OpenSource Software (call it non active supported software),
write a personal email to the REXML author hope he respond, otherwise
fix it and hope he adds it to the code, otherwise publish your source
patch and hope more people use it then the original branch, otherwise
repeat
the same thing every few month as long as you need this library.

Yes, no doubt, i’m not an open source fan.

No doubt I am an open source and Ruby fan.

-Rich


(Michael Vondung) #16

You are the only one talking with you in this thread. Do you really
think that someone else will do your job ? Do it for free ?

Hmm, that’s the third or fifth message I am reading from you today,
and they all were negative and/or rude. You don’t seem to like Ruby or
its community much. Someone asked you, in another thread, if you are a
troll. I believe the question was on target.

Yes, no doubt, i’m not an open source fan.

Quite apparently not one of decent manners, either.

M.

···

On Mon, 2 Feb 2004 04:01:09 +0900, Lothar Scholz mailinglists@scriptolutions.com wrote:


(Gavin Sinclair) #17

Well done. I was just about to send a private note thanking you for
your diligence and good investigation and reporting. Lothar’s poor
public comment, however, has compelled me to make my comment public.

So thank you :slight_smile:

Cheers,
Gavin

···

On Monday, February 2, 2004, 6:29:30 AM, Steven wrote:

Welcome to the
world of OpenSource Software (call it non active supported software),
write a personal email to the REXML author hope he respond, otherwise
fix it and hope he adds it to the code, otherwise publish your source
patch and hope more people use it then the original branch, otherwise repeat
the same thing every few month as long as you need this library.

Thanks for the advice. I reported this problem Saturday morning US
Pacific time. It is now Sunday morning. In the intervening day others
and I have (1) confirmed my observations, (2) carried on an off-list
discussion of the problem, and (3) applied a patch developed by kapheine
that completely solves the problem.


(Zach Dennis) #18

Open Source is wonderful. It allows folks to focus on the best possible
service they can offer and not on “mine mine mine” concept that children at
the age of 2 get so well. Perhaps that is my new fav analogy of Open Source
vs. Closed Source developers. Open Sourcer’s are compared to children older
then that of 2 who have learned to share, while Closed Source developers can
be compared to children at the age of 2. I wonder what’ll happen when
developers reach adolesence. =)

Zach

···

Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.576 / Virus Database: 365 - Release Date: 1/30/2004


(Tim Hunter) #19

Here’s something I never thought I’d get from working on an O/S project:
I’m in the U.S., but I’ve communicated via email with Ruby users in Spain,
Germany, England, Austrailia, and Japan. We’re not just a community, but a
global community.

···

On Mon, 02 Feb 2004 07:12:26 +0900, Rich wrote:

That is one reason why I’ve loved Ruby so much. The community is a
community, not just a collection of individuals with similar traits.


(Sam Roberts) #20

Quoteing mvondung@gmx.net, on Wed, Feb 04, 2004 at 12:35:02AM +0900:

You are the only one talking with you in this thread. Do you really
think that someone else will do your job ? Do it for free ?

Hmm, that’s the third or fifth message I am reading from you today,
and they all were negative and/or rude. You don’t seem to like Ruby or
its community much. Someone asked you, in another thread, if you are a
troll. I believe the question was on target.

Yes, no doubt, i’m not an open source fan.

Which is particularly weird, given that Lothar is the principle of a
company which claims to be developing IDEs for open source scripting
languages (betas expected at end of the year…)!

Like Ruby:

http://www.ruby-ide.com/arachno_ruby.php

He’s not much of a salesman for himself, I’d say!

Cheers,
Sam

···

On Mon, 2 Feb 2004 04:01:09 +0900, Lothar Scholz mailinglists@scriptolutions.com wrote: