I have a script that uses REXML to stream parse an XML file and load a
MySQL database from it. The file is about 65000 lines, 2.2 MB. With Ruby
1.8.0, it completes in about 25 seconds. I just installed Ruby 1.8.1,
and the performance is at least a factor of 20 worse.
This is on a 1 GHz P3 running Gentoo Linux. I don’t have head-to-head
comparison numbers, but I’m observing the same effect on a 3.2 GHz P4
running RedHat 9. The script runs for 16 minutes when it should have
completed in a few seconds.
I’ve commented out the database calls, so I’m just measuring the
Ruby/REXML performance.
def initialize
end
def tag_start(name, attrs)
end
def tag_end(name)
end
def text(t)
end
def cdata(d)
end
end
REXML::Document.parse_stream($stdin, Handler.new)
I’m running the tests on two different machines because it’s a pain to
switch minor versions on the same machine. But I have run tests on the
same machine to verify the results.
Quick update–I’ll post numbers later. It appears that the REXML stream
parser in Ruby 1.8.1 is worse than O(n) in the number of input lines (or
tags). Ruby 1.8.0 appears to be pretty close to O(n) (as it should).
I hope soon I won’t be the only person in this thread :-).
Quick update–I’ll post numbers later. It appears that the REXML stream
parser in Ruby 1.8.1 is worse than O(n) in the number of input lines (or
tags). Ruby 1.8.0 appears to be pretty close to O(n) (as it should).
I hope soon I won’t be the only person in this thread :-).
Steve
Steve,
One thing you can do is run your code through the profiler (ruby -r
profile yourcode.rb) and show us the results. This could help us
identify a potential bottleneck in REXML and/or Ruby itself.
Sorry for the delay. I’ve been really distracted for the past few
months with this and that. I’ll check your workaround, and your bug
report. I’ve got a ton of work to do with REXML; one reason I’ve been
taking so long to fix things is that the performance issue with 1.8.x
is really bothering me; the memory usage of REXML is much worse under
1.8 than it is under 1.6, and I’ve been blocked trying to figure that
out.
Anyway, sorry – to everybody – for being such a laggard. Things
should be better now.
One thing you can do is run your code through the profiler (ruby -r
profile yourcode.rb) and show us the results.
I’ve been running lots of tests. Unfortunately, the profiler adds a
large O(n) component to the runtime, so the nonlinear behavior doesn’t
emerge until larger n. (n is the number of tags in the input XML file.)
I have a lot of data now and I’ll have to organize it a little. The
method call counts look fine–they are all linear in n (with a small
constant offset.) One thing that does jump out, however, is that as n
goes from 1000 to 50000, the time per call for String#strip goes
increases by factor of about 10. For n = 1000, String#strip accounted
for about 6% of the total runtime. For n = 50000, it’s about 30%.
(Without the profiler, the percentages would be even higher.) That must
mean String#strip is being passed longer arguments for larger n.
String#strip is called exactly the same number of times as:
Sorry for the delay. I’ve been really distracted for the past few
months with this and that.
I know how that goes. Fortunately, Zachary Landau already had a
workaround; I just had to discover it.
Good luck with the 1.8 work. I use REXML in a couple of simple but
important applications. I’m happy with it overall. Performance, as you
note, could be better, but it’s not a show-stopper.
I’ve repeated all the tests with Ruby 1.8.0 and it sure looks like
String#strip is the problem.
To summarize, I have a bare REXML stream parser that reads its input
file and does nothing. I’m handing it an input file that looks like this
<?xml version="1.0" ?>
0
1
...
I’m varying the number of elements in the file from 100 to
50000 and calling that n. In Ruby 1.8.0, the runtime is roughly
proportional to n, as it should be. In 1.8.1, the total runtime is
greater (which is bad but not pathological) and worse than linear in n
(which is pathological).
Here’s the profile data for strip over all values of n:
In both cases, the number of calls is proportional to n, although strip
is called 2.5 times more often in 1.8.0 than 1.8.1. In 1.8.1 (only),
there is a strong increasing trend in the time per call as n increases.
I don’t suspect the problem is in strip itself. I suspect the problem is
that REXML creates a string that grows as more input is read and hands
it repeatedly to strip. I looked briefly at the REXML code and didn’t
see anything obvious. (Other than the fact that calling strip on a
buffer just to see whether it has any tokens in it is inefficient. That
just makes it slow, however, not nonlinear–unless the buffer is growing.)
I hope this is enough for someone to diagnose and fix the problem. I’ll
send the data to anyone who wants it.
Good luck with the 1.8 work. I use REXML in a couple of simple but
important applications. I’m happy with it overall. Performance, as you
note, could be better, but it’s not a show-stopper.
Yeah, well… I think REXML is at the point where I’m squeezing
optimizations out of it. Many of the optimizations I’ve tried have
made things worse, and I end up backing them out. I’m not out of
ideas yet, but I doubt that we’ll ever see an order of magnitude
improvement, until I write an optional back-end that makes use of a
native parsing library.
One thing you can do, though, is drop out of the advanced API and use
the light parser. The light parser API is pretty limited, at the
moment; it’ll be a lot more useful when I get it hooked into XPath,
but it should nearly double the speed of parsing. It does this by
eschewing the heavy, traditional, REXML tree objects. The price is
that you don’t get all of the neat helper functions; at that point,
you’re dealing with pretty raw data. The API isn’t well documented
yet, but it works.
I'm varying the number of <response> elements in the file from 100 to
50000 and calling that n. In Ruby 1.8.0, the runtime is roughly
proportional to n, as it should be. In 1.8.1, the total runtime is
greater (which is bad but not pathological) and worse than linear in n
(which is pathological).
Here's the profile data for strip over all values of n:
In both cases, the number of calls is proportional to n, although strip
is called 2.5 times more often in 1.8.0 than 1.8.1. In 1.8.1 (only),
there is a strong increasing trend in the time per call as n increases.
I don't suspect the problem is in strip itself. I suspect the problem is
that REXML creates a string that grows as more input is read and hands
it repeatedly to strip. I looked briefly at the REXML code and didn't
see anything obvious. (Other than the fact that calling strip on a
buffer just to see whether it has any tokens in it is inefficient. That
just makes it slow, however, not nonlinear--unless the buffer is growing.)
I hope this is enough for someone to diagnose and fix the problem. I'll
send the data to anyone who wants it.
You are the only one talking with you in this thread. Do you really
think that someone else will do your job ? Do it for free ? Welcome to the
world of OpenSource Software (call it non active supported software),
write a personal email to the REXML author hope he respond, otherwise
fix it and hope he adds it to the code, otherwise publish your source
patch and hope more people use it then the original branch, otherwise repeat
the same thing every few month as long as you need this library.
I’ve repeated all the tests with Ruby 1.8.0 and it sure looks like
String#strip is the problem.
To summarize, I have a bare REXML stream parser that reads its input
file and does nothing. I’m handing it an input file that looks like this
<?xml version="1.0" ?>
0
1
...
I’m varying the number of elements in the file from 100 to
50000 and calling that n. In Ruby 1.8.0, the runtime is roughly
proportional to n, as it should be. In 1.8.1, the total runtime is
greater (which is bad but not pathological) and worse than linear in n
(which is pathological).
Here’s the profile data for strip over all values of n:
In both cases, the number of calls is proportional to n, although strip
is called 2.5 times more often in 1.8.0 than 1.8.1. In 1.8.1 (only),
there is a strong increasing trend in the time per call as n increases.
I don’t suspect the problem is in strip itself. I suspect the problem is
that REXML creates a string that grows as more input is read and hands
it repeatedly to strip. I looked briefly at the REXML code and didn’t
see anything obvious. (Other than the fact that calling strip on a
buffer just to see whether it has any tokens in it is inefficient. That
just makes it slow, however, not nonlinear–unless the buffer is growing.)
I hope this is enough for someone to diagnose and fix the problem. I’ll
send the data to anyone who wants it.
You are the only one talking with you in this thread. Do you really
think that someone else will do your job ? Do it for free ? Welcome to the
world of OpenSource Software (call it non active supported software),
write a personal email to the REXML author hope he respond, otherwise
fix it and hope he adds it to the code, otherwise publish your source
patch and hope more people use it then the original branch, otherwise repeat
the same thing every few month as long as you need this library.
Yes, no doubt, i’m not an open source fan.
Tho i think this thread would be more relevant on the rexml ml, I have
been reading the thread and it does interest me (I’ve not been
responding because i prob could not help). I think you’ve been dilegent
and helpful in pointing out either a bug or performance hit. I did not
think you were asking for anything for free or whatever - besides, to
me, that’s what these ml’s are for, in any case. Good luck in
finding/helping find the culprit for the performance problem.
You are the only one talking with you in this thread.
Other people are reading it. You read it. It gets archived.
Do you really think that someone else will do your job ? Do it for free ?
It’s not my job. I observed something interesting and I reported it.
Welcome to the
world of OpenSource Software (call it non active supported software),
write a personal email to the REXML author hope he respond, otherwise
fix it and hope he adds it to the code, otherwise publish your source
patch and hope more people use it then the original branch, otherwise repeat
the same thing every few month as long as you need this library.
Thanks for the advice. I reported this problem Saturday morning US
Pacific time. It is now Sunday morning. In the intervening day others
and I have (1) confirmed my observations, (2) carried on an off-list
discussion of the problem, and (3) applied a patch developed by kapheine
that completely solves the problem.
Open source == Learn it yourself by trying and get it right if you can,
with as much help as others are willing to give you in your pursuit.
That is one reason why I’ve loved Ruby so much. The community is a
community, not just a collection of individuals with similar traits.
I don’t think I have a clue about programming yet, but there are those who
are willing to help me get a clue.
I will forever be grateful to those who help - no matter how large or small
their help might end up being.
You are the only one talking with you in this thread. Do you really
think that someone else will do your job ? Do it for free ? Welcome to the
world of OpenSource Software (call it non active supported software),
write a personal email to the REXML author hope he respond, otherwise
fix it and hope he adds it to the code, otherwise publish your source
patch and hope more people use it then the original branch, otherwise
repeat
the same thing every few month as long as you need this library.
You are the only one talking with you in this thread. Do you really
think that someone else will do your job ? Do it for free ?
Hmm, that’s the third or fifth message I am reading from you today,
and they all were negative and/or rude. You don’t seem to like Ruby or
its community much. Someone asked you, in another thread, if you are a
troll. I believe the question was on target.
Yes, no doubt, i’m not an open source fan.
Quite apparently not one of decent manners, either.
Well done. I was just about to send a private note thanking you for
your diligence and good investigation and reporting. Lothar’s poor
public comment, however, has compelled me to make my comment public.
So thank you
Cheers,
Gavin
···
On Monday, February 2, 2004, 6:29:30 AM, Steven wrote:
Welcome to the
world of OpenSource Software (call it non active supported software),
write a personal email to the REXML author hope he respond, otherwise
fix it and hope he adds it to the code, otherwise publish your source
patch and hope more people use it then the original branch, otherwise repeat
the same thing every few month as long as you need this library.
Thanks for the advice. I reported this problem Saturday morning US
Pacific time. It is now Sunday morning. In the intervening day others
and I have (1) confirmed my observations, (2) carried on an off-list
discussion of the problem, and (3) applied a patch developed by kapheine
that completely solves the problem.
Open Source is wonderful. It allows folks to focus on the best possible
service they can offer and not on “mine mine mine” concept that children at
the age of 2 get so well. Perhaps that is my new fav analogy of Open Source
vs. Closed Source developers. Open Sourcer’s are compared to children older
then that of 2 who have learned to share, while Closed Source developers can
be compared to children at the age of 2. I wonder what’ll happen when
developers reach adolesence. =)
Zach
···
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.576 / Virus Database: 365 - Release Date: 1/30/2004
Here’s something I never thought I’d get from working on an O/S project:
I’m in the U.S., but I’ve communicated via email with Ruby users in Spain,
Germany, England, Austrailia, and Japan. We’re not just a community, but a global community.
···
On Mon, 02 Feb 2004 07:12:26 +0900, Rich wrote:
That is one reason why I’ve loved Ruby so much. The community is a
community, not just a collection of individuals with similar traits.
Quoteing mvondung@gmx.net, on Wed, Feb 04, 2004 at 12:35:02AM +0900:
You are the only one talking with you in this thread. Do you really
think that someone else will do your job ? Do it for free ?
Hmm, that’s the third or fifth message I am reading from you today,
and they all were negative and/or rude. You don’t seem to like Ruby or
its community much. Someone asked you, in another thread, if you are a
troll. I believe the question was on target.
Yes, no doubt, i’m not an open source fan.
Which is particularly weird, given that Lothar is the principle of a
company which claims to be developing IDEs for open source scripting
languages (betas expected at end of the year…)!