Speed up suggestions

Thanks all for speedup tips.

I have tried all of them and the fastest one is attached.
Results:
ruby : 115 sec → 62 sec (wow :slight_smile:
python : 60 sec → 53 sec

The Ruby speedup is really impressive. About half of the improvement comes from using a
regex to eliminate comment lines and half from a different way of extracting the second field.
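
A minimal sketch of the two ideas just mentioned (dropping comment lines with a regex, and pulling out only the second field instead of splitting the whole line). The attached parse.rb is not reproduced here, so everything below is an assumption for illustration: the sample lines, the "#" comment marker, the tab separator, and the names COMMENT, SECOND_FIELD and lines_with_second_field are all hypothetical.

```ruby
# Hypothetical sample lines; the real web.log layout is only assumed here
# (tab-separated fields, comment lines starting with "#").
SAMPLE_LINES = [
  "# this is a comment line\n",
  "2002-09-20\tGET\t/index.html\t200\n",
  "2002-09-20\tPOST\t/form.cgi\t500\n",
]

COMMENT      = /\A#/                     # assumed comment marker
SECOND_FIELD = /\A[^\t]*\t([^\t\n]*)/    # captures only the second field

# Returns the lines whose second field equals `wanted`,
# skipping comment lines and never splitting the full line.
def lines_with_second_field(lines, wanted)
  lines.select do |line|
    next false if line =~ COMMENT
    m = SECOND_FIELD.match(line)
    m && m[1] == wanted
  end
end

puts lines_with_second_field(SAMPLE_LINES, "GET")
```

Whether this beats a plain split depends on the data and the Ruby version, so it is worth timing both on a real file.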

I have also attached a snippet from my web.log file.
The information in it is fake, but the structure is the same.

This solution seems to be the fastest (or close to the fastest) way to get the lines whose 2nd field
satisfies the condition.

But in general there could be conditions on more fields. What then?
Use String.split to get the fields and then match the individual fields, or build an all-in-one regex and
try to match the whole line?

<snippet_1>
pom = line.split("\t")
if pom[1] =~ /expr_to_match_1/ and pom[3] =~ /expr_to_match_2/ and …
  do_something
end
</snippet_1>

OR?

<snippet_2>
if line =~ /expr_to_match_1 … expr_to_match_2 … expr_to_match_3/
  do_something
end
</snippet_2>

Thanks
Tom

parse.rb (392 Bytes)

web.log (1.71 KB)

parse.py (572 Bytes)

···

— Joseph McDonald joe@vpop.net wrote:

can you give a few examples of the lines in the logfile?

thanks,
-joe

Friday, September 20, 2002, 9:22:34 AM, you wrote:

Hello,

I have written a simple script to analyze a log file and (just for fun) I have written
exactly the same thing in Python to see the difference and …
Python is almost twice as fast doing the same job :expressionless: (???)
See the attached files for the sources.
Environment: P4 1.8 GHz, 256 MB, WinXP Pro. Python 2.2.1, Ruby 1.7.2-4 - the Pragmatic
distribution.
The analyzed file is about 420 MB; Python processes it in about 60 sec and Ruby in about
115 sec.
Do you have any suggestions on how to speed up the Ruby code?

Regards

Tom


Do you Yahoo!?
New DSL Internet Access from SBC & Yahoo!
http://sbc.yahoo.com


Best regards,
Joseph mailto:joe@vpop.net



Thanks all for speedup tips.

I have tried all of them and the fastest one is attached.
Results:
ruby : 115 sec → 62 sec (wow :slight_smile:
python : 60 sec → 53 sec

Are you sure you included the right script? I just expanded your weblog
to 4.5 megs and ran the script you included and got a time of 80
seconds (via time). I then added the rest of the changes I suggested
and ran my version in 26 seconds (via time). Included below are the
scripts and the output. The changes were:

  1. get rid of all calls to index.
  2. put all regexen inline.

But there could be generally conditions put on more fields. What then?
Use String.split to get the fields and then match single fields or
build a all_in_one regex and try to match the whole line?

That really depends. If you can order the conditions to exclusion in
such a way that you can avoid the split, you probably want to go that
way. But I’d just measure and see.
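
A rough sketch of how "measure and see" could look for the split-versus-single-regex question, using the standard Benchmark library. The sample line, the patterns and the helper names (match_by_split?, match_by_regex?) are hypothetical and only stand in for the elided expressions in the snippets above; the numbers will differ per machine and Ruby version.

```ruby
require "benchmark"

# Hypothetical tab-separated line and patterns, for illustration only.
LINE       = "10.0.0.1\tGET\t/index.html\t200\n"
FIELD_1    = /\AGET\z/
FIELD_3    = /\A2\d\d\z/
# One pattern over the whole line: field 1 is GET, field 3 is a 2xx status.
WHOLE_LINE = /\A[^\t]*\tGET\t[^\t]*\t2\d\d$/

# Variant 1: split into fields, then test each field separately.
def match_by_split?(line)
  fields = line.chomp.split("\t")
  !!(fields[1] =~ FIELD_1 && fields[3] =~ FIELD_3)
end

# Variant 2: test the whole line with a single regex, no split.
def match_by_regex?(line)
  !!(line =~ WHOLE_LINE)
end

n = 100_000
Benchmark.bm(12) do |x|
  x.report("split")       { n.times { match_by_split?(LINE) } }
  x.report("whole regex") { n.times { match_by_regex?(LINE) } }
end
```

A fair comparison should also include lines that fail the first condition, since the split variant always pays for the split while the regex can give up early.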

parse.rb.orig (373 Bytes)

parse.rb.time (1.18 KB)

parse.rb (309 Bytes)

parse.rb.orig.time (1.56 KB)

···

On Monday, September 23, 2002, at 12:39 AM, Tomas Brixi wrote:

Well… yes and no :wink:
I looked back at the code you sent on Saturday and noticed that I had just copied it without changing
the separator from : to \t, so it split nothing and ran much longer.

Now I have run your script with inline regexen and it runs in about 60 secs, which is only slightly better
than using the index method, but the code is clearer.

Tom

···

— Ryan Davis ryand@zenspider.com wrote:

On Monday, September 23, 2002, at 12:39 AM, Tomas Brixi wrote:

Thanks all for speedup tips.

I have tried all of them and the fastest one is attached.
Results:
ruby : 115 sec → 62 sec (wow :slight_smile:
python : 60 sec → 53 sec

Are you sure you included the right script? I just expanded your weblog
to 4.5 megs and ran the script you included and got a time of 80
seconds (via time). I then added the rest of the changes I suggested
and ran my version in 26 seconds (via time). Included below are the
scripts and the output. The changes were:

  1. get rid of all calls to index.
  2. put all regexen inline.

But there could be generally conditions put on more fields. What then?
Use String.split to get the fields and then match single fields or
build a all_in_one regex and try to match the whole line?

That really depends. If you can order the conditions to exclusion in
such a way that you can avoid the split, you probably want to go that
way. But I’d just measure and see.


In my experience, Python is faster than Ruby. I made a small script to
help me do crossword puzzles. The first thing I do, is open a
dictionary file and put it in memory:

Ruby: _wordList = open('/home/vince/fr_dict.txt', 'r').read.split('\n')
Python: _wordList = (open('/home/vince/fr_dict.txt', 'r').read()).split('\n')

Same thing really. Here are the outputs of time:
Ruby:
ruby/crosswords.rb 1.52s user 0.13s system 68% cpu 2.407 total

Python:
python/crossword.py 0.52s user 0.07s system 31% cpu 1.875 total

Python went over that file much more quickly. Maybe my technique for
opening a file in Ruby could be accelerated, but I think that the speed
of the Python interpreter is just greater than that of Ruby, that’s all.

Vince

···

Vincent Foley-Bourgon
Email: vinfoley@iquebec.com
Homepage: http://darkhost.mine.nu:81

Ruby: _wordList = open('/home/vince/fr_dict.txt', 'r').read.split('\n')
Python: _wordList = (open('/home/vince/fr_dict.txt', 'r').read()).split('\n')
[…]
ruby/crosswords.rb 1.52s user 0.13s system 68% cpu 2.407 total
[…]
python/crossword.py 0.52s user 0.07s system 31% cpu 1.875 total

Yes, looks like python is faster than ruby, but it’s possible to
speed-up both - python and ruby in your task, by using readlines
method. Take a look (i used /etc/termcap-BSD instead of your
fr_dict.txt) :wink:

% cat py.py
blah = open('/etc/termcap-BSD', 'r').readlines()
% cat peigrek.py
blah = open('/etc/termcap-BSD', 'r').read().split('\n')
% cat rb.rb
blah = open('/etc/termcap-BSD', 'r').readlines
% cat ryby.rb
blah = open('/etc/termcap-BSD', 'r').read.split('\n')

So, I’m assuming that py.py and rb.rb should work faster than
peigrek.py and ryby.rb, because they’re using readlines instead
of read and split methods. Let’s take a look at the results:

% time python py.py; time python peigrek.py; time ruby rb.rb; time ruby ryby.rb
python py.py 0.23s user 0.05s system 132% cpu 0.212 total
python peigrek.py 0.23s user 0.06s system 127% cpu 0.227 total

py.py is faster than peigrek.py - good

ruby rb.rb 0.74s user 0.06s system 108% cpu 0.735 total
ruby ryby.rb 0.84s user 0.08s system 109% cpu 0.841 total

rb.rb is faster than ryby.rb

It’s hard to overlook that python is much faster than ruby… :confused:

% time ruby rb.rb; time ruby ryby.rb; time python py.py; time python peigrek.py
ruby rb.rb 0.78s user 0.04s system 101% cpu 0.808 total
ruby ryby.rb 0.87s user 0.04s system 105% cpu 0.861 total

The same - rb.rb is faster than ryby.rb

python py.py 0.24s user 0.04s system 126% cpu 0.222 total
python peigrek.py 0.22s user 0.07s system 128% cpu 0.225 total

And same here.

So, we can speed both up, but even then python is much, much faster… ;]

Any ideas to speed it up even more ? :slight_smile:

ps. anyway, hi, I’m quite new here, since I subscribed to the
list yesterday :wink:

···

On Mon, Sep 23, 2002 at 11:44:30PM +0900, Vincent Foley wrote:


[ Wojtek gminick Walczak ][ http://gminick.linuxsecurity.pl/ ]
[ gminick (at) hacker.pl ][ gminick (at) underground.org.pl/ ]

Looks like

File.readlines(filename)

is a bit faster than ways with open/readlines or open/read/split.

···

On Mon, Sep 23, 2002 at 11:44:30PM +0900, Vincent Foley wrote:

Python went over that file much more quickly. Maybe my technique for
opening a file in Ruby could be accelerated,

[ Wojtek gminick Walczak ][ http://gminick.linuxsecurity.pl/ ]
[ gminick (at) hacker.pl ][ gminick (at) underground.org.pl/ ]

Ruby: _wordList = open('/home/vince/fr_dict.txt', 'r').read.split('\n')
Python: _wordList = (open('/home/vince/fr_dict.txt', 'r').read()).split('\n')
[…]
ruby/crosswords.rb 1.52s user 0.13s system 68% cpu 2.407 total
[…]
python/crossword.py 0.52s user 0.07s system 31% cpu 1.875 total

Yes, looks like python is faster than ruby, but it’s possible to
speed-up both - python and ruby in your task, by using readlines
method. Take a look (i used /etc/termcap-BSD instead of your
fr_dict.txt) :wink:

% cat py.py
blah = open('/etc/termcap-BSD', 'r').readlines()
% cat peigrek.py
blah = open('/etc/termcap-BSD', 'r').read().split('\n')
% cat rb.rb
blah = open('/etc/termcap-BSD', 'r').readlines
% cat ryby.rb
blah = open('/etc/termcap-BSD', 'r').read.split('\n')

So, I’m assuming that py.py and rb.rb should work faster than
peigrek.py and ryby.rb, because they’re using readlines instead
of read and split methods. Let’s take a look at the results:

% time python py.py; time python peigrek.py; time ruby rb.rb; time ruby ryby.rb
python py.py 0.23s user 0.05s system 132% cpu 0.212 total
python peigrek.py 0.23s user 0.06s system 127% cpu 0.227 total

py.py is faster than peigrek.py - good

ruby rb.rb 0.74s user 0.06s system 108% cpu 0.735 total
ruby ryby.rb 0.84s user 0.08s system 109% cpu 0.841 total

rb.rb is faster than ryby.rb

It’s hard to overlook that python is much faster than ruby… :confused:

% time ruby rb.rb; time ruby ryby.rb; time python py.py; time python peigrek.py
ruby rb.rb 0.78s user 0.04s system 101% cpu 0.808 total
ruby ryby.rb 0.87s user 0.04s system 105% cpu 0.861 total

The same - rb.rb is faster than ryby.rb

python py.py 0.24s user 0.04s system 126% cpu 0.222 total
python peigrek.py 0.22s user 0.07s system 128% cpu 0.225 total

And same here.

So, we can speed both up, but even then python is much, much faster… ;]

As I mentioned in another post (I sent it while the news-to-mailing-list gateway
was down, so it may have been lost), ruby works faster
(than ruby ;)) with:

File.readlines('/etc/termcap-BSD')

Any ideas to speed it up even more ? :slight_smile:

ps. anyway, hi, I’m quite new here, since I subscribed to the
list yesterday :wink:

···

On Mon, Sep 23, 2002 at 11:44:30PM +0900, Vincent Foley wrote:


[ Wojtek gminick Walczak ][ http://gminick.linuxsecurity.pl/ ]
[ gminick (at) hacker.pl ][ gminick (at) underground.org.pl/ ]

Ruby: _wordList = open('/home/vince/fr_dict.txt', 'r').read.split('\n')
Python: _wordList = (open('/home/vince/fr_dict.txt', 'r').read()).split('\n')

Same thing really. Here are the outputs of time:
Ruby:
ruby/crosswords.rb 1.52s user 0.13s system 68% cpu 2.407 total

Python:
python/crossword.py 0.52s user 0.07s system 31% cpu 1.875 total

This way shows ruby a tad faster:

% time ruby -e "_wordList = open('/usr/share/dict/words', 'r').readlines"
0.21 real 0.18 user 0.03 sys

% time python -c "_wordList = (open('/usr/share/dict/words', 'r').read()).split('\n')"
0.25 real 0.21 user 0.02 sys

but these are not identical since the ruby version now has the “\n” on
the end of each string. IO.read_and_chomp_lines() would be nice to have.

-joe

I tried a benchmark test with ruby 1.6.7:

require "benchmark"

a, b = 0, 0

Benchmark::bmbm do |x|
  x.report('\n'.inspect){a=open("/usr/share/dict/words").read.split('\n')}
  x.report("\n".inspect){b=open("/usr/share/dict/words").read.split("\n")}
end

p a.size
p b.size

produces

Rehearsal -----------------------------------------
"\\n"   1.437500   0.046875   1.484375 (  1.740030)
"\n"    0.664062   0.054688   0.718750 (  0.781546)
-------------------------------- total: 2.203125sec

          user     system      total        real
"\\n"   2.671875   0.132812   2.804688 (  3.055595)
"\n"    0.281250   0.015625   0.296875 (  0.303488)
235881
235881

This shows split('\n') takes about ten times as long in total as split("\n").

– Gotoken

···

At Mon, 23 Sep 2002 23:44:30 +0900, Vincent Foley wrote:

In my experience, Python is faster than Ruby. I made a small script to
help me do crossword puzzles. The first thing I do, is open a
dictionary file and put it in memory:

Ruby: _wordList = open('/home/vince/fr_dict.txt', 'r').read.split('\n')
Python: _wordList = (open('/home/vince/fr_dict.txt', 'r').read()).split('\n')

Vincent Foley vinfoley@iquebec.com wrote in message news:zDFj9.23405$32.432775@weber.videotron.net

In my experience, Python is faster than Ruby. I made a small script to
help me do crossword puzzles. The first thing I do, is open a
dictionary file and put it in memory:

Ruby: _wordList = open('/home/vince/fr_dict.txt', 'r').read.split('\n')
Python: _wordList = (open('/home/vince/fr_dict.txt', 'r').read()).split('\n')

Same thing really. Here are the outputs of time:
Ruby:
ruby/crosswords.rb 1.52s user 0.13s system 68% cpu 2.407 total

Python:
python/crossword.py 0.52s user 0.07s system 31% cpu 1.875 total

Python went over that file much more quickly. Maybe my technique for
opening a file in Ruby could be accelerated, but I think that the speed
of the Python interpreter is just greater than that of Ruby, that’s all.

Vince

Tests need to be timed internally from Start of Action to End of Action.
The tests above include the time to load the interpreter.
Perhaps the Python interpreter loads faster.
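
One way to time only the action itself, leaving interpreter startup out, is to measure inside the script with the standard Benchmark library. This is a sketch under assumptions: the throwaway file stands in for the real dictionary, and the helper names with_sample_file and time_load are hypothetical.

```ruby
require "benchmark"
require "tempfile"

# Build a small throwaway file so the sketch is self-contained;
# in practice this would be the real dictionary or log file.
def with_sample_file
  Tempfile.create("words") do |f|
    f.write("alpha\nbeta\ngamma\n" * 10_000)
    f.flush
    yield f.path
  end
end

# Time only the read-and-split step, not interpreter startup.
def time_load(path)
  words = nil
  elapsed = Benchmark.realtime { words = File.read(path).split("\n") }
  [words, elapsed]
end

with_sample_file do |path|
  words, elapsed = time_load(path)
  printf("%d lines in %.4f s\n", words.size, elapsed)
end
```

Comparing this figure with the total reported by the shell's time gives a rough idea of how much of the difference is startup cost.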

Which version of ruby were you using? I’ve been reading on this list
that ruby 1.7.x has some IO improvements somewhat improving speeds on
Linux and dramatically improving IO speeds on windows.

  • alan
···

On Tue, Sep 24, 2002 at 01:35:32AM +0900, gminick wrote:

Any ideas to speed it up even more ? :slight_smile:

ps. anyway, hi, I’m quite new here, since I subscribed to the
list yesterday :wink:


Alan Chen
Digikata LLC
http://digikata.com

You’re not fair ;> As I said in other post, python also has readlines
method, and it works faster than read/split method.

···

On Tue, Sep 24, 2002 at 02:30:01AM +0900, Joseph McDonald wrote:

% time ruby -e "_wordList = open('/usr/share/dict/words', 'r').readlines"
0.21 real 0.18 user 0.03 sys

% time python -c "_wordList = (open('/usr/share/dict/words', 'r').read()).split('\n')"
0.25 real 0.21 user 0.02 sys

but these are not identical since the ruby version now has the “\n” on
the end of each string. IO.read_and_chomp_lines() would be nice to have.

[ Wojtek gminick Walczak ][ http://gminick.linuxsecurity.pl/ ]
[ gminick (at) hacker.pl ][ gminick (at) underground.org.pl/ ]

GOTO Kentaro wrote:

In my experience, Python is faster than Ruby. I made a small script to
help me do crossword puzzles. The first thing I do, is open a
dictionary file and put it in memory:

Ruby: _wordList = open('/home/vince/fr_dict.txt', 'r').read.split('\n')
Python: _wordList = (open('/home/vince/fr_dict.txt', 'r').read()).split('\n')

I tried a benchmark test with ruby 1.6.7:
[…]

Rehearsal -----------------------------------------
"\\n"   1.437500   0.046875   1.484375 (  1.740030)
"\n"    0.664062   0.054688   0.718750 (  0.781546)
-------------------------------- total: 2.203125sec

          user     system      total        real
"\\n"   2.671875   0.132812   2.804688 (  3.055595)
"\n"    0.281250   0.015625   0.296875 (  0.303488)
235881
235881

This shows split('\n') takes about ten times as long in total as split("\n").

I am confused. Ruby '\n' is not equivalent to Ruby "\n", so a
difference would not be unexpected (though perhaps the size of the difference is).
In Python, "\n" and '\n' are the same and equivalent to Ruby "\n".

It appears to me that the original comparison is null and void, since
the two versions do not do the same thing. Has anybody looked closer at the output
of the Python vs the Ruby versions, like running a diff? It sure looks
different to me…
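
A small sketch of the quoting difference being discussed. The constant names are hypothetical. Note the assumption in the last line: it shows how current Ruby versions behave, where a string separator is taken literally; the 1.6.x interpreter benchmarked in this thread evidently behaved differently, since both variants there returned the same number of elements.

```ruby
SINGLE = '\n'   # two characters: a backslash and the letter n
DOUBLE = "\n"   # one character: a newline
TEXT   = "aaa\nbbb\nccc\n"

p SINGLE.size          # 2
p DOUBLE.size          # 1
p SINGLE == "\\n"      # true
p TEXT.split(DOUBLE)   # ["aaa", "bbb", "ccc"]
p TEXT.split(SINGLE)   # current Ruby: separator not found, text is not split
```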

···

At Mon, 23 Sep 2002 23:44:30 +0900, > Vincent Foley wrote:


([ Kent Dahl ]/)_ ~ [ http://www.stud.ntnu.no/~kentda/ ]/~
))_student
/(( _d L b_/ NTNU - graduate engineering - 5. year )
( __õ|õ// ) )Industrial economics and technological management(
_
/ö____/ (_engineering.discipline=Computer::Technology)

[…]

% cat rb.rb
blah = open('/etc/termcap-BSD', 'r').readlines
% cat ryby.rb
blah = open('/etc/termcap-BSD', 'r').read.split('\n')

So, I’m assuming that py.py and rb.rb should work faster than
peigrek.py and ryby.rb, because they’re using readlines instead
of read and split methods. Let’s take a look at the results:

% time python py.py; time python peigrek.py; time ruby rb.rb; time ruby ryby.rb
python py.py 0.23s user 0.05s system 132% cpu 0.212 total
python peigrek.py 0.23s user 0.06s system 127% cpu 0.227 total

py.py is faster than peigrek.py - good

ruby rb.rb 0.74s user 0.06s system 108% cpu 0.735 total
ruby ryby.rb 0.84s user 0.08s system 109% cpu 0.841 total

rb.rb is faster than ryby.rb

OK, I downloaded and installed ruby1.7.3, and… it’s faster
than older versions of ruby, but there’s still a large gap
between the speeds of python and ruby (py is faster).
A little test: ruby1.7.3, blah = open('/etc/termcap-BSD', 'r').readlines

% time ruby rb.rb
ruby rb.rb 0.40s user 0.09s system 97% cpu 0.500 total

ps. for speed maniacs (not me;)): it’s time to switch to ruby1.7.3

···

On Tue, Sep 24, 2002 at 01:35:32AM +0900, gminick wrote:


[ Wojtek gminick Walczak ][ http://gminick.linuxsecurity.pl/ ]
[ gminick (at) hacker.pl ][ gminick (at) underground.org.pl/ ]

Any ideas to speed it up even more ? :slight_smile:

ps. anyway, hi, I’m quite new here, since I subscribed to the
list yesterday :wink:

Which version of ruby were you using?

ruby 1.6.7 (2002-03-01) [i586-linux]

I’ve been reading on this list
that ruby 1.7.x has some IO improvements somewhat improving speeds on
Linux and dramatically improving IO speeds on windows.

OK, I’m downloading
http://mirrors.sunsite.dk/ruby/snapshots/ruby-1.7-today.tar.gz

Isn’t www.ruby-lang.org a little out of date ?
Found that on site:

 * Stable snapshot is available. (2001-01-18)

2001-01-18 ???

···

On Tue, Sep 24, 2002 at 02:30:01AM +0900, Alan Chen wrote:

On Tue, Sep 24, 2002 at 01:35:32AM +0900, gminick wrote:


[ Wojtek gminick Walczak ][ http://gminick.linuxsecurity.pl/ ]
[ gminick (at) hacker.pl ][ gminick (at) underground.org.pl/ ]

In article 20020923195806.GA1278@hannibal, gminick wrote:

You’re not fair ;> As I said in other post, python also has readlines
method, and it works faster than read/split method.

readlines does not do what I want. It gives an array that looks like
this:
["aaa\n", "bbb\n", "ccc\n"]

while read.split('\n') gives:
["aaa", "bbb", "ccc"]

which is exactly what I want.
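
For reference, a sketch of two ways to get the newline-free array with readlines. The helper names are hypothetical. The chomp: true keyword is an assumption about the reader's Ruby: it exists in Ruby 2.4 and later, not in the 1.6/1.7 interpreters discussed in this thread; the second helper avoids it.

```ruby
require "tempfile"

# Reads a file into an array of lines without trailing newlines.
# `chomp: true` needs Ruby 2.4 or newer.
def read_chomped_lines(path)
  File.readlines(path, chomp: true)
end

# Equivalent result without the keyword argument:
# read the lines, then strip the line ending from each in place.
def read_chomped_lines_compat(path)
  File.readlines(path).each { |line| line.chomp! }
end

Tempfile.create("lines") do |f|
  f.write("aaa\nbbb\nccc\n")
  f.flush
  p read_chomped_lines(f.path)          # ["aaa", "bbb", "ccc"]
  p read_chomped_lines_compat(f.path)   # ["aaa", "bbb", "ccc"]
end
```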

···

Vincent Foley-Bourgon
Email: vinfoley@iquebec.com
Homepage: http://darkhost.mine.nu:81

Kent Dahl wrote:

I am confused. Ruby '\n' is not equivalent to Ruby "\n", so a
difference would not be unexpected (though perhaps the size of the difference is).
In Python, "\n" and '\n' are the same and equivalent to Ruby "\n".

It appears to me that the original comparison is null and void, since
the two versions do not do the same thing. Has anybody looked closer at the output
of the Python vs the Ruby versions, like running a diff? It sure looks
different to me…

Sorry, I had a brainfart on the latter. The output is the same.
But that is even more confusing. I was expecting:
'yabba\dabba\none'.split('\n') == ['yabba\dabba', 'one']

Is there some extra conversion being done to the delimiter inside split?

···


([ Kent Dahl ]/)_ ~ [ http://www.stud.ntnu.no/~kentda/ ]/~
))_student
/(( _d L b_/ NTNU - graduate engineering - 5. year )
( __õ|õ// ) )Industrial economics and technological management(
_
/ö____/ (_engineering.discipline=Computer::Technology)

As you said, '\n' != "\n", and '\n' == "\\n" indeed. And also

p Regexp.new('\n') == eval("/\\n/") #=> true
p Regexp.new("\n") == eval("/\n/") #=> true

Try below if you are confused:
p "/\\n/"
p "/\n/"

By the way, in 1.7.x, String#split behavior was changed about a month ago:

% ruby17 -e 'p "a\nb".split("\\n")'
-e:1: warning: string pattern instead of regexp; metacharacters no longer effective
["a\nb"]
% ruby17 -e 'p "a\nb".split("\n")'
-e:1: warning: string pattern instead of regexp; metacharacters no longer effective
["a", "b"]

In the current version, 1.7.3, split(sep) behaves roughly as follows:

def split(sep)
  case sep
  when String
    sep = if sep.size > 1
            Regexp.new(Regexp.escape(sep))
          else
            Regexp.new(sep)
          end
  when Regexp
  else
    raise TypeError, "wrong argument type #{sep.type} (expected Regexp)"
  end

  # And do real split action with sep

end

– Gotoken

···

At Tue, 24 Sep 2002 03:58:45 +0900, Kent Dahl wrote:

I am confused. Ruby ‘\n’ is not equivalent with Ruby “\n”, so a
difference would not be unexpected. (The size of the difference perhaps)

Alan Chen wrote:

Which version of ruby were you using? I’ve been reading on this list
that ruby 1.7.x has some IO improvements somewhat improving speeds on
Linux and dramatically improving IO speeds on windows.

  • alan

I did two similar (as close as you can for Python and Ruby) scripts that
searched for keywords in a file and then wrote the lines found to
another file. The Python one is still noticeably faster.

  • BOB -

A little test: ruby1.7.3, blah = open('/etc/termcap-BSD', 'r').readlines

% time ruby rb.rb
ruby rb.rb 0.40s user 0.09s system 97% cpu 0.500 total

ps. for speed maniacs (not me;)): it’s time to switch to ruby1.7.3


[ Wojtek gminick Walczak ][ http://gminick.linuxsecurity.pl/ ]
[ gminick (at) hacker.pl ][ gminick (at) underground.org.pl/ ]

I did not originally mean this thread to be for speed maniacs :slight_smile:
My question was how to speed up processing of large files (400-500+ MBytes), not ones like
/etc/termcap*
It is about the minutes you save, not about 0.1 sec :wink:

Even so, I thank all who helped me.

Tom

···
