Speed up suggestions

Hello,

I have written a simple script to analyze a log file and (just for fun) I have written
exactly the same thing in Python to see the difference, and…
Python is almost twice as fast doing the same job :-| (???)
You can see the attached files for the sources.
Environment: P4 1.8 GHz, 256 MB, WinXP Pro. Python 2.2.1, Ruby 1.7.2-4 (the Pragmatic distribution).
The analyzed file is about 420 MB; Python processes it in about 60 sec and Ruby in about 115
sec.
Do you have any suggestions on how to speed up the Ruby code?

Regards

Tom

parse.py (507 Bytes)

parse.rb (352 Bytes)

···


Sorry, I don’t know python, but I am trying to run your script
and I get the following error:

Traceback (most recent call last):
  File "parse.py", line 5, in ?
    outf = file( "found.log", "w" )
NameError: name 'file' is not defined

···

On Sat, Sep 21, 2002 at 01:22:34AM +0900, Tomas Brixi wrote:


Jim Freeze

Programming Ruby
def initialize; fun; end
A language with class

your script:

  01  outf = File.new( "found.log", "w" )
  02  r = Regexp.new( "xxx\xxxxxxxx", Regexp::IGNORECASE )
  03  t = Time.new
  04  puts t
  05  File.new( "WEB.log" ).each_line do
  06    |line|
  07    next if line.index( "#" ) == 0
  08    pom = line.split( "\t" )
  09    if pom[1] =~ r
  10      outf.write( line + "\n" )
  11    end
  12  end
  13  te = Time.new
  14  puts te
  15  print "Time: "
  16  puts (te - t).to_s
  17  outf.close()

If on line 07 you are checking whether the first character is a # (and
therefore skipping the line), then do this…

  07  next if line[0]==35

…where 35 is the ASCII value of '#'. On line 10, change to…

  10  outf.puts(line)

…where puts writes the line with a trailing newline. In your code you add two objects
(strings) together, which creates a new string each time and may affect speed.

Also, on lines 08 and 09 you split the line on tabs, which creates an
array. It may be faster (albeit uglier) to do this:

  08  t1 = line.index("\t")+1; pom = line[t1..(line.index("\t", t1)-1)]
  09  if pom =~ r
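
The suggestions above can be sketched as a small filter. This is a sketch under assumptions, not the original parse.rb: `second_field` and `filter_log` are hypothetical helper names, and the input is assumed to be tab-separated with the field of interest in second position. Note that on Ruby 1.9 and later `line[0]` returns a one-character String rather than an Integer, so the sketch uses `start_with?` instead of comparing against 35.

```ruby
require "stringio"

# Hypothetical helper: return the second tab-separated field of +line+
# without building an Array, or nil if the line contains no tab.
def second_field(line)
  first_tab = line.index("\t")
  return nil unless first_tab
  start = first_tab + 1
  # If there is no second tab, the field runs to the end of the line.
  stop = line.index("\t", start) || line.length
  line[start...stop]
end

# Hypothetical helper: copy every non-comment line whose second field
# matches +pattern+ from +input+ to +output+.
def filter_log(input, output, pattern)
  input.each_line do |line|
    next if line.start_with?("#")   # skip comment lines cheaply
    field = second_field(line)
    # each_line keeps the trailing newline, so puts adds nothing extra.
    output.puts(line) if field && field =~ pattern
  end
end

# Usage with in-memory data standing in for WEB.log and found.log:
log = StringIO.new("# header\nrow1\tFOO bar\tx\nrow2\tnothing\ty\n")
out = StringIO.new
filter_log(log, out, /foo/i)
print out.string
```

With real files, the two StringIO objects would be replaced by File objects opened for reading and writing.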

-rich


Well… I don’t have a WEB.log file as weird as yours so I had to fake
it. :)

First, I got an error (ruby 1.6.7) with your first regex. Changing it
to /…/i fixed that. Not sure why… Fixing that and running w/
-rprofile shows:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 50.00     0.03      0.03        1    31.25    54.69  IO#each_line
 25.00     0.05      0.02       46     0.34     0.34  String#index
 12.50     0.05      0.01       41     0.19     0.19  String#split
 12.50     0.06      0.01        7     1.12     1.12  IO#write

There are several things you can do to speed up the ruby code, and
probably the python code… These are really general ideas:

  1. Don’t index the whole line in the case of it NOT being a comment.
  2. Don’t split the whole line if you only want the second field.

I had to tweak my script to work on a different file, splitting on
colon and such. But here are my changes:

outside the loop:

  comment_re = /^#/

inside the loop:

  next if line =~ comment_re

  if line =~ /[^:]+:([^:]+):/ then
    pom = $1
    if pom =~ r
      outf.write( line + "\n" )
    end
  end

This speeds up my run by more than half for an admittedly small and
unscientific sample. But the changes will speed up python as well… I
suggest you look at the “big language shootout” at
http://www.bagley.org/~doug/shootout/ for some speed differences
between python and ruby… python is only moderately faster than perl or
ruby… they all have their own pros and cons (for example, ruby’s method
dispatch blows away python’s and perl’s, but ruby is not as fast at
numerics).
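
To check which way of extracting the second field is actually faster on a given Ruby, a small benchmark helps. The following is a sketch only: the sample line, the iteration count, and the helper names `via_split`, `via_regex`, and `via_index` are made up for illustration, and the timings will differ between machines and Ruby versions.

```ruby
require "benchmark"

# Made-up sample line; the real WEB.log format is not shown in the thread.
LINE = "2002-09-20 10:00:00\tGET /index.html\t200\t1043\n"

# Build the whole Array, then pick one element.
def via_split(line)
  line.split("\t")[1]
end

# Capture only the second field with a regex.
def via_regex(line)
  line[/\A[^\t]*\t([^\t]*)/, 1]
end

# Locate the two tabs by hand and slice between them.
# Assumes the line contains at least two tabs.
def via_index(line)
  start = line.index("\t") + 1
  line[start...line.index("\t", start)]
end

n = 20_000
Benchmark.bm(6) do |bm|
  bm.report("split") { n.times { via_split(LINE) } }
  bm.report("regex") { n.times { via_regex(LINE) } }
  bm.report("index") { n.times { via_index(LINE) } }
end
```

All three helpers return the same field, so only the cost of getting there differs.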


Tomas Brixi wrote:

Have some suggestion how to speed the ruby code?

You can check for the comment by
writing
  next if line[0] == ?#

You can write
  puts line
instead of
  write line+"\n"
(which copies the string every time unnecessarily…)

Regards, Christian

I don’t know how to find the attachments.

Have you tried the profiler?
ruby -rprofile file.rb

In other numbers, when I searched for any 6-letter word, here are the
time outputs:

Ruby:
ruby/crosswords.rb 2.38s user 0.12s system 21% cpu 11.486 total

Python:
python/crossword.py 1.65s user 0.14s system 15% cpu 11.364 total

I do not mean to diss Ruby here, but if speed is your concern, maybe
you would be better off with Python.

Vince

···

Vincent Foley-Bourgon
Email: vinfoley@iquebec.com
Homepage: http://darkhost.mine.nu:81

WOW! I just found something out that is pretty interesting… too bad I
don’t understand why yet… By sheer luck I noticed in my email that I
didn’t extract the above regex to be instantiated outside the loop. So
I did it. And it slowed down, a lot! here’s the profile:

BEFORE:

time = 0.03045900 sec
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 50.00     0.01      0.01        1     7.81    15.62  IO#each_line
 50.00     0.02      0.01       86     0.09     0.09  String#=~

AFTER:

time = 0.04492900 sec
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 66.67     0.03      0.03        1    31.25    46.88  IO#each_line
 33.33     0.05      0.02      127     0.12     0.12  String#=~

So I made a new version and got rid of all the regex instances (diff
follows) and the profile looks like:

time = 0.00842100 sec
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
100.00     0.01      0.01        1     7.81     7.81  IO#each_line

(we are now so small and fast (the file should be bigger for better
sampling) that the profile has no relevant data… subsequent runs show
the sampling jumping all over, but the total time is the same)

<528> diff parse[34].rb
4,6c4,6
< r = /xxx\xxxxxxxx/i
< second_re = /[^:]+:([^:]+):/
< comment_re = /^#/


10c10
< next if line =~ comment_re
---
> next if line =~ /^#/
12c12
< if line =~ second_re then
---
> if line =~ /[^:]+:([^:]+):/ then
14c14
< if pom =~ r
---
> if pom =~ /xxx\\xxxxxxxx/i

NOW, can someone tell me why regexen in place are much faster than
instantiated regexen?

Please, show the code ;)

···

On Tue, Sep 24, 2002 at 12:22:56AM +0900, Vincent Foley wrote:

In other numbers, when I searched for any 6-letter word, here are the
time outputs:

[ Wojtek gminick Walczak ][ http://gminick.linuxsecurity.pl/ ]
[ gminick (at) hacker.pl ][ gminick (at) underground.org.pl/ ]

Hi,

···

At Sat, 21 Sep 2002 05:18:10 +0900, Ryan Davis wrote:

NOW, can someone tell me why regexen in place are much faster than
instantiated regexen?

A literal other than a String doesn’t create a new instance each time it
is evaluated. OTOH, a local variable needs one more rb_eval() recursion
in the current implementation.
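
The first half of this explanation can be observed from Ruby code. A small sketch, with made-up method names: a regex literal without interpolation evaluates to the same Regexp object every time, whereas Regexp.new builds a fresh object on each call. It does not measure the extra rb_eval() step for local variables, which is specific to the interpreter of that era.

```ruby
# A regex literal (no #{} interpolation) is compiled once; every
# evaluation of this method body returns that same object.
def literal_regex
  /xxx/i
end

# Regexp.new compiles a new Regexp object on every call.
def constructed_regex
  Regexp.new("xxx", Regexp::IGNORECASE)
end

same_literal     = literal_regex.equal?(literal_regex)
same_constructed = constructed_regex.equal?(constructed_regex)

puts "literal reused:     #{same_literal}"
puts "constructed reused: #{same_constructed}"
```

Creating the Regexp once outside the loop and keeping it in a variable also avoids the repeated construction, so the slowdown reported in this thread comes from the variable lookup rather than from object creation.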


Nobu Nakada

In article 20020923184022.GA1056@hannibal, gminick wrote:

In other numbers, when I searched for any 6-letter word, here are the
time outputs:
Please, show the code ;)

Ruby (that code has a serious bug though, see my thread, about
String#gsub!):

#!/usr/bin/env ruby

wordList = open('/home/vince/fr_dict.txt', 'r').read.split('\n')
repl = {
    '_'   => '[a-ü]',
    '(v)' => '[aâàéeèêëiîïoôöuûùüy]', 
    '(c)' => '[bcdfghjklmnpqrstvwxyz]',
    '(e)' => '[éeèêë]'
}    

while true
  print '> '
  reg = gets.chomp
  repl.each_key { |token|
    reg.gsub!(token, repl[token])
  }
  reg = '^' << reg << '$'
  puts reg
  for word in wordList
    puts word if word =~ reg
  end
end

Python (this code works perfectly. Doesn’t have the ruby version bug)

#!/usr/bin/env python2.2

import re

_wordList = (open('/home/vince/fr_dict.txt', 'r').read()).split('\n')
_repl = {
    '_': '[a-ü]',
    '(v)': '[aâàéeèêëiîïoôöuûùüy]', 
    '(c)': '[bcdfghjklmnpqrstvwxyz]',
    '(e)': '[éeèêë]'
}    


while 1:
    reg = raw_input('> ')
    for token in _repl:
        reg = reg.replace(token, _repl[token])
    reg = "%s%s%s" % ('^', reg, '$')
    reg = re.compile(reg, re.I)
    for word in _wordList:
        if reg.search(word):
            print word

There ya go. If you have any information about why I have that bug in
the Ruby version, please let me know ASAP ;)

Vincent

···

On Tue, Sep 24, 2002 at 12:22:56AM +0900, Vincent Foley wrote:

Vincent Foley-Bourgon
Email: vinfoley@iquebec.com
Homepage: http://darkhost.mine.nu:81

Can somebody tell me why the plural of “regex” is “regexen”?

Gavin

···

----- Original Message -----
From: “Ryan Davis” ryand-ruby@zenspider.com

NOW, can someone tell me why regexen in place are much faster than
instantiated regexen?


You seem to be mixing up Strings and Regexps. Here I have made the keys
of repl Regexps and made reg a real Regexp. Does this do what you want?

#!/usr/bin/env ruby

repl = {
    /_/     => '[a-ü]',
    /\(v\)/ => '[aâàéeèêëiîïoôöuûùüy]',
    /\(c\)/ => '[bcdfghjklmnpqrstvwxyz]',
    /\(e\)/ => '[éeèêë]'
}

loop do
  print '> '
  reg = gets.chomp
  repl.each_key { |token|
    reg.gsub!(token, repl[token])
  }
  reg = /^#{reg}$/
  puts reg

  # file content match against reg removed...

end

There ya go. If you have any information about why I have that bug in
the Ruby version, please let me know ASAP ;)

Ruby seems to work fine ;) It is not the same as Python…
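
One way to avoid the String/Regexp mix-up entirely is to let Ruby escape the tokens. A minimal sketch, assuming the same replacement table as above: `build_pattern` is a hypothetical helper, and `Regexp.union` is used here to escape literal tokens such as '(v)' so that the parentheses are not treated as a group.

```ruby
# Replacement table from the thread: token => character class.
REPL = {
  "_"   => "[a-ü]",
  "(v)" => "[aâàéeèêëiîïoôöuûùüy]",
  "(c)" => "[bcdfghjklmnpqrstvwxyz]",
  "(e)" => "[éeèêë]"
}

# Regexp.union escapes String arguments, so "(v)" matches the three
# literal characters and not a capture group around "v".
TOKEN_RE = Regexp.union(REPL.keys)

# Hypothetical helper: turn a template such as "(c)(v)_" into an
# anchored Regexp that matches whole words only.
def build_pattern(template)
  body = template.gsub(TOKEN_RE) { |token| REPL[token] }
  Regexp.new("\\A#{body}\\z")
end

# Tiny stand-in for the dictionary file used in the thread.
words = %w[bat chat été the]
pattern = build_pattern("(c)(v)_")
puts words.grep(pattern)
```

Doing all substitutions in a single gsub pass also means the order of the hash keys no longer matters.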

Hope this helps,

Mike

···

In article irJj9.27338$32.485317@weber.videotron.net, Vincent Foley wrote:


mike@stok.co.uk | The “`Stok’ disclaimers” apply.
http://www.stok.co.uk/~mike/ | GPG PGP Key 1024D/059913DA
mike@exegenix.com | Fingerprint 0570 71CD 6790 7C28 3D60
http://www.exegenix.com/ | 75D2 9EC4 C1C0 0599 13DA

I’m not even sure what your code does, but try this:

repl = {
    '_'   => '[a-ü]',
    '(v)' => '[aâàéeèêëiîïoôöuûùüy]',
    '(c)' => '[bcdfghjklmnpqrstvwxyz]',
    '(e)' => '[éeèêë]'
}

and the rest of your code… ;)


[ Wojtek gminick Walczak ][ http://gminick.linuxsecurity.pl/ ]
[ gminick (at) hacker.pl ][ gminick (at) underground.org.pl/ ]

no.

···

On Monday, September 23, 2002, at 02:08 PM, Gavin Sinclair wrote:

----- Original Message -----
From: “Ryan Davis” ryand-ruby@zenspider.com

NOW, can someone tell me why regexen in place are much faster than
instantiated regexen?

Can somebody tell me why the plural of “regex” is “regexen”?

Gavin Sinclair wrote:

From: “Ryan Davis” ryand-ruby@zenspider.com

NOW, can someone tell me why regexen in place are much faster than
instantiated regexen?

Can somebody tell me why the plural of “regex” is “regexen”?

German influence?

···

----- Original Message -----


Giuseppe “Oblomov” Bilotta

“E la storia dell’umanità, babbo?”
“Ma niente: prima si fanno delle cazzate,
poi si studia che cazzate si sono fatte”
(Altan)
(“And what about the history of the human race, dad?”
“Oh, nothing special: first they make some foolish things,
then you study which foolish things have been made”)

my guess is this:
plural of ox is oxen
so obviously plural of computer is boxen
and thus that of regex must be regexen.

pure deductive logic ;)

···

Gavin Sinclair (gsinclair@soyabean.com.au) wrote:

----- Original Message -----
From: “Ryan Davis” ryand-ruby@zenspider.com

NOW, can someone tell me why regexen in place are much faster than
instantiated regexen?

Can somebody tell me why the plural of “regex” is “regexen”?

Gavin

In article 52995333-CF39-11D6-8A13-0030657CEB62@zenspider.com,
Ryan Davis wrote:

From: “Ryan Davis” ryand-ruby@zenspider.com

NOW, can someone tell me why regexen in place are much faster than
instantiated regexen?

Can somebody tell me why the plural of “regex” is “regexen”?

no.

Oh, go on! One reference is:

http://www.tuxedo.org/~esr/jargon/html/Overgeneralization.html

Hope this helps,

Mike

···

On Monday, September 23, 2002, at 02:08 PM, Gavin Sinclair wrote:

----- Original Message -----

