Optimization question

Hello - A friend and I have been working on a Ruby implementation of a
Bayesian spam filter as described in Paul Graham’s “A Plan for Spam”.
It’s fully functional, but I’ve been trying to squeeze more performance out
of it, as it’s quite slow ATM (15 minutes to run across 20 megs of email).
By using profile and rbprof, I’ve determined that our tokenizer method is
the main source of slowness. After some careful benchmarking, I’ve found this
to be the problem.

# This is run about a million times
h = Hash.new(0)
data.scan(iptok).each do |tok|
  h[tok] += 1
end

At first, I thought I could do something like this:

# This is run about a million times
h = Hash.new(0)
data.scan(iptok).each do |tok|
  h[tok].succ
end

But I realized that it doesn’t modify the final value; however, I did notice
that it ran twice as fast as the former example. So, my question is, does the
assignment in the first example really have that much overhead? If so, is there
any way to do the first example using Inline::C or something similar?

Thanks in advance,
Travis
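For what it’s worth, the reason the `succ` version runs faster yet changes nothing is that reading a missing key only returns the hash’s default without storing it, and `Integer#succ` returns a new number rather than updating anything in place. A minimal sketch:

```ruby
h = Hash.new(0)

# Reading a missing key returns the default (0) but does not store it,
# and Integer#succ returns a new Integer rather than mutating the value:
h["spam"].succ                        # evaluates to 1, but the result is discarded
raise "hash was written to" unless h.empty?

# The assignment form both reads and writes:
h["spam"] += 1                        # sugar for h["spam"] = h["spam"] + 1
raise "unexpected" unless h == { "spam" => 1 }
```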

# This is run about a million times
h = Hash.new(0)
data.scan(iptok).each do |tok|
  h[tok] += 1
end

At first, I thought I could do something like this:

# This is ran about a million times
h = Hash.new(0)
data.scan(iptok).each do |tok|
  h[tok].succ
end

But I realized that it doesn't modify the final value; however, I did notice
that it ran twice as fast as the former example.

Unfortunately it doesn't actually put anything into the hash either, so it's
not a fair comparison. Have you tried:

   h[tok] = 1

and see how that performs? Also

   h[tok] = h[tok]

The first case will show time for inserting elements into hash, and the
second case both retrieving and inserting elements (they will be inserted
with value 'nil'). I suspect that the '+1' isn't actually going to be the
main source of the problem.
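Brian’s suggested micro-benchmarks can be run with the standard `benchmark` library; a sketch (the token list here is invented for illustration, standing in for `data.scan(iptok)`):

```ruby
require 'benchmark'

# Invented sample tokens standing in for data.scan(iptok):
TOKENS = %w{a b c a g d e f a g h g i j e e k l m g d d a b c}
N = 5_000

Benchmark.bm(16) do |bm|
  # read + write per token (the original code):
  bm.report('h[tok] += 1')     { N.times { h = Hash.new(0); TOKENS.each { |t| h[t] += 1 } } }
  # write only:
  bm.report('h[tok] = 1')      { N.times { h = Hash.new(0); TOKENS.each { |t| h[t] = 1 } } }
  # read + write, no arithmetic:
  bm.report('h[tok] = h[tok]') { N.times { h = Hash.new(0); TOKENS.each { |t| h[t] = h[t] } } }
end
```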

is there
any way to do the first example using Inline::C or something similar?

See 'extending Ruby' in the Pickaxe.

Regards,

Brian.

···

On Thu, Feb 20, 2003 at 07:53:14AM +0900, Travis Whitton wrote:

  1. Are the tokens strings?
  2. Are you running Linux?

You might be able to use glib hashes to do this in C and then translate
those to Ruby hashes. Just a thought. I’m putting together some code to
see if I can figure it out.

···

On Thu, Feb 20, 2003 at 07:53:14AM +0900, Travis Whitton wrote:

# This is run about a million times
h = Hash.new(0)
data.scan(iptok).each do |tok|
  h[tok] += 1
end


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

I wrote a C version of your code which uses glib. I’m attaching it as
Filter.c, as well as the extconf.rb. If you have glib, you can just do:

ruby extconf.rb
make

This creates Filter.so. There seems to be a performance improvement, but
it’s not as large as I might have hoped. Here are 3 tests:

<----- t1.rb ---- Your original code ------------------->
5_000.times do
  h = Hash.new(0)
  %w{a b c a g d e f a g h g i j e e k l m g d d a b c}.each do |tok|
    h[tok] += 1
  end
end
<----- t2.rb ---- Replaced Array#each by a for loop. --->
5_000.times do
  h = Hash.new(0)
  for tok in %w{a b c a g d e f a g h g i j e e k l m g d d a b c}
    h[tok] += 1
  end
end
<----- t3.rb ---- Uses the C version. ------------------->
require "Filter"

5_000.times do
  h = Filter.process %w{a b c a g d e f a g h g i j e e k l m g d d a b c}
end
<----------------------------->
And here are some results:

dcarrera $ time ruby t1.rb

real 0m3.205s
user 0m3.130s
sys 0m0.010s
dcarrera $ time ruby t2.rb

real 0m3.006s
user 0m2.860s
sys 0m0.030s
dcarrera $ time ruby t3.rb

real 0m1.677s
user 0m1.490s
sys 0m0.090s

Who knows, this might behave better with your data.

···

On Thu, Feb 20, 2003 at 07:53:14AM +0900, Travis Whitton wrote:

Hello - A friend and I have been working on a Ruby implementation of a
Bayesian spam filter as described in Paul Graham’s “A Plan for Spam”.
It’s fully functional, but I’ve been trying to squeeze more performance out
of it, as it’s quite slow ATM (15 minutes to run across 20 megs of email).
By using profile and rbprof, I’ve determined that our tokenizer method is
the main source of slowness. After some careful benchmarking, I’ve found this
to be the problem.

# This is run about a million times
h = Hash.new(0)
data.scan(iptok).each do |tok|
  h[tok] += 1
end

At first, I thought I could do something like this:

# This is run about a million times
h = Hash.new(0)
data.scan(iptok).each do |tok|
  h[tok].succ
end

But I realized that it doesn’t modify the final value; however, I did notice
that it ran twice as fast as the former example. So, my question is, does the
assignment in the first example really have that much overhead? If so, is there
any way to do the first example using Inline::C or something similar?

Thanks in advance,
Travis


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

Hello Travis,

Thursday, February 20, 2003, 1:53:14 AM, you wrote:

# This is run about a million times
h = Hash.new(0)
data.scan(iptok).each do |tok|
  h[tok] += 1
end

Have you heard about the “sort|uniq” idiom? You can use
code like this:

sorted_tokens = data.scan(iptok).sort

# The following code pushes pairs of [token, count] into arr
arr = []
prev_token = nil
count = 0
sorted_tokens.each do |tok|
  if tok == prev_token
    count += 1
  else
    arr << prev_token << count if prev_token  # skip the initial nil
    prev_token = tok
    count = 1
  end
end
arr << prev_token << count if count > 0

h = Hash[*arr]
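The same sort-then-count idea as a self-contained sketch (sample tokens invented; note that `arr` must start as an empty array, and the first iteration must not push the initial `nil` sentinel):

```ruby
tokens = %w[b a c a b a]          # invented stand-in for data.scan(iptok)

arr = []
prev, count = nil, 0
tokens.sort.each do |tok|
  if tok == prev
    count += 1
  else
    arr << prev << count if prev  # don't push the leading nil sentinel
    prev, count = tok, 1
  end
end
arr << prev << count if count > 0

h = Hash[*arr]
# h => {"a"=>3, "b"=>2, "c"=>1}
```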

···


Best regards,
Bulat mailto:bulatz@integ.ru

(I’m making a big assumption here, so if I’m wrong, please kindly
disregard)

Since I’ve lost the original email, I’ll just say that there’s a
more efficient way to do what you’re doing. Keep a real database of
tokens, and use mail to update it, rather than re-tokenizing the whole
email set.

See: http://glueless.net/bayes/ (not by me)

My .rbayes-corpus.db is 10MB, and I started using it in Sept.

···

Daniel Carrera (dcarrera@math.umd.edu) wrote:

On Thu, Feb 20, 2003 at 07:53:14AM +0900, Travis Whitton wrote:

# This is run about a million times
h = Hash.new(0)
data.scan(iptok).each do |tok|
  h[tok] += 1
end

  1. Are the tokens strings?
  2. Are you running Linux?

You might be able to use glib hashes to do this in C and then translate
those to Ruby hashes. Just a thought. I’m putting together some code to
see if I can figure it out.


Eric Hodel - drbrain@segment7.net - http://segment7.net
All messages signed with fingerprint:
FEC2 57F1 D465 EB15 5D6E 7C11 332A 551C 796C 9F04

Sorry, forgot to attach the files.

Filter.c (1.21 KB)

extconf.rb (139 Bytes)

···

On Thu, Feb 20, 2003 at 10:59:48AM +0900, Daniel Carrera wrote:

I wrote a C version of your code which uses glib. I’m attaching it as
Filter.c, as well as the extconf.rb. If you have glib, you can just do:

ruby extconf.rb
make

This creates Filter.so. There seems to be a performance improvement, but
it’s not as large as I might have hoped. Here are 3 tests:

<----- t1.rb ---- Your original code ------------------->
5_000.times do
  h = Hash.new(0)
  %w{a b c a g d e f a g h g i j e e k l m g d d a b c}.each do |tok|
    h[tok] += 1
  end
end
<----- t2.rb ---- Replaced Array#each by a for loop. --->
5_000.times do
  h = Hash.new(0)
  for tok in %w{a b c a g d e f a g h g i j e e k l m g d d a b c}
    h[tok] += 1
  end
end
<----- t3.rb ---- Uses the C version. ------------------->
require "Filter"

5_000.times do
  h = Filter.process %w{a b c a g d e f a g h g i j e e k l m g d d a b c}
end
<----------------------------->
And here are some results:

dcarrera $ time ruby t1.rb

real 0m3.205s
user 0m3.130s
sys 0m0.010s
dcarrera $ time ruby t2.rb

real 0m3.006s
user 0m2.860s
sys 0m0.030s
dcarrera $ time ruby t3.rb

real 0m1.677s
user 0m1.490s
sys 0m0.090s

Who knows, this might behave better with your data.

On Thu, Feb 20, 2003 at 07:53:14AM +0900, Travis Whitton wrote:

Hello - A friend and I have been working on a Ruby implementation of a
Bayesian spam filter as described in Paul Graham’s “A Plan for Spam”.
It’s fully functional, but I’ve been trying to squeeze more performance out
of it, as it’s quite slow ATM (15 minutes to run across 20 megs of email).
By using profile and rbprof, I’ve determined that our tokenizer method is
the main source of slowness. After some careful benchmarking, I’ve found this
to be the problem.

# This is run about a million times
h = Hash.new(0)
data.scan(iptok).each do |tok|
  h[tok] += 1
end

At first, I thought I could do something like this:

# This is run about a million times
h = Hash.new(0)
data.scan(iptok).each do |tok|
  h[tok].succ
end

But I realized that it doesn’t modify the final value; however, I did notice
that it ran twice as fast as the former example. So, my question is, does the
assignment in the first example really have that much overhead? If so, is there
any way to do the first example using Inline::C or something similar?

Thanks in advance,
Travis


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137



  1. Are the tokens strings?

Yes, my program goes through two files. One consists of only non-spam
messages, and the other is only spam messages. It goes through each file
line by line and divides each line into tokens of interesting data. Here are
the relevant portions of my tokenizer method.

def tokenizer(fh)
  hash = Hash.new(0)
  ipaddr = '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
  token = "[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]"
  iptok = Regexp.compile("#{token}|#{ipaddr}")

  fh.each do |data|
    data.chomp!
    # do a number of string substitutions which use negligible amounts of time
    data.scan(iptok).each do |tok|
      hash[tok] = hash[tok].succ
    end
  end
  hash
end
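As a quick sanity check of the regexps above (with the dots in the IP pattern escaped, which the archive formatting appears to have eaten), counting tokens from an invented sample line:

```ruby
# Same patterns as in the tokenizer, dots in the IP pattern escaped:
ipaddr = '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
token  = "[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]"
iptok  = Regexp.compile("#{token}|#{ipaddr}")

# Invented sample line; count tokens the same way the tokenizer does:
h = Hash.new(0)
"Visit 10.0.0.1 now for free viagra viagra".scan(iptok).each { |tok| h[tok] += 1 }
# h => {"Visit"=>1, "10.0.0.1"=>1, "now"=>1, "for"=>1, "free"=>1, "viagra"=>2}
```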

The messages are standard Unix messages (mbox format?) like so:

From MAILER-DAEMON Mon Sep 23 22:32:37 2002
Date: 23 Sep 2002 22:32:37 -0400
From: Mail System Internal Data MAILER-DAEMON@grub.ath.cx
Subject: DON’T DELETE THIS MESSAGE – FOLDER INTERNAL DATA
Message-ID: 1032834757@grub.atlantic.net
X-IMAP: 1032834509 0000000272
Status: RO

This text is part of the internal format of your mail folder, and is not
a real message. It is created automatically by the mail system software.
If deleted, important folder data will be lost, and it will be re-created
with the data reset to initial values.

  2. Are you running Linux?

Yes, and I only intend for this program to run under unix based systems.

You might be able to use glib hashes to do this in C and then translate
those to Ruby hashes. Just a thought. I’m putting together some code to
see if I can figure it out.

Thanks very much for your help. I sincerely appreciate it! As a side note,
the program is already on the RAA:

http://raa.ruby-lang.org/list.rhtml?name=bsproc

So, you can grok through the code if you would like to; however, it’s not
exactly the same as the development version. As a second side note, although
it is slow to create the probability database, the program is extremely
good at filtering spam. A very small minority of spam messages make their way
into my inbox, and it feels damned good to have written my own spam filter.
Also, I’ve tried strscan in place of scan, and the speedup wasn’t significant.
As it turns out, most of the calculation time is spent doing hash lookups. I’ve
considered RJudy, but apparently, its hashes are slower than Ruby native
hashes… go figure.

Thanks much,
Travis

I think that sorting is about as expensive as hashing.
Here is a quick benchmark for 5,000 iterations:

$ time ruby original_code.rb

real 0m3.888s
user 0m3.520s
sys 0m0.020s

$ time ruby your_code.rb

real 0m3.453s
user 0m3.140s
sys 0m0.040s

I still think that using C is a good idea.

···

On Thu, Feb 20, 2003 at 02:58:08PM +0900, Bulat Ziganshin wrote:

Hello Travis,

Thursday, February 20, 2003, 1:53:14 AM, you wrote:

# This is run about a million times
h = Hash.new(0)
data.scan(iptok).each do |tok|
  h[tok] += 1
end

Have you heard about the “sort|uniq” idiom?


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

I wrote a faster extension, 'hashsucc.so'.
glib is not needed.

Install:
% ruby extconf.rb
% make
% make install

Usage:
h = Hash.new(0)
h.succ! key              #=> equivalent to: h[key] += 1
h.succ! %w(foo bar baz)  #=> equivalent to:
                         #   %w(foo bar baz).each do |key|
                         #     h[key] += 1
                         #   end
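The C source is attached rather than shown inline; as a rough pure-Ruby sketch of the same semantics (my reconstruction from the usage above, not MoonWolf’s actual implementation):

```ruby
class Hash
  # Increment the count under key; accepts a single key or an array of keys.
  # Pure-Ruby sketch of hashsucc.so's apparent behavior, not the real C code.
  def succ!(key)
    if key.is_a?(Array)
      key.each { |k| self[k] = (self[k] || 0) + 1 }
    else
      self[key] = (self[key] || 0) + 1
    end
    self
  end
end

h = Hash.new(0)
h.succ! "foo"
h.succ! %w(foo bar baz)
# h => {"foo"=>2, "bar"=>1, "baz"=>1}
```

The `(self[k] || 0)` fallback lets it work on a plain `Hash.new` with no default, matching the `Hash.new.succ!` call in the benchmark below.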

<----- b_orig.rb ---- original code ------------------->
50_000.times do
  h = Hash.new(0)
  for tok in %w{a b c a g d e f a g h g i j e e k l m g d d a b c}
    h[tok] += 1
  end
end
<----- b_filter.rb ---- Filter.so ------------------->
require "Filter"

50_000.times do
  h = Filter.process %w{a b c a g d e f a g h g i j e e k l m g d d a b c}
end
<----- b_hashsucc.rb ---- hashsucc.so ------------------->
require "hashsucc"

50_000.times do
  Hash.new.succ! %w{a b c a g d e f a g h g i j e e k l m g d d a b c}
end
<----------------------------->

Results:
$ time ruby b_orig.rb
ruby b_orig.rb 12.46s user 0.01s system 100% cpu 12.467 total
$ time ruby b_filter.rb
ruby b_filter.rb 8.29s user 0.12s system 100% cpu 8.402 total
$ time ruby b_hashsucc.rb
ruby b_hashsucc.rb 5.00s user 0.01s system 100% cpu 5.004 total

extconf.rb (71 Bytes)

hashsucc.c (878 Bytes)

···


MoonWolf moonwolf@moonwolf.com

Since I’ve lost the original email, I’ll just say that there’s a
more efficient way to do what you’re doing. Keep a real database of
tokens, and use mail to update it, rather than re-tokenizing the whole
email set.

This is a good idea, and it looks like an interesting project. Does it allow
you to create a database file from already existing corpus files, or does it
require you to start from scratch? I’ve spent a good amount of time building
up the 2000 some-odd messages I’m currently using to gather my statistics,
and I’d hate to have two huge corpus files wasted. Also, the main reason I
started this project was to learn, so I’d still like to find a fast way to
update a hash key without having to access it twice.

Thanks,
Travis

Hello Daniel,

Thursday, February 20, 2003, 9:13:16 AM, you wrote:

I think that sorting is about as expensive as hashing.
Here is a quick benchmark for 5,000 iterations:

I think what’s really expensive is interpreting the code.
If you think about C code, you can try to squeeze the second
block in my program into an Array#uniqc method. IMVHO,
it is the simplest way to speed up your program, just
because other variants will require much more effort
to code

···


Best regards,
Bulat mailto:bulatz@integ.ru

Hello Daniel,

Thursday, February 20, 2003, 9:13:16 AM, you wrote:

# This is run about a million times
h = Hash.new(0)
data.scan(iptok).each do |tok|
  h[tok] += 1
end

Have you heard about the “sort|uniq” idiom?

I still think that using C is a good idea.

Sorry, I didn’t see that you are not Travis :-)

I think a better way to speed up this code is:

h = data.scan(iptok).sort.uniqc.to_hash

where
Array#to_hash - the reverse operation to Hash#to_a
Array#uniqc - takes a sorted array of items and returns an array of pairs [item, count]

and these two methods should be included in Ruby itself, because they support
very widespread programming idioms
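Since `Array#uniqc` and `Array#to_hash` are proposals rather than existing methods (as Travis notes later in the thread, Ruby 1.6.8 has neither), here is a plain-Ruby sketch of what they might look like:

```ruby
class Array
  # Hypothetical uniqc: from a sorted array, return [[item, count], ...] pairs.
  def uniqc
    pairs = []
    each do |item|
      if pairs.last && pairs.last[0] == item
        pairs.last[1] += 1       # same item as previous: bump its count
      else
        pairs << [item, 1]       # new item: start a fresh pair
      end
    end
    pairs
  end

  # Hypothetical to_hash: reverse of Hash#to_a, turning pairs into a hash.
  def to_hash
    h = {}
    each { |k, v| h[k] = v }
    h
  end
end

h = %w[b a a c a].sort.uniqc.to_hash
# h => {"a"=>3, "b"=>1, "c"=>1}
```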

···


Best regards,
Bulat mailto:bulatz@integ.ru

“Travis Whitton” whitton@atlantic.net wrote in the newsgroup message
news:BhZ4a.10243$Mr5.3967@fe06.atl2.webusenet.com

  1. Are the tokens strings?

Yes, my program goes through two files. One consists of only non-spam
messages, and the other is only spam messages. It goes through each file
line by line and divides each line into tokens of interesting data. Here are
the relevant portions of my tokenizer method.

def tokenizer(fh)
  hash = Hash.new(0)
  ipaddr = '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'

Maybe you can improve performance by changing this to:

ipaddr = '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'

or

ipaddr = '[0-9]{1,3}(\.[0-9]{1,3}){3}'

This should make the regexp fail faster for longer sequences of digits.
Just a guess, but maybe worth trying.
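A quick sketch of the difference (sample strings invented): both forms match a normal dotted quad, but on a long dot-free digit run the bounded form gives up after trying at most three digits per position instead of consuming the whole run first:

```ruby
loose = /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/      # unbounded digit runs
tight = /[0-9]{1,3}(\.[0-9]{1,3}){3}/         # at most 3 digits per octet

# Both accept a normal dotted quad:
"10.0.0.1"[tight]         # => "10.0.0.1"

# Neither matches a long digit run without dots, but the bounded form
# abandons each attempt after at most three digits:
"123456789012345"[loose]  # => nil
"123456789012345"[tight]  # => nil
```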

Regards

robert

  token = "[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]"
  iptok = Regexp.compile("#{token}|#{ipaddr}")

  fh.each do |data|
    data.chomp!
    # do a number of string substitutions which use negligible amounts of time
    data.scan(iptok).each do |tok|
      hash[tok] = hash[tok].succ
    end
  end
  hash
end

The messages are standard Unix messages (mbox format?) like so:

From MAILER-DAEMON Mon Sep 23 22:32:37 2002
Date: 23 Sep 2002 22:32:37 -0400
From: Mail System Internal Data MAILER-DAEMON@grub.ath.cx
Subject: DON’T DELETE THIS MESSAGE – FOLDER INTERNAL DATA
Message-ID: 1032834757@grub.atlantic.net
X-IMAP: 1032834509 0000000272
Status: RO

This text is part of the internal format of your mail folder, and is not
a real message. It is created automatically by the mail system software.
If deleted, important folder data will be lost, and it will be re-created
with the data reset to initial values.

  2. Are you running Linux?

Yes, and I only intend for this program to run under unix based systems.

You might be able to use glib hashes to do this in C and then translate
those to Ruby hashes. Just a thought. I’m putting together some code to
see if I can figure it out.

Thanks very much for your help. I sincerely appreciate it! As a side note,
the program is already on the RAA:

http://raa.ruby-lang.org/list.rhtml?name=bsproc

So, you can grok through the code if you would like to; however, it’s not
exactly the same as the development version. As a second side note, although
it is slow to create the probability database, the program is extremely
good at filtering spam. A very small minority of spam messages make their way
into my inbox, and it feels damned good to have written my own spam filter.
Also, I’ve tried strscan in place of scan, and the speedup wasn’t significant.
As it turns out, most of the calculation time is spent doing hash lookups. I’ve
considered RJudy, but apparently, its hashes are slower than Ruby native
hashes… go figure.

Thanks much,
Travis

Yes, my program goes through two files. …

You might want to take a look at a Perl implementation called POPFile. It’s
more of a classical Bayes implementation without the “tweaks” and such that Paul
Graham has introduced. Also, it allows arbitrary buckets, not just “spam/not
spam”.

It’s on sourceforge, so you can get the source.

Just a thought to give you some ideas on further enhancements and/or maybe how
they solved the same problems you’re having. (I run it, BTW, and am getting
about a 98% success rate with my mail, being filtered into 5 different buckets,
1 of them being a “spam” catchall.)

Good stuff.
I’ll take a look at it after class.

···

On Fri, Feb 21, 2003 at 05:41:39PM +0900, MoonWolf wrote:

Results:
$ time ruby b_orig.rb
ruby b_orig.rb 12.46s user 0.01s system 100% cpu 12.467 total
$ time ruby b_filter.rb
ruby b_filter.rb 8.29s user 0.12s system 100% cpu 8.402 total
$ time ruby b_hashsucc.rb
ruby b_hashsucc.rb 5.00s user 0.01s system 100% cpu 5.004 total


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

Nice. Perhaps it could be made more general by checking if the value is a
Fixnum first? (And if not, it could do h = h.succ)

Regards,

Brian.

···

On Fri, Feb 21, 2003 at 05:41:39PM +0900, MoonWolf wrote:

I wrote a faster extension, 'hashsucc.so'.
glib is not needed.

I think a better way to speed up this code is:

h = data.scan(iptok).sort.uniqc.to_hash

I’m using Ruby 1.6.8, which doesn’t have a to_hash or uniqc method. When
were these methods introduced?

-Travis

Yeah. I’ll think about writing an Array#uniqc method this afternoon. It
might turn out to be easier than the one I wrote yesterday.

···

On Thu, Feb 20, 2003 at 03:57:03PM +0900, Bulat Ziganshin wrote:

Hello Daniel,

Thursday, February 20, 2003, 9:13:16 AM, you wrote:

I think that sorting is about as expensive as hashing.
Here is a quick benchmark for 5,000 iterations:

I think what’s really expensive is interpreting the code.
If you think about C code, you can try to squeeze the second
block in my program into an Array#uniqc method. IMVHO,
it is the simplest way to speed up your program, just
because other variants will require much more effort
to code


Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137

Yeah, I understand the learning bit :-) I do lots of stuff that other
people have done, just for the hell of it.

You’ll just feed each corpus file into the filter, one marked as spam,
the other marked as good. After that, you’re done. Use procmail to
look at each message, decide if it is spam or not, then place it in a
spam folder if its spam, or let it go through the rest of your filters
if not. There’s also a mutt keybinding that marks it as spam/good mail,
which removes it as spam, then re-adds it as good.

Also, if you want help with RubyInline, I suggest you mail
ruby@zenspider.com, or the author himself, he doesn’t read this list.

···

Travis Whitton (whitton@atlantic.net) wrote:

Since I’ve lost the original email, I’ll just say that there’s a
more efficient way to do what you’re doing. Keep a real database of
tokens, and use mail to update it, rather than re-tokenizing the whole
email set.

This is a good idea, and it looks like an interesting project. Does it allow
you to create a database file from already existing corpus files, or does it
require you to start from scratch? I’ve spent a good amount of time building
up the 2000 some-odd messages I’m currently using to gather my statistics,
and I’d hate to have two huge corpus files wasted. Also, the main reason I
started this project was to learn, so I’d still like to find a fast way to
update a hash key without having to access it twice.


Eric Hodel - drbrain@segment7.net - http://segment7.net
All messages signed with fingerprint:
FEC2 57F1 D465 EB15 5D6E 7C11 332A 551C 796C 9F04