"Travis Whitton" <whitton@atlantic.net> wrote in message
news:BhZ4a.10243$Mr5.3967@fe06.atl2.webusenet.com…
- Are the tokens strings?
Yes, my program goes through two files. One consists of only non-spam
messages, and the other is only spam messages. It goes through each file
line by line and divides each line into tokens of interesting data. Here
are the relevant portions of my tokenizer method.
def tokenizer(fh)
hash = Hash.new(0)
ipaddr = '[0-9]+.[0-9]+.[0-9]+.[0-9]+'
Maybe you can improve performance by changing this to:
ipaddr = '[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}'
or
ipaddr = '[0-9]{1,3}(.[0-9]{1,3}){3}'
This should make the regexp fail faster for longer sequences of digits.
Just a guess, but maybe worth trying.
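A quick way to check the guess is to time both patterns on some sample lines. The following is only a sketch: it assumes the dots are meant to match a literal "." (so they are escaped here), and the sample lines, the constant names and the count_matches helper are made up for illustration.

```ruby
require 'benchmark'

# Assumption: dots are escaped so that they match a literal "." only.
ORIGINAL = /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/
BOUNDED  = /[0-9]{1,3}(?:\.[0-9]{1,3}){3}/

# Hypothetical sample input: one line with an address, one with long digit runs.
SAMPLE_LINES = [
  "Received: from mail.example.com (192.168.0.1) by localhost",
  "X-IMAP: 1032834509 0000000272"
]

# Count how many matches a pattern finds across all lines.
def count_matches(pattern, lines)
  lines.sum { |line| line.scan(pattern).size }
end

n = 20_000
Benchmark.bm(10) do |bm|
  bm.report("original:") { n.times { count_matches(ORIGINAL, SAMPLE_LINES) } }
  bm.report("bounded:")  { n.times { count_matches(BOUNDED, SAMPLE_LINES) } }
end
```

Note that without anchors the bounded pattern can still match inside a longer digit run: on "1234.5.6.7" it finds "234.5.6.7". If that matters, a leading \b or a negative lookbehind would be needed, which may change the timing again.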
Regards
robert
token = "[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]"
iptok = Regexp.compile("#{token}|#{ipaddr}")
fh.each do |data|
data.chomp!
# do a number of string substitutions which use negligible amounts of time
data.scan(iptok).each do |tok|
hash[tok] = hash[tok].succ
end
end
hash
end
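For anyone who wants to try the tokenizer in isolation, here is a minimal runnable sketch of it. It is not the bsproc code itself: it assumes the dots in the IP pattern are meant to be literal (so they are escaped), it leaves out the string substitutions, and it reads a hypothetical two-line sample through StringIO instead of a mail file.

```ruby
require 'stringio'

# Count token occurrences in anything that responds to each_line.
def tokenizer(fh)
  counts = Hash.new(0)
  # Assumption: dots are escaped so they only match a literal "."
  ipaddr = '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
  token  = "[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]"
  iptok  = Regexp.compile("#{token}|#{ipaddr}")
  fh.each_line do |data|
    data.chomp!
    # scan with a block avoids building an intermediate array per line
    data.scan(iptok) { |tok| counts[tok] += 1 }
  end
  counts
end

# Hypothetical sample input standing in for a mail file.
sample = StringIO.new("cheap pills cheap\nfrom 10.0.0.1\n")
p tokenizer(sample)
```

Because the token pattern needs a first, a middle and a last character, words shorter than three characters are never counted.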
The messages are standard unix messages(mbox format?) like so:
From MAILER-DAEMON Mon Sep 23 22:32:37 2002
Date: 23 Sep 2002 22:32:37 -0400
From: Mail System Internal Data <MAILER-DAEMON@grub.ath.cx>
Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
Message-ID: <1032834757@grub.atlantic.net>
X-IMAP: 1032834509 0000000272
Status: RO
This text is part of the internal format of your mail folder, and is not
a real message. It is created automatically by the mail system software.
If deleted, important folder data will be lost, and it will be re-created
with the data reset to initial values.
- Are you running Linux?
Yes, and I only intend for this program to run under unix based systems.
You might be able to use glib hashes to do this in C and then translate
those to Ruby hashes. Just a thought. I’m putting together some code to
see if I can figure it out.
Thanks very much for your help. I sincerely appreciate it! As a side note,
the program is already on the RAA:
http://raa.ruby-lang.org/list.rhtml?name=bsproc
So, you can grok through the code if you would like to; however, it’s not
exactly the same as the development version. As a second side note, although
it is slow to create the probability database, the program is extremely
good at filtering spam. Only a very small minority of spam messages make
their way into my inbox, and it feels damned good to have written my own
spam filter.
Also, I’ve tried strscan in place of scan, and the speedup wasn’t
significant. As it turns out, most of the calculation time is spent doing
hash lookups. I’ve considered RJudy, but apparently, its hashes are slower
than Ruby native hashes… go figure.
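One rough way to see how the time splits between scanning and hash updates is to time the two steps separately. This is only a sketch with made-up data (200,000 tokens drawn from 5,000 distinct strings), not a measurement of the real program, and the numbers will differ on real mail.

```ruby
require 'benchmark'

# Hypothetical token stream: 200_000 tokens, 5_000 distinct values.
tokens  = Array.new(200_000) { |i| "tok#{i % 5_000}" }
# A 50-token line, scanned 4_000 times, gives the same number of tokens.
line    = tokens.first(50).join(" ")
pattern = /[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]/

counts = Hash.new(0)
Benchmark.bm(8) do |bm|
  bm.report("scan:") { 4_000.times { line.scan(pattern) } }
  bm.report("hash:") { tokens.each { |tok| counts[tok] += 1 } }
end
```

If the hash step dominates here as well, reducing the number of updates (for example by counting per line first) is more likely to help than swapping the scanner.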
Thanks much,
Travis