Hi Algorithmists,
I am writing a spell checker in ruby. The first step
is to load the dictionary into memory. After some
experiments, I found that the following simple code
worked quite nicely:
f = File.new("dict.txt")
text = f.sysread(File.size("dict.txt"))
lexicon = text.split(/\n/)
On my 1.6GHz Pentium 4 running ruby 1.6.7, the total
time to load the dictionary is less than 0.3
seconds. (The dictionary file is about 1MB with 90K
records.)
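For anyone who wants to reproduce the measurement, a self-contained sketch using the Benchmark stdlib; the `dict.txt` written here is a generated stand-in for the real word list, not the poster's file:

```ruby
# Hedged sketch: time the load step with the Benchmark stdlib.
# "dict.txt" is generated below as a stand-in so the example runs
# on its own; File.read replaces the File.new/sysread dance.
require 'benchmark'

File.open("dict.txt", "w") do |f|
  90_000.times { |i| f.puts "word#{i}" }
end

lexicon = nil
elapsed = Benchmark.realtime do
  text = File.read("dict.txt")
  lexicon = text.split(/\n/)
end
puts "#{lexicon.size} words in #{'%.3f' % elapsed}s"
```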
However, since the dictionary lookup operation will
be quite heavy, I am thinking of using a hash instead
of an array. I tried the following code:
f = File.new("dict.txt")
text = f.sysread(File.size("dict.txt"))
words = text.split(/\n/)
lexicon = {}
words.each do |word|
  lexicon[word] = 0
end
Disaster! It took about 15 seconds to load the
dictionary. The problem is that the #each method took too
much time.
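On a modern interpreter the per-element loop is no longer a bottleneck, but the intermediate word array can be skipped entirely by building the hash while streaming the file. A minimal sketch; the three-word file written here is only a stand-in for the real dict.txt:

```ruby
# Sketch: build the lexicon hash directly from the file, one line
# at a time, with no intermediate word array.
File.open("dict.txt", "w") do |f|
  %w[apple banana cherry].each { |w| f.puts w }
end

lexicon = {}
File.foreach("dict.txt") do |line|
  lexicon[line.chomp] = 0
end
puts lexicon.size
```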
I have 2 questions:

- Is it worth using a hash instead of an array with binary
search?

- If I want to use a hash, how do I minimize the dictionary
load time? BTW, I tried to convert the dict.txt file into
a ruby command, i.e., dictionary={'first'=>0,'second'=>0,
...}, but the system hangs when I try to eval(text) after
sysread...
Thanks a lot!
Shannon
···
However, since the dictionary lookup operation will
be quite heavy, I am thinking of using a hash instead
of an array. I tried the following code:
f = File.new("dict.txt")
text = f.sysread(File.size("dict.txt"))
words = text.split(/\n/)
lexicon = {}
words.each do |word|
  lexicon[word] = 0
end
Disaster! It took about 15 seconds to load the
dictionary. The problem is that the #each method took too
much time.
This code took 3 or 4 seconds to complete on my 900MHz PIII with 256MB RAM.
h = {}
x = ["abd123"] * 90000
x.each_with_index do |y, i|
  h[y + i.to_s] = 1
end
puts h.size
I have 2 questions:

- Is it worth using a hash instead of an array with binary
search?
I’d say so. Hashes are extremely fast in general, and I presume the same in
Ruby. Looking up a string in a hash is very fast [O(1) on average], and faster
than a binary search [O(log n)].
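For scale, here is a sketch of the two lookups side by side. Array#bsearch requires Ruby 2.0+, far newer than the 1.6.7 in the thread, so treat this as a modern illustration of the same trade-off rather than code the original posters could have run:

```ruby
# Compare hash lookup (average O(1)) with binary search (O(log n))
# over a sorted word list. Array#bsearch needs Ruby >= 2.0.
words = (0...10_000).map { |i| format("word%05d", i) }  # already sorted
lexicon = {}
words.each { |w| lexicon[w] = true }

target = "word04242"
found_by_hash    = lexicon.include?(target)
# bsearch find-any mode: the block returns target <=> element,
# steering the search left or right until it hits 0.
found_by_bsearch = !words.bsearch { |w| target <=> w }.nil?
puts found_by_hash && found_by_bsearch
```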
- If I want to use a hash, how do I minimize the dictionary
load time? BTW, I tried to convert the dict.txt file into
a ruby command, i.e., dictionary={'first'=>0,'second'=>0,
...}, but the system hangs when I try to eval(text) after
sysread...
I wouldn’t bother with evaling like that. Prefer an elegant solution to a fast
solution (which I highly doubt eval is) until your program is complete enough
for you to start worrying about execution times.
Gavin
···
From: "Shannon Fang" xrfang@hotmail.com
Hello Shannon,
Wednesday, December 18, 2002, 5:19:50 PM, you wrote:
f = File.new("dict.txt")
text = f.sysread(File.size("dict.txt"))
lexicon = text.split(/\n/)
if lexicon.size % 2 == 0
  hash = Hash[ *lexicon ]
  hash.update Hash[ *(lexicon[1..-1] + [0]) ]
else
  hash = Hash[ *(lexicon + [0]) ]
  hash.update Hash[ *lexicon[1..-1] ]
end
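The even/odd dance exists because Hash[] consumes its arguments as key, value pairs, so interleaving a shifted copy of the list makes every word end up as a key. A sketch of a more direct construction; each_with_object is Ruby 1.9+, so this was not available on the thread's 1.6.7:

```ruby
# Direct construction: map every word to 0 in one pass, with no
# interleaving trick needed.
lexicon = %w[apple banana cherry]
hash = lexicon.each_with_object({}) { |word, h| h[word] = 0 }
puts hash.size
```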
as an alternative, see the Marshal class:
- run this program one time:

... # your code for reading lexicon from text file
File.open("posterity", "w+") do |f|
  Marshal.dump(lexicon, f)
end
- use this code to restore lexicon:

lexicon = ''
File.open("posterity") do |f|
  lexicon = Marshal.load(f)
end
... # using lexicon
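Put together, the dump/load pair round-trips like this. A self-contained sketch; binary mode is used because Marshal output is binary data and text mode would corrupt it on Windows:

```ruby
# Round-trip sketch of the Marshal approach: dump the built hash
# once, then reload it quickly on later runs.
lexicon = { "alpha" => 0, "beta" => 0 }

File.open("posterity", "wb") { |f| Marshal.dump(lexicon, f) }
restored = File.open("posterity", "rb") { |f| Marshal.load(f) }
puts restored == lexicon
```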
···
--
Best regards,
Bulat mailto:bulatz@integ.ru
Hello dblack,
Wednesday, December 18, 2002, 6:14:15 PM, you wrote:
words.each do |word|
  lexicon[word] = 0
end

Disaster! It took about 15 seconds to load the
dictionary. The problem is that the #each method took too
much time.
That’s really puzzling. I ran the same script in 0.23s real time,
using a dict with about 38K words, on a 1.4GHz Pentium.
Hmm, but in any case we need more functions to do fast computations
with hashes. I propose:

- Array#to_hash as the reverse of Hash#to_a
- Hash#update(anArray) as equivalent to Hash#update(anArray.to_hash)
- Hash#update(anArray, value) as equivalent to
  anArray.each do |word|
    aHash[word] = value
  end
- Hash.new(anArray, value) as equivalent to
  aHash = {}
  anArray.each do |word|
    aHash[word] = value
  end
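None of these exist in core Ruby (modern Ruby offers Array#to_h and Hash#merge instead), but the proposals are easy to sketch as monkey patches. The method names below are hypothetical, following the proposal; `update_all` is used instead of overriding the real Hash#update:

```ruby
# Hypothetical sketch of the proposed extensions; not core methods.
class Array
  # Reverse of Hash#to_a: [[k, v], ...] becomes {k => v, ...}.
  def to_hash
    h = {}
    each { |k, v| h[k] = v }
    h
  end
end

class Hash
  # Map every element of an array to the same value, in place.
  def update_all(array, value)
    array.each { |k| self[k] = value }
    self
  end
end

h = [["a", 1], ["b", 2]].to_hash
h.update_all(%w[c d], 0)
puts h.size
```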
···
--
Best regards,
Bulat mailto:bulatz@integ.ru