the simplest and most robust method is probably going to be to use sqlite to store the id of each sequence. this protects you if the program crashes and helps while you develop, since ids that are already stored survive between runs. for example:
cfp:~ > ruby a.rb
cfp:~ > sqlite3 .proteins.db 'select * from proteins'
42|ABC123
cfp:~ > ruby a.rb
a.rb:27:in `[]=': 42 (IndexError)
        from /opt/local/lib/ruby/gems/1.8/gems/amalgalite-0.2.1/lib/amalgalite/database.rb:477:in `transaction'
        from a.rb:24:in `[]='
        from a.rb:6
cfp:~ > sqlite3 .proteins.db 'select * from proteins'
42|ABC123
cfp:~ > cat a.rb
db = ProteinDatabase.new
id, sequence = 42, 'ABC123'
db[id] = sequence
BEGIN {
  require 'rubygems'
  require 'amalgalite'
  class ProteinDatabase
    SCHEMA = <<-SQL
      create table proteins(
        id integer primary key,
        sequence blob
      );
    SQL
    def []= id, sequence
      @db.transaction {
        query = 'select id from proteins where id=$id'
        rows = @db.execute(query, '$id' => id)
        raise IndexError, id.to_s if rows and rows[0] and rows[0][0]
        blob = blob_for( sequence )
        insert = 'insert into proteins values ($id, $sequence)'
        @db.execute(insert, '$id' => id, '$sequence' => blob)
      }
    end
    private
    def initialize path = default_path
      @path = path
      setup!
    end
    def setup!
      @db = Amalgalite::Database.new @path
      unless @db.schema.tables['proteins']
        @db.execute SCHEMA
        @db = Amalgalite::Database.new @path
      end
      @sequence_column = @db.schema.tables['proteins'].columns['sequence']
    end
    def blob_for string
      Amalgalite::Blob.new(
        :string => string,
        :column => @sequence_column
      )
    end
    def default_path
      File.join( home, '.proteins.db' )
    end
    def home
      home =
        catch :home do
          ["HOME", "USERPROFILE"].each do |key|
            throw(:home, ENV[key]) if ENV[key]
          end
          if ENV["HOMEDRIVE"] and ENV["HOMEPATH"]
            throw(:home, "#{ ENV['HOMEDRIVE'] }:#{ ENV['HOMEPATH'] }")
          end
          File.expand_path("~") rescue(File::ALT_SEPARATOR ? "C:/" : "/")
        end
      File.expand_path home
    end
  end
}
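if sqlite is more than you want to pull in, the same idea can be sketched with the standard library only: a Set gives constant-time duplicate checks, and an append-only file of ids lets a restarted run pick up where it left off. this is only a rough sketch under my own assumptions, not a replacement for the database version: the SeenIds class, its add? method and the 'seen.ids' file name are all made up for illustration, and it records ids only, not the sequences themselves.

```ruby
require 'set'
require 'tmpdir'

# hypothetical helper (not part of a.rb): remembers processed ids in a Set
# and journals each new id to a plain text file, one id per line.
class SeenIds
  def initialize(path)
    @path = path
    @ids = Set.new
    # reload ids journaled by a previous (possibly crashed) run
    File.foreach(@path) { |line| @ids << line.chomp } if File.exist?(@path)
  end

  # returns true the first time an id is seen, false for a duplicate
  def add?(id)
    key = id.to_s
    return false if @ids.include?(key)
    File.open(@path, 'a') { |f| f.puts(key) } # journal first, then remember
    @ids << key
    true
  end

  def size
    @ids.size
  end
end

# small demo in a throwaway directory
Dir.mktmpdir do |dir|
  path = File.join(dir, 'seen.ids')
  seen = SeenIds.new(path)
  [42, 7, 42].each do |id|
    puts "#{id}: #{seen.add?(id) ? 'new' : 'duplicate'}"
  end
  # a fresh object simulates a restart: id 7 is still known
  restarted = SeenIds.new(path)
  puts "after restart, 7: #{restarted.add?(7) ? 'new' : 'duplicate'}"
end
```

in the download loop you would call add?(id) before writing a sequence to its file and skip the write when it returns false. 50 000 ids is a small Set by ruby's standards, so memory should not be the limiting factor here.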
a @ http://codeforpeople.com/
On Jul 6, 2008, at 12:22 PM, Janus Bor wrote:
Hello everyone,
I'm pretty new to Ruby and programming in general. Here's my problem:
I'm writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can't simply save all
sequences in a hash (with their id as key) and then write them to hd
after downloading has finished. So my idea is to write every sequence to
the corresponding file immediately, but first I have to check if it has
been processed already.
I could save all processed ids in an array and then check if the array
includes my current id:
sequences = []
some kind of loop magic
  unless sequences.include?(id)
    process file
    sequences << id
  end
But I suspect that sequences.include?(id) would iterate over the whole
array until it finds a match. As this array might have up to 50 000
entries and I will have to do this check for every sequence, this
would probably be very inefficient.
I could also save all processed ids as keys of a hash, however I don't
have any use for a value:
sequences = {}
some kind of loop magic
  unless sequences[id]
    process file
    sequences[id] = true
  end
Would this method be more efficient? Is there a more elegant way? Also,
can Ruby handle arrays/hashes of this size?
Thanks in advance!
--
Posted via http://www.ruby-forum.com/.
--
we can deny everything, except that we have the possibility of being better. simply reflect on that.
h.h. the 14th dalai lama