Implementing a simple and efficient index system

Hello everyone,

I'm pretty new to Ruby and programming in general. Here's my problem:

I'm writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can't simply save all
sequences in a hash (with their id as key) and then write them to hd
after downloading has finished. So my idea is to write every sequence to
the corresponding file immediately, but first I have to check if it has
been processed already.

I could save all processed id's in an array and then check if the array
includes my current id:

sequences = []
some kind of loop magic
unless sequences.include?(id)
  process file
  sequences << id
end
end

But I suspect that sequences.include?(id) would iterate over the whole
array until it finds a match. As this array might have up to 50 000
entries and I will have to do this check for every sequence, this
would probably be very inefficient.
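That intuition is easy to check with a rough benchmark sketch (the ids here are made up):

```ruby
require 'benchmark'

# 50 000 fake sequence ids
ids = (1..50_000).map { |i| format('SEQ%05d', i) }

array = ids.dup
hash  = ids.each_with_object({}) { |id, h| h[id] = true }

# Worst case for the array: the probe sits at the very end,
# so Array#include? must scan all 50 000 entries.
probe = ids.last

Benchmark.bm(6) do |bm|
  bm.report('array') { 100.times { array.include?(probe) } }
  bm.report('hash')  { 100.times { hash.key?(probe) } }
end
```

On a typical machine the hash column comes out orders of magnitude faster, because Hash#key? is a constant-time lookup while Array#include? is a linear scan.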

I could also save all processed id's as keys of a hash, however I don't
have any use for a value:

sequences = {}
some kind of loop magic
unless sequences[id]
  process file
  sequences[id] = true
end
end

Would this method be more efficient? Is there a more elegant way? Also,
can Ruby handle arrays/hashes of this size?

Thanks in advance!

···

--
Posted via http://www.ruby-forum.com/.

Janus Bor wrote:

I'm pretty new to Ruby and programming in general. Here's my problem:

I'm writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can't simply save all
sequences in a hash (with their id as key)

How do you know that? Did you try it, as an experiment?

Janus Bor wrote:

I could also save all processed id's as keys of a hash, however I don't
have any use for a value:

sequences = {}
some kind of loop magic
unless sequences[id]
  process file
  sequences[id] = true
end
end

Would this method be more efficient? Is there a more elegant way? Also,
can Ruby handle arrays/hashes of this size?

It's not so bad to use true as a hash value. But if it bothers you, there is the Set class, which is really a hash underneath, but the interface is set-membership rather than associative lookup:

require 'set'

s = Set.new

s << 123
s << 456

p s.include?(456) # ==> true
p s.include?(789) # ==> false

···

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Janus Bor wrote:

Hello everyone,

I'm pretty new to Ruby and programming in general. Here's my problem:

I'm writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can't simply save all
sequences in a hash (with their id as key) and then write them to hd
after downloading has finished. So my idea is to write every sequence to
the corresponding file immediately, but first I have to check if it has
been processed already.

Can you simply use the id as a filename, and check for file existence before writing? If your file system doesn't handle huge dirs well, then split the id into several terms. But I'd try the hash or set approach first, to avoid all the system calls.
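A sketch of that idea (the directory name and id format are made up):

```ruby
require 'fileutils'

DIR = 'sequences'
FileUtils.mkdir_p(DIR)

# Write a sequence to its own file unless that id was already handled;
# the file system itself acts as the index of processed ids.
def write_sequence(id, data)
  path = File.join(DIR, "#{id}.fasta")
  return false if File.exist?(path) # duplicate: skip
  File.open(path, 'w') { |f| f.write(data) }
  true
end

write_sequence('P12345', ">P12345\nMKTAYIAKQR") # => true (written)
write_sequence('P12345', ">P12345\nMKTAYIAKQR") # => false (already there)
```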

···

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

the simplest and most robust method is probably going to be to use sqlite to store the id of each sequence. this will help you in the case of a program crash and as you develop. for example:

cfp:~ > ruby a.rb

cfp:~ > sqlite3 .proteins.db 'select * from proteins'
42|ABC123

cfp:~ > ruby a.rb
a.rb:27:in `[]=': 42 (IndexError)
         from /opt/local/lib/ruby/gems/1.8/gems/amalgalite-0.2.1/lib/amalgalite/database.rb:477:in `transaction'
         from a.rb:24:in `[]='
         from a.rb:6

cfp:~ > sqlite3 .proteins.db 'select * from proteins'
42|ABC123

cfp:~ > cat a.rb

db = ProteinDatabase.new

id, sequence = 42, 'ABC123'

db[id] = sequence

BEGIN {

   require 'rubygems'
   require 'amalgalite'

   class ProteinDatabase
     SCHEMA = <<-SQL
       create table proteins(
         id integer primary key,
         sequence blob
       );
     SQL

     def []=(id, sequence)
       @db.transaction {
         query = 'select id from proteins where id=$id'
         rows = @db.execute(query, '$id' => id)
         raise IndexError, id.to_s if rows and rows[0] and rows[0][0]
         blob = blob_for( sequence )
         insert = 'insert into proteins values ($id, $sequence)'
         @db.execute(insert, '$id' => id, '$sequence' => blob)
       }
     end

   private
     def initialize path = default_path
       @path = path
       setup!
     end

     def setup!
       @db = Amalgalite::Database.new @path
       unless @db.schema.tables['proteins']
         @db.execute SCHEMA
         @db = Amalgalite::Database.new @path
       end
       @sequence_column = @db.schema.tables['proteins'].columns['sequence']
     end

     def blob_for string
       Amalgalite::Blob.new(
         :string => string,
         :column => @sequence_column
       )
     end

     def default_path
       File.join( home, '.proteins.db' )
     end

     def home
       home =
         catch :home do
           ["HOME", "USERPROFILE"].each do |key|
             throw(:home, ENV[key]) if ENV[key]
           end

           if ENV["HOMEDRIVE"] and ENV["HOMEPATH"]
             throw(:home, "#{ ENV['HOMEDRIVE'] }:#{ ENV['HOMEPATH'] }")
           end

           File.expand_path("~") rescue(File::ALT_SEPARATOR ? "C:/" : "/")
         end

       File.expand_path home
     end
   end

}

a @ http://codeforpeople.com/

···

On Jul 6, 2008, at 12:22 PM, Janus Bor wrote:


--
we can deny everything, except that we have the possibility of being better. simply reflect on that.
h.h. the 14th dalai lama

BioRuby+BioSQL ?
You can fetch a sequence from servers and dump it directly into the
database. You can choose MySQL, PostgreSQL, or SQLite.

ok it's not well coded but works:

  server = Bio::Fetch.new('http://www.ebi.ac.uk/cgi-bin/dbfetch')
  ARGV.flags.accession.split.each do |accession|
    puts accession
    if Bio::SQL.exists_accession(accession)
      puts "Entry #{accession} already exists!"
    else
      entry_str = server.fetch('embl', accession, 'raw', 'embl')

      if entry_str == "No entries found. \n"
        $stderr.puts "Error: no entry #{accession} found. #{entry_str}"
      else
        puts "Downloaded!"
        puts "Loading..."
        puts "Converting EMBL obj..."
        entry = Bio::EMBL.new(entry_str)
        puts "Converting Biosequence obj..."
        biosequence = entry.to_biosequence
        puts "Saving Biosequence into Bio::SQL::Sequence database"
        result = Bio::SQL::Sequence.new(
          :biosequence => biosequence,
          :biodatabase_id => db.id
        ) unless Bio::SQL.exists_accession(biosequence.primary_accession)
        puts entry.entry_id
        if result.nil?
          pp "The sequence is already present into the biosql database"
        else
          pp "Stored."
        end
      end # not found on web
    end # bioentry exists
  end # list accession

PS: I need to write docs about BioSQL and Ruby, sorry my fault.

···

On Jul 6, 8:22 pm, Janus Bor <ja...@urban-youth.com> wrote:


--
Ra

You can use BioRuby+BioSQL, fetching data from a remote server and
storing into the db.

···

On 6 Lug, 20:22, Janus Bor <ja...@urban-youth.com> wrote:


phlip wrote:

Janus Bor wrote:

I'm pretty new to Ruby and programming in general. Here's my problem:

I'm writing a program that will automatically download protein sequences
from a server and write them into the corresponding file. Every single
sequence has a unique id and I have to eliminate duplicates. However, as
the number of sequences might exceed 50 000, I can't simply save all
sequences in a hash (with their id as key)

How do you know that? Did you try it, as an experiment?

No, I didn't try it and it might actually work: Every sequence has a
size of ~1kb, so 50 000 sequences would probably be around 50mb. But
getting all this data will take hours, so I need to implement a system
that will not lose all data if the program is terminated abnormally.

Joel VanderWerf wrote:

Janus Bor wrote:

Would this method be more efficient? Is there a more elegant way? Also,
can Ruby handle arrays/hashes of this size?

It's not so bad to use true as a hash value. But if it bothers you,
there is the Set class, which is really a hash underneath, but the
interface is set-membership rather than associative lookup:

require 'set'

s = Set.new

s << 123
s << 456

p s.include?(456) # ==> true
p s.include?(789) # ==> false

Thanks, that's exactly what I was looking for! I didn't know set
basically works like a hash without a key...

···

--
Posted via http://www.ruby-forum.com/.

Janus Bor wrote:

No, I didn't try it and it might actually work: Every sequence has a size of ~1kb, so 50 000 sequences would probably be around 50mb. But getting all this data will take hours, so I need to implement a system that will not lose all data if the program is terminated abnormally.
  

Here are some simple alternatives for persisting and retrieving your data in the order I'd recommend them based on what you've described so far:

1. PStore standard library: Put your objects into a magical hash that's automatically persisted to a file. Probably the quickest and easiest solution. See http://www.ruby-doc.org/stdlib/libdoc/pstore/rdoc/classes/PStore.html

2. Lightweight SQL database: Maybe store sequences in SQLite as BLOBs. Probably the best long-term solution, but will require you to work harder to transform data to and from storage. See http://sqlite-ruby.rubyforge.org/

3. Marshal core class: Dump objects to and from strings, and then files. Useful if you need something more than PStore, but still want to persist objects directly. See module Marshal - RDoc Documentation
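For option 1, a minimal PStore sketch (the file name and ids are arbitrary):

```ruby
require 'pstore'

store = PStore.new('sequences.pstore')

# Each transaction is written to disk atomically, so the data recorded
# so far survives a crash between downloads.
store.transaction do
  store['P12345'] ||= 'MKTAYIAKQR' # record unless already present
end

store.transaction(true) do # read-only transaction
  store['P12345'] # => "MKTAYIAKQR"
end
```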

Best of luck.

-igal

No, I didn't try it and it might actually work: Every sequence has a
size of ~1kb, so 50 000 sequences would probably be around 50mb. But
getting all this data will take hours, so I need to implement a system
that will not lose all data if the program is terminated abnormally.

Try it with random data first. That way, you know the behavior under load without paying the acquisition time.
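Something along these lines, with the ~1 kb sequence size taken from the earlier estimate:

```ruby
# Simulate 50 000 downloaded sequences without touching the network:
# ~1 kb of payload per id, all held in one Hash.
data = {}
50_000.times do |i|
  data[format('SEQ%05d', i)] = 'A' * 1024
end

data.size             # => 50000
data.key?('SEQ00042') # => true
```

That is roughly 50 MB of strings; if this builds and looks up quickly, the in-memory Hash is not the bottleneck.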

  - Robert

Thanks, that's exactly what I was looking for! I didn't know set
basically works like a hash without a key...

make that "without a value".

Robert

···

--
http://ruby-smalltalk.blogspot.com/

---
AALST (n.) One who changes his name to be further to the front
D.Adams; The Meaning of LIFF

Igal Koshevoy wrote:

Janus Bor wrote:

No, I didn't try it and it might actually work: Every sequence has a size of ~1kb, so 50 000 sequences would probably be around 50mb. But getting all this data will take hours, so I need to implement a system that will not lose all data if the program is terminated abnormally.
  

Here are some simple alternatives for persisting and retrieving your data in the order I'd recommend them based on what you've described so far:

1. PStore standard library: Put your objects into a magical hash, that's automatically persisted to a file. Probably the quickest and easiest solution. See http://www.ruby-doc.org/stdlib/libdoc/pstore/rdoc/classes/PStore.html

PStore writes the whole file at once, not incrementally. Not really what OP is looking for, IMO.

2. Lightweight SQL database: Maybe store sequences in SQLite as BLOBs. Probably the best long-term solution, but will require you to work harder to transform data to and from storage. See http://sqlite-ruby.rubyforge.org/

Not clear that would be better than files. Maybe so, if the individual strings are short. Would be interesting to get some benchmarks on this question.

3. Marshal core class: Dump objects to and from strings, and then files. Useful if you need something more than PStore, but still want to persist objects directly. See module Marshal - RDoc Documentation

PStore uses Marshal, so it's odd to say that Marshal is more than PStore.

If you're looking for a way to manage marshalled (or string or yaml...) data in multiple files, using file paths as db keys, look no further than:

http://raa.ruby-lang.org/project/fsdb/

I think the Set/Hash + many files option is best here, though.
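Putting those two pieces together, a restart-safe sketch (directory and ids are made up): on startup, rebuild the set of processed ids from the files already on disk, then consult the in-memory Set before each write:

```ruby
require 'set'
require 'fileutils'

dir = 'sequences'
FileUtils.mkdir_p(dir)

# Crash recovery: every file already present marks a processed id.
seen = Dir.glob(File.join(dir, '*.seq'))
          .map { |p| File.basename(p, '.seq') }
          .to_set

%w[P12345 P67890 P12345].each do |id|
  next if seen.include?(id) # duplicate: skip without touching the disk
  File.open(File.join(dir, "#{id}.seq"), 'w') { |f| f.write(id) }
  seen << id
end

seen.size # => 2 unique sequences written
```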

···

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Robert Dober wrote:

Thanks, that's exactly what I was looking for! I didn't know set
basically works like a hash without a key...

make that "without a value".

For sets in Perl I've used hashes with an arbitrary value of 1, or
undef. In Ruby I guess that would be values of true or nil. Any better
suggestions, apart from using Set of course?

···

--
Posted via http://www.ruby-forum.com/.

Which makes an interesting contrast between Ruby and Smalltalk.

In Smalltalk-80, Set is the more "fundamental" class; the implementation uses
hashing to ensure that duplicates are eliminated and to speed up the test of
whether or not a Set contains a given element.

Smalltalk's equivalent to Hash, the Dictionary class, is implemented (via
inheritance) as a Set of association objects, where an association
represents a key-value pair, two associations are equal if their keys are
equal, and the hash of an association is the hash of its key.

Ruby, on the other hand, implements Set as a Hash whose values are
unimportant, and does this by delegating to a Hash rather than via
inheritance.
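The Ruby side of that contrast is visible from plain code: a Set behaves exactly like a Hash whose keys are the members and whose values are ignored (the internal Hash delegate is an implementation detail of the stdlib Set and may change between Ruby versions):

```ruby
require 'set'

s = Set.new
s << 'P12345'

h = { 'P12345' => true } # the shape Set keeps internally

s.include?('P12345') # => true, the set-membership interface
h.key?('P12345')     # => true, the equivalent associative lookup
```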

···

On Mon, Jul 7, 2008 at 3:12 AM, Robert Dober <robert.dober@gmail.com> wrote:

>
> Thanks, that's exactly what I was looking for! I didn't know set
> basically works like a hash without a key...

make that "without a value".

--
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

Joel VanderWerf wrote:

Igal Koshevoy wrote:

1. PStore standard library: Put your objects into a magical hash, that's automatically persisted to a file. Probably the quickest and easiest solution. See http://www.ruby-doc.org/stdlib/libdoc/pstore/rdoc/classes/PStore.html

PStore writes the whole file at once, not incrementally. Not really what OP is looking for, IMO.

It takes ~2s for my machine to read or write the 50MB PStore file. This isn't a big deal if the original poster (OP) doesn't mind keeping the program running to process multiple sequences at once.

2. Lightweight SQL database: Maybe store sequences in SQLite as BLOBs. Probably the best long-term solution, but will require you to work harder to transform data to and from storage. See http://sqlite-ruby.rubyforge.org/

Not clear that would be better than files. Maybe so, if the individual strings are short. Would be interesting to get some benchmarks on this question.

Files would probably be faster, but with such a small dataset, we're probably talking about less than a second of difference for processing the full dataset. I like using SQLite for stuff like this because it provides a standard, out-of-the-box solution for working with persistence, incremental processing, structured data, queries, and the ability to easily add more fields to a record.

3. Marshal core class: Dump objects to and from strings, and then files. Useful if you need something more than PStore, but still want to persist objects directly. See module Marshal - RDoc Documentation

PStore uses Marshal, so it's odd to say that Marshal is more than PStore.

Working directly with Marshal allows greater flexibility than using the PStore wrapper, for example, if they decided to write a filesystem database class. :)

If you're looking for a way to manage marshalled (or string or yaml...) data in multiple files, using file paths as db keys, look no further than: http://raa.ruby-lang.org/project/fsdb/

Cool project, thanks for writing it. Sounds useful.

-igal

true might be a better choice than nil ;)

{}[42] --> nil
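Sketched out, the pitfall with nil as the marker value (key 42 is arbitrary):

```ruby
h = {}
h[42] = nil # member recorded with a nil marker

h[42]       # => nil, indistinguishable from an absent key
h[999]      # => nil
h.key?(42)  # => true; only key? can tell the two apart
h.key?(999) # => false

g = { 42 => true } # with true as the marker, plain lookup is enough
g[42]  # => true
g[999] # => nil (falsy)
```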

R.

···

On Mon, Jul 7, 2008 at 5:55 PM, Dave Bass <davebass@musician.org> wrote:

Robert Dober wrote:

Thanks, that's exactly what I was looking for! I didn't know set
basically works like a hash without a key...

make that "without a value".

For sets in Perl I've used hashes with an arbitrary value of 1, or
undef. In Ruby I guess that would be values of true or nil. Any better
suggestions, apart from using Set of course?

Igal Koshevoy wrote:

Joel VanderWerf wrote:

Igal Koshevoy wrote:

1. PStore standard library: Put your objects into a magical hash, that's automatically persisted to a file. Probably the quickest and easiest solution. See http://www.ruby-doc.org/stdlib/libdoc/pstore/rdoc/classes/PStore.html

PStore writes the whole file at once, not incrementally. Not really what OP is looking for, IMO.

It takes ~2s for my machine to read or write the 50MB PStore file. This isn't a big deal if the original poster (OP) doesn't mind keeping the program running to process multiple sequences at once.

I got the impression that Mr. O. P. was trying to avoid waiting until the end of the download to write the file (maybe in case the network went down halfway through).

3. Marshal core class: Dump objects to and from strings, and then files. Useful if you need something more than PStore, but still want to persist objects directly. See module Marshal - RDoc Documentation

PStore uses Marshal, so it's odd to say that Marshal is more than PStore.

Working directly with Marshal allows greater flexibility than using the PStore wrapper, for example, if they decided to write a filesystem database class. :)

Less is more ;)

···

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Robert Dober wrote:

···

On Mon, Jul 7, 2008 at 5:55 PM, Dave Bass <davebass@musician.org> wrote:

Robert Dober wrote:

Thanks, that's exactly what I was looking for! I didn't know set
basically works like a hash without a key...

make that "without a value".

For sets in Perl I've used hashes with an arbitrary value of 1, or
undef. In Ruby I guess that would be values of true or nil. Any better
suggestions, apart from using Set of course?

true might be a better choice than nil ;)

{}[42] --> nil

R.

Tsk. Don't you know that "true or nil" evaluates to "true"? :P

(Srsly, I think he meant true for membership and nil otherwise.)

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407