Here's what I'd probably do.
Create a custom class (and not use a Hash) for this, e.g.
Score = Struct.new :seq, :score
Create another structure that caches the scores together with a bit mask
of the positions to downcase, one mask per cutoff value, e.g.
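ScoreCache = Struct.new :score do
  # Bit mask for cutoff: bit idx is set if position idx scores below cutoff
  def mask(cutoff)
    cache[cutoff]
  end

  # Downcases, in place, every char of seq whose score is below cutoff
  def mask_sequence(cutoff, seq)
    mask(cutoff).each_bit do |idx|
      # downcase, not downcase!: the bang version returns nil if nothing changed
      seq[idx] = seq[idx].downcase
    end
    seq
  end

  private

  # Lazily computes and remembers one mask per cutoff
  def cache
    @cache ||= Hash.new do |h,cutoff|
      c = 0
      score.each_char.with_index do |ch,idx|
        c |= (1 << idx) if ch.ord - BASE_SOLEXA < cutoff
      end
      h[cutoff] = c
    end
  end
end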
# Store score string -> ScoreCache
global_score_cache = Hash.new do |h,score|
  h[score] = ScoreCache.new score
end
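class Integer
  # Yields the index of each set bit, lowest bit first
  def each_bit
    raise "Currently only positive implemented" if self < 0
    if block_given?
      idx = 0
      x = self
      while x != 0
        yield idx if x[0] == 1
        idx += 1
        x >>= 1
      end
      self
    else
      # Enumerator.new(obj, :meth) is gone in current Rubies; to_enum does the same
      to_enum :each_bit
    end
  end
end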
And then use it and profile.
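
To make "use it" concrete, here is a compact, self-contained sketch of how the pieces might fit together. It is only an illustration: BASE_SOLEXA = 64 is my assumption for the offset (use whatever constant your code already defines), the sample sequence and score string are made up, and score_cache is just an illustrative name.

```ruby
BASE_SOLEXA = 64 # assumed ASCII offset; substitute the constant from your code

class Integer
  # Yields the index of each set bit, lowest bit first (non-negative only).
  def each_bit
    raise ArgumentError, "negative numbers not supported" if self < 0
    return to_enum(:each_bit) unless block_given?
    idx = 0
    x = self
    while x != 0
      yield idx if x[0] == 1
      idx += 1
      x >>= 1
    end
    self
  end
end

ScoreCache = Struct.new(:score) do
  # Returns the bit mask for cutoff, computing it only on first request.
  def mask(cutoff)
    @cache ||= Hash.new do |h, c|
      bits = 0
      score.each_char.with_index do |ch, idx|
        bits |= (1 << idx) if ch.ord - BASE_SOLEXA < c
      end
      h[c] = bits
    end
    @cache[cutoff]
  end

  # Downcases, in place, the chars of seq at the masked positions.
  def mask_sequence(cutoff, seq)
    mask(cutoff).each_bit { |idx| seq[idx] = seq[idx].downcase }
    seq
  end
end

# One ScoreCache per distinct score string.
score_cache = Hash.new { |h, score| h[score] = ScoreCache.new(score) }

seq    = "ATCGAT"
scores = "hhhAAA" # made-up sample: three high scores, then three low ones
masked = score_cache[scores].mask_sequence(20, seq.dup)
puts masked # => ATCgat
```

The cache only pays off if the same score string and cutoff come up repeatedly, which is exactly what profiling should tell you.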
Kind regards
robert
2010/8/30 Martin Hansen <mail@maasha.dk>:
You could do this though
seq.gsub!(/./) do |m|
  scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase : m
end
Not too nice though. You could however do some preparation, e.g.
store scores as an Array of Fixnum instead of using #ord.
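
A minimal sketch of that preparation step, under a few assumptions: BASE_SOLEXA = 64 stands in for the real constant, downcase_below is a hypothetical helper name, and I have swapped $`.length for Regexp.last_match.begin(0), which gives the same index without building the pre-match string each time.

```ruby
BASE_SOLEXA = 64 # assumed ASCII offset; substitute the constant from your code

# Hypothetical helper: returns a copy of seq with low-scoring positions downcased.
def downcase_below(seq, scores, cutoff)
  values = scores.bytes # one Integer per score char, computed once per call
  seq.gsub(/./) do |m|
    idx = Regexp.last_match.begin(0) # offset of the current match in seq
    values[idx] - BASE_SOLEXA < cutoff ? m.downcase : m
  end
end

downcase_below("ATCGAT", "hhhAAA", 20) # => "ATCgat"
```

Whether the up-front conversion is worth it for millions of short strings is something only a profile run can answer.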
Yes, converting scores to arrays is bad since the scores are parsed from
files as strings (millions of them). And I am unsure if substr
substitutions are very efficient ...
We need to trick this into the regex engine somehow.
How about transforming scores to a mask string like this: 000111 where 1
indicates that the corresponding sequence char should be lowercased
(that can be done with tr). Then we plug this onto the sequence string:
seq = "ATCGAT000111"
And then we construct a regex with a lookahead assertion that
reads the mask and manipulates the ATCG chars?
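
For what it is worth, the mask-string idea could be sketched roughly like this. Everything here is an assumption for illustration: BASE_SOLEXA = 64, the helper names score_mask and apply_mask, score strings that are printable ASCII, and a cutoff small enough that the upper end of the tr range is not a tr metacharacter (backslash, caret or hyphen).

```ruby
BASE_SOLEXA = 64 # assumed ASCII offset; substitute the constant from your code

# Hypothetical helper: "1" where the score is below cutoff, "0" elsewhere.
def score_mask(scores, cutoff)
  high = (BASE_SOLEXA + cutoff - 1).chr # highest char still below cutoff
  # First tr marks everything from space up to high; second zeroes the rest.
  scores.tr(" -#{high}", "1").tr("^1", "0")
end

# Hypothetical helper: appends the mask to the sequence and lets a lookahead
# read the mask char that sits exactly seq.length positions further on.
def apply_mask(seq, mask)
  n = seq.length
  return seq.dup if n.zero?
  combined = seq + mask
  lowered = combined.gsub(/[A-Z](?=.{#{n - 1}}1)/) { |m| m.downcase }
  lowered[0, n] # drop the appended mask again
end

mask = score_mask("hhhAAA", 20) # => "000111"
apply_mask("ATCGAT", mask)      # => "ATCgat"
```

Note that the lookahead has to skip n - 1 characters for every position it tests, so the work per sequence may grow quadratically with its length; this keeps the loop inside the regex engine but is not obviously faster, so it needs measuring.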
--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/