Speed issues iterating over chars

Martin_Hansen · 30 August 2010 07:58

Hello all,

This snip of code needs more speed. Ideas? What is the state of Ruby and
Inline C?

Cheers,

Martin

class Hash
BASE_SOLEXA = 64

  # Soft masks sequence residues where the corresponding quality score
  # is below a given cutoff.
  def mask!(cutoff)
    if self.has_key? :SEQ and self.has_key? :SCORES
      seq = self[:SEQ]
      scores = self[:SCORES]
      i = 0

      scores.each_char do |score|
        seq[i] = seq[i].downcase if score.ord - BASE_SOLEXA <= cutoff
        i += 1
      end

self[:SEQ] = seq
end

self
end
end

···

--
Posted via http://www.ruby-forum.com/.

Robert_K1 · 30 August 2010 09:02

What about using String#gsub!?

# untested
seq.gsub! /(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" %
ch}.join("|")})+/ do |m|
m.downcase!
end

You could speed up calculation of the regexp by placing this in the
default block of a Hash, e.g.

RXS = Hash.new {|h,cutoff| h[cutoff] =
/(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" %
ch}.join("|")})+/

and then

seq.gsub! RXS[cutoff] do |m|
m.downcase!
end

Kind regards

robert

···

2010/8/30 Martin Hansen <mail@maasha.dk>:

Hello all,

This snip of code needs more speed. Ideas? What is the state of Ruby and
Inline C?

Cheers,

Martin

class Hash
BASE_SOLEXA = 64

# Soft masks sequence residues where the corresponding quality score
# is below a given cutoff.
def mask!(cutoff)
if self.has_key? :SEQ and self.has_key? :SCORES
seq = self[:SEQ]
scores = self[:SCORES]
i = 0
 scores\.each\_char do |score|
   seq\[i\] = seq\[i\]\.downcase if score\.ord \- BASE\_SOLEXA &lt;= cutoff
   i \+= 1
 end

 self\[:SEQ\] = seq
end

self
end
end

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Martin_Hansen · 30 August 2010 13:27

You could speed up calculation of the regexp by placing this in the
default block of a Hash, e.g.

RXS = Hash.new {|h,cutoff| h[cutoff] =
/(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" %
ch}.join("|")})+/

and then

seq.gsub! RXS[cutoff] do |m|
m.downcase!
end

This could be a good idea, but I am not entirely sure if it will work -
at least the current suggestion gives an empty RXS hash. And I am unsure
about what this do? map{|ch|"\\u00%02x" % ch}

Thinking about this problem I wonder if it would be an idea to create a
mask with transliterate and then use a bitwise operation to downcase.
IIRC you change the case with a bitwise | operation with ' '. It doesn't
look like Ruby have any methods bitwise operations on strings.

Cheers,

Martin

···

--
Posted via http://www.ruby-forum.com/\.

Robert_K1 · 30 August 2010 13:45

You could speed up calculation of the regexp by placing this in the
default block of a Hash, e.g.

RXS = Hash.new {|h,cutoff| h[cutoff] =
/(?:#{(0..(BASE_SOLEXA+cutoff)).map{|ch|"\\u00%02x" %
ch}.join("|")})+/

and then

seq.gsub! RXS[cutoff] do |m|
m.downcase!
end

This could be a good idea, but I am not entirely sure if it will work -
at least the current suggestion gives an empty RXS hash. And I am unsure
about what this do? map{|ch|"\\u00%02x" % ch}

Just try it out in IRB. This creates a regexp with unicode names.
You could however use Fixnum#chr instead, e.g.

#{ (0..(BASE_SOLEXA + cutoff)).map {|ch| Regexp.escape(ch.chr)}.join("|") }

This gives shorter and more readable strings.

Thinking about this problem I wonder if it would be an idea to create a
mask with transliterate and then use a bitwise operation to downcase.
IIRC you change the case with a bitwise | operation with ' '. It doesn't
look like Ruby have any methods bitwise operations on strings.

AFAIK there are none. But you can do

irb(main):011:0> s = "aaa"
=> "aaa"
irb(main):012:0> s[0] = (s[0].ord | 0x0F).chr
=> "o"
irb(main):013:0> s
=> "oaa"

Kind regards

robert

···

2010/8/30 Martin Hansen <mail@maasha.dk>:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Martin_Hansen · 30 August 2010 14:04

#{ (0..(BASE_SOLEXA + cutoff)).map {|ch|
Regexp.escape(ch.chr)}.join("|") }

This gives shorter and more readable strings.

OK, so for the right range that is equal to:

regex = Regexp.union((-5 .. cutoff).collect { |n| (n + BASE_SOLEXA).chr
} )

Which returns:

(?-mix:;|<|=|>|\?|@|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T)

But I dont get how the substitution will scan one string (scores) an
make changes to the other (seq):

seq = 'TTGGTCGCTCGCTCCGCGACCTCAGATCAGACGTGGGCGAT'
scores = '@ABCDEFGHIJK?MNOPQRSTUVWhgfedcba`_^]\[ZYX'

?

AFAIK there are none.

That is actually strange. In Perl you can do this:

perl -le 'print "ABCDEFG" | " \0\0 "'
=> abCDefg

Which I believe is very efficient. And the mask with " " and "\0" can be
constructed with transliterate efficiently too.

Cheers,

Martin

···

--
Posted via http://www.ruby-forum.com/\.

Robert_K1 · 30 August 2010 14:29

#{ (0..(BASE_SOLEXA + cutoff)).map {|ch|
Regexp.escape(ch.chr)}.join("|") }

This gives shorter and more readable strings.

OK, so for the right range that is equal to:

regex = Regexp.union((-5 .. cutoff).collect { |n| (n + BASE_SOLEXA).chr
} )

Which returns:

(?-mix:;|<|=|>|\?|@|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T)

But I dont get how the substitution will scan one string (scores) an
make changes to the other (seq):

seq = 'TTGGTCGCTCGCTCCGCGACCTCAGATCAGACGTGGGCGAT'
scores = '@ABCDEFGHIJK?MNOPQRSTUVWhgfedcba`_^]\[ZYX'

?

It doesn't. Somehow I must have missed an important detail...

You could do this though

seq.gsub! /./ do |m|
scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end

Not too nice though. You could however do some preparation, e.g.
store scores as an Array of Fixnum instead of using #ord.

AFAIK there are none.

That is actually strange. In Perl you can do this:

perl -le 'print "ABCDEFG" | " \0\0 "'
=> abCDefg

Which I believe is very efficient. And the mask with " " and "\0" can be
constructed with transliterate efficiently too.

Kind regards

robert

···

2010/8/30 Martin Hansen <mail@maasha.dk>:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Martin_Hansen · 30 August 2010 14:55

You could do this though

seq.gsub! /./ do |m|
scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end

Not too nice though. You could however do some preparation, e.g.
store scores as an Array of Fixnum instead of using #ord.

Yes, converting scores to arrays is bad since the scores are parsed from
files as strings (millions of them). And I am unsure if substr
substitutions are very efficient ...

We need to trick this into the regex engine somehow.

How about transforming scores to a mask string like this: 000111 where 1
indicates that the corresponding sequence char should be lowercased
(that can be done with tr). Then we plug this onto the sequence string:

seq = "ATCGAT000111"

And then we construct a regex with a forward looking identifier that
reads the mask and manipulates the ATCG chars?

Martin

···

--
Posted via http://www.ruby-forum.com/\.

Robert_K1 · 1 September 2010 08:25

Here's what I'd probably do.

Create a custom class (and not use a Hash) for this, e.g.

Score = Struct.new :seq, :score

Create another structure for caching scores and a bit representation
for downcase dependent on cutoff valiue, e.g.

ScoreCache = Struct.new :score do
  def mask(cutoff)
    cache[cutoff]
  end

  def mask_sequence(cutoff, seq)
    mask(cutoff).each_bit do |idx|
      seq[idx] = seq[idx].downcase!
    end

seq
end

private
  def cache
    @cache ||= Hash.new do |h,cutoff|
      c = 0

      self.score.each_with_index |ch,idx|
        c |= (1 << idx) if ch.ord - BASE_SOLEXA < cutoff
      end

      h[cutoff] = c
    end
  end
end

# Store score string -> ScoreCache
global_score_cache = Hash.new do |h,score|
h[score] = ScoreCache.new score
end

class Integer
def each_bit
raise "Currently only positive implemented" if self < 0

    if block_given?
      idx = 0
      x = self

      while x != 0
        yield idx if x[0] == 1
        idx += 1
        x >>= 1
      end

      self
    else
      Enumerator.new self, :each_bit
    end
  end
end

And then use it and profile.

Kind regards

robert

···

2010/8/30 Martin Hansen <mail@maasha.dk>:

You could do this though

seq.gsub! /./ do |m|
scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end

Not too nice though. You could however do some preparation, e.g.
store scores as an Array of Fixnum instead of using #ord.

Yes, converting scores to arrays is bad since the scores are parsed from
files as strings (millions of them). And I am unsure if substr
substitutions are very efficient ...

We need to trick this into the regex engine somehow.

How about transforming scores to a mask string like this: 000111 where 1
indicates that the corresponding sequence char should be lowercased
(that can be done with tr). Then we plug this onto the sequence string:

seq = "ATCGAT000111"

And then we construct a regex with a forward looking identifier that
reads the mask and manipulates the ATCG chars?

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Martin_Hansen · 1 September 2010 11:35

Thanks Robert,

The gsub solution seems to be reasonably efficient.

seq.gsub! /./ do |m|
scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end

But my original proposed naive loop is twice as fast:

scores.each_char do |score|
seq[i] = seq[i].downcase if score.ord - BASE_SOLEXA <= cutoff
i += 1
end

I dont really know how gsub and tr compares to the Perl equivalents
speed wise - in Perl tr is precompiling a lookup table that is evil fast
and the regex engine is also primed at compile time and runs extremely
fast.

I suspect that you need some C extension to go faster than this, but I
don't really want to spend the time on that. I was exploring Inline C
but that appears very fragile - I cannot even get the example from the
cookbook up and running under Ruby 191/192.

Looking at your last proposal I see three iterations where two are
running narrow if loops. I have not testet it, but it looks suspicious.

It is a shame that Ruby does not have build in bitwise operators for the
String class allowing C speed masking. I have tried to get on the core
mailing list to post this as a feature suggestion, but the ML form is
stuffed.

Cheers,

Martin

···

--
Posted via http://www.ruby-forum.com/\.

Robert_K1 · 1 September 2010 14:27

Thanks Robert,

The gsub solution seems to be reasonably efficient.

seq.gsub! /./ do |m|
scores[$`.length].ord - BASE_SOLEXA < cutoff ? m.downcase! : m
end

But my original proposed naive loop is twice as fast:

scores.each_char do |score|
seq[i] = seq[i].downcase if score.ord - BASE_SOLEXA <= cutoff
i += 1
end

I dont really know how gsub and tr compares to the Perl equivalents
speed wise - in Perl tr is precompiling a lookup table that is evil fast
and the regex engine is also primed at compile time and runs extremely
fast.

Regexp is precompiled but I suspect that tr works at runtime only.
For definitive answer you'll have to look at the source.

I suspect that you need some C extension to go faster than this, but I
don't really want to spend the time on that. I was exploring Inline C
but that appears very fragile - I cannot even get the example from the
cookbook up and running under Ruby 191/192.

Looking at your last proposal I see three iterations where two are
running narrow if loops. I have not testet it, but it looks suspicious.

Well, but the caching should avoid that too many loops are executed.
I do not know however, how often you reuse values. If you need this
in several processes you could save the current state in a file via
Marshal which is quite fast.

Kind regards

robert

···

On Wed, Sep 1, 2010 at 1:35 PM, Martin Hansen <mail@maasha.dk> wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Topic		Replies	Views
Optimization anyone ruby-talk	5	67	22 November 2005
String#chop slow? REALLY slow? ruby-talk	13	143	2 August 2006
Slow regular expressions :( ruby-talk	28	135	28 July 2006
Need for speed -> a C extension? ruby-talk	27	190	15 May 2011
Performance stats of String#scan, strscan and a homemade approach ruby-talk	1	115	3 August 2004

Speed issues iterating over chars

Related topics