Need for speed -> a C extension?

Hello all,

The below code is too slow for practical use. I need it to run at least
20 times faster. Perhaps that is possible with some C code? I have no
experience with writing Ruby extensions. What are the pitfalls? Which
part of the code should be ported? Any pointers to get me started?

Cheers,

Martin

#!/usr/bin/env ruby

# IUPAC nucleotide pair ambiguity equivalents are saved in an
# array of bit fields.

BIT_A = 1 << 0
BIT_T = 1 << 1
BIT_C = 1 << 2
BIT_G = 1 << 3

EQUAL = Array.new(256, 0)
EQUAL['A'.ord] = BIT_A
EQUAL['T'.ord] = BIT_T
EQUAL['U'.ord] = BIT_T
EQUAL['C'.ord] = BIT_C
EQUAL['G'.ord] = BIT_G
EQUAL['M'.ord] = (BIT_A|BIT_C)
EQUAL['R'.ord] = (BIT_A|BIT_G)
EQUAL['W'.ord] = (BIT_A|BIT_T)
EQUAL['S'.ord] = (BIT_C|BIT_G)
EQUAL['Y'.ord] = (BIT_C|BIT_T)
EQUAL['K'.ord] = (BIT_G|BIT_T)
EQUAL['B'.ord] = (BIT_C|BIT_G|BIT_T)
EQUAL['D'.ord] = (BIT_A|BIT_G|BIT_T)
EQUAL['H'.ord] = (BIT_A|BIT_C|BIT_T)
EQUAL['V'.ord] = (BIT_A|BIT_C|BIT_G)
EQUAL['N'.ord] = (BIT_A|BIT_C|BIT_G|BIT_T)

# Module containing code to locate nucleotide patterns in sequences
allowing for
# ambiguity codes and a given maximum edit distance.
# Insertions are nucleotides found in the pattern but not in the
sequence.
# Deletions are nucleotides found in the sequence but not in the
pattern.

···

#
# Inspired by the paper by Bruno Woltzenlogel Paleo (page 197):
# http://www.logic.at/people/bruno/Papers/2007-GATE-ESSLLI.pdf
module PatternMatcher
  #
------------------------------------------------------------------------------
  # str.match(pattern[, pos[, max_edit_distance]])
  # -> Match or nil
  #
  #
------------------------------------------------------------------------------
  # Method to locate the next pattern match starting from a given
position. A match
  # is allowed to contain a given maximum edit distance. If a match is
located a
  # Match object will be returned otherwise nil.
  def match(pattern, pos = 0, max_edit_distance = 0)
    @pattern = pattern
    @pos = pos
    @max_edit_distance = max_edit_distance
    @vector = vector_init

    while @pos < @seq.length
      vector_update

      return match_new if match_found?

      @pos += 1
    end
  end

  #
------------------------------------------------------------------------------
  # str.scan(pattern[, pos[, max_edit_distance]])
  # -> Array
  # str.scan(pattern[, pos[, max_edit_distance]]) { |match|
  # block
  # }
  # -> Match
  #
  #
------------------------------------------------------------------------------
  # Method to iterate through a sequence to locate pattern matches
starting
  # from a given position and allowing for a maximum edit distance.
  # Matches found in block context return the Match object. Otherwise
matches are
  # returned in an Array.
  def scan(pattern, pos = 0, max_edit_distance = 0)
    matches = []

    while match = match(pattern, pos, max_edit_distance)
      if block_given?
        yield match
      else
        matches << match
      end

      pos = match.pos + 1
    end

    return matches unless block_given?
  end

  private

  # Method to initailize the score vector and return this.
  def vector_init
    vector = []

    (0 ... @pattern.length + 1).each do |i|
      vector[i] = Score.new(matches = 0, mismatches = 0, insertions = i)
    end

    vector
  end

  # Method to update the score vector.
  def vector_update
    new_vector = @vector.dup

    (0 ... @pattern.length).each do |i|
      if match?(@seq[@pos], @pattern[i])
        new_vector[i + 1] = @vector[i].dup
        new_vector[i + 1].matches += 1
      else
        mismatch = @vector[i].dup
        insertion = new_vector[i].dup
        deletion = @vector[i + 1].dup

        if deletion?(mismatch, insertion, deletion)
          deletion.deletions += 1
          new_vector[i + 1] = deletion
        elsif mismatch?(mismatch, insertion, deletion)
          mismatch.mismatches += 1
          new_vector[i + 1] = mismatch
        elsif insertion?(mismatch, insertion, deletion)
          insertion.insertions += 1
          new_vector[i + 1] = insertion
        end
      end
    end

    @vector = new_vector
  end

  # Method to determine if a match occurred.
  def match?(char1, char2)
    (EQUAL[char1.upcase.ord] & EQUAL[char2.upcase.ord]) != 0
  end

  # Method to determine if a mismatch occured.
  def mismatch?(mismatch, insertion, deletion)
    if mismatch.edit_distance <= insertion.edit_distance and
       mismatch.edit_distance <= deletion.edit_distance
      true
    end
  end

  # Method to determine if an insertion occured.
  def insertion?(mismatch, insertion, deletion)
    if insertion.edit_distance <= mismatch.edit_distance and
       insertion.edit_distance <= deletion.edit_distance
      true
    end
  end

  # Method to determine if a deletion occured.
  def deletion?(mismatch, insertion, deletion)
    if deletion.edit_distance <= mismatch.edit_distance and
       deletion.edit_distance <= insertion.edit_distance
      true
    end
  end

  # Method to print the score vector.
  def vector_print
    @vector.each do |s|
      puts s
    end

    puts
  end

  # Method that returns a Match object initialized with
  # information from the score vector.
  def match_new
    matches = @vector.last.matches
    mismatches = @vector.last.mismatches
    insertions = @vector.last.insertions
    deletions = @vector.last.deletions
    length = @pattern.length - insertions + deletions
    pos = @pos - length + 1
    match = @seq[pos ... pos + length]

    Match.new(pos, match, matches, mismatches, insertions, deletions,
length)
  end

  # Method that determines if a match was found by analyzing the score
vector.
  def match_found?
    if @vector.last.edit_distance <= @max_edit_distance
      true
    end
  end

  # Class to instantiate Score objects that holds score information.
  class Score
    attr_accessor :matches, :mismatches, :insertions, :deletions

    def initialize(matches = 0, mismatches = 0, insertions = 0,
deletions = 0)
      @matches = matches
      @mismatches = mismatches
      @insertions = insertions
      @deletions = deletions
    end

    # Method to calculate and return the edit distance.
    def edit_distance
      self.mismatches + self.insertions + self.deletions
    end

    private

    def to_s
      "(#{[self.matches, self.mismatches, self.insertions,
self.deletions].join(',')})"
    end
  end

  # Class for creating Match objects which contain the description of a
  # match between a nucleotide sequence and a pattern.
  class Match
    attr_reader :pos, :match, :matches, :mismatches, :insertions,
:deletions, :edit_distance, :length

    def initialize(pos, match, matches, mismatches, insertions,
deletions, length)
      @pos = pos
      @match = match
      @matches = matches
      @mismatches = mismatches
      @insertions = insertions
      @deletions = deletions
      @edit_distance = mismatches + insertions + deletions
      @length = length
    end
  end
end

--
Posted via http://www.ruby-forum.com/.

I pasted the code into a local file and ran it. It completed instantly.

It would be super helpful if you provided *all* of the code to make this run and do some useful work. By doing so, people who want to help you can run it through a profiler and identify hot spots.

cr

···

On Apr 18, 2011, at 10:15 AM, Martin Hansen wrote:

Hello all,

The below code is too slow for practical use. I need it to run at least
20 times faster. Perhaps that is possible with some C code? I have no
experience with writing Ruby extensions. What are the pitfalls? Which
part of the code should be ported? Any pointers to get me started?

Can you be a bit more specific about the slow part or even extract a
simple use case which demonstrates the issue you are having? Also,
copying it from your email and then fixing all the line breaks is
tedious. Please use a more appropriate means to distribute the code
in such cases (e.g. pastie, gist).

Kind regards

robert

···

On Mon, Apr 18, 2011 at 5:15 PM, Martin Hansen <mail@maasha.dk> wrote:

The below code is too slow for practical use. I need it to run at least
20 times faster. Perhaps that is possible with some C code? I have no
experience with writing Ruby extensions. What are the pitfalls? Which
part of the code should be ported? Any pointers to get me started?

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Martin Hansen wrote:

The below code is too slow for practical use. I need it to run at least
20 times faster. Perhaps that is possible with some C code? I have no
experience with writing Ruby extensions. What are the pitfalls? Which
part of the code should be ported? Any pointers to get me started?

Please give a clear description of the algorithm, and then
give some sample input and output.

Sorry guys, I have been busy with a few other things before continuing
on this one (also solving the pesky issue with my RubyInline install).

@Robert.

I have been thinking hard about your comments - these contain a lot of
programming insight of the kind you dont get from a "learnings
<programming language> book". However, I struggle with grasping the
wisdom:

"Frankly, I find your code has a design issue: it seems you mix data
and iteration in a single class. This is visible from how #match
works

def match(pattern, pos = 0, max_edit_distance = 0)
    @pattern = pattern
    @pos = pos
    @max_edit_distance = max_edit_distance
    @vector = vector_init

...

IMHO it would be better to separate representation of the sequence and
the matching process. The matcher then would only carry a reference
to the sequence and all the data it needs to do matching.
"

What is this "mixing data with iteration in a single class"? To me you
have data and then you iterate - what exactly is the problem? What
should I do for separating these?

&&

"Maybe on the interface, but you create side effects on the String (Seq
in your case). This is neither a clean separation (makes classes
bigger and harder to understand) nor is it thread safe (e.g. if you
want to try to concurrently match several patterns against the same
sequence)."

Again, "separation", but what is the problem with big classes? When is a
class too big? And how to divide your code in the best way?

Also, I managed to get down to a single vector, but I think I may have
more .dup's than needed - though removing any causes erroneous output:

http://pastie.org/1902844

Cheers,

Martin

···

--
Posted via http://www.ruby-forum.com/.

WJ wrote in post #993576:

Martin Hansen wrote:

The below code is too slow for practical use. I need it to run at least
20 times faster. Perhaps that is possible with some C code? I have no
experience with writing Ruby extensions. What are the pitfalls? Which
part of the code should be ported? Any pointers to get me started?

Please give a clear description of the algorithm, and then
give some sample input and output.

Here is a working version of the code that can be profiled (though it
will take forever with 20M iterations):

http://pastie.org/1808127

The slow part according to profiler is:

  % cumulative self self total
time seconds seconds calls ms/call ms/call name
29.39 2.66 2.66 1521 1.75 11.78 Range#each
15.80 4.09 1.43 33000 0.04 0.06 Seq#match?
10.72 5.06 0.97 78940 0.01 0.03 Kernel.dup
  9.28 5.90 0.84 78940 0.01 0.01
Kernel.initialize_dup
  6.63 6.50 0.60 142380 0.00 0.00
Seq::Score#edit_distance
  5.30 6.98 0.48 22220 0.02 0.03 Seq#deletion?
  3.54 7.30 0.32 66016 0.00 0.00 String#ord
  3.43 7.61 0.31 14680 0.02 0.04 Seq#mismatch?
  3.31 7.91 0.30 8300 0.04 0.05 Seq#insertion?

The input is DNA sequences. Basically strings of ATCG and Ns of length
50-100. These comes in files with 20M-30M sequences per file. I've got
~50 of these files and more incoming. The output will be truncated
sequences based on the match position located with this bit of code.

The algorithm is of the dynamic programming flavor and was inspired by
the paper by Bruno Woltzenlogel Paleo (page 197):

http://www.logic.at/people/bruno/Papers/2007-GATE-ESSLLI.pdf

Locating variable length matches is tricky!

Cheers,

Martin

···

--
Posted via http://www.ruby-forum.com/\.

Sorry guys, I have been busy with a few other things before continuing
on this one (also solving the pesky issue with my RubyInline install).

@Robert.

I have been thinking hard about your comments - these contain a lot of
programming insight of the kind you dont get from a "learnings
<programming language> book". However, I struggle with grasping the
wisdom:

"Frankly, I find your code has a design issue: it seems you mix data
and iteration in a single class. This is visible from how #match
works

def match(pattern, pos = 0, max_edit_distance = 0)
     @pattern = pattern
     @pos = pos
     @max_edit_distance = max_edit_distance
     @vector = vector_init

..

IMHO it would be better to separate representation of the sequence and
the matching process. The matcher then would only carry a reference
to the sequence and all the data it needs to do matching.
"

What is this "mixing data with iteration in a single class"? To me you
have data and then you iterate - what exactly is the problem? What
should I do for separating these?

I am referring to http://pastie.org/1808127 : The issue with the code presented lies in the *state*: an instance of Seq represents a sequence of items (in your case amino acids, I guess). You need state to represent this sequence. This state is stored in instance variables of an instance of class Seq.

The first thing that your method #match (see above) does, is to set some other instance variables of Seq. This poses a problem if

- the Seq instance is frozen, i.e. immutable,

- more than one matching processes are under way concurrently (i.e. #match is invoked from more than one thread at a time).

The reason is that you mixed state needed to represent your sequence with state needed to execute the matching process. Apart from the problems listed above this also makes code harder to read and more error prone. For example, you might modify the Seq implementation and accidentally reuse the name of an instance variable which then can make your code break in unpredictable ways.

"Maybe on the interface, but you create side effects on the String (Seq
in your case). This is neither a clean separation (makes classes
bigger and harder to understand) nor is it thread safe (e.g. if you
want to try to concurrently match several patterns against the same
sequence)."

Again, "separation", but what is the problem with big classes? When is a
class too big? And how to divide your code in the best way?

There is no easy answer to that. In this case you violate modularity by lumping too many different functionalities together in a single class. Apart from the advantage to avoid mischief laid out above by decoupling the matching process from the sequence representation you also gain modularity if you let the matching process only rely on the public interface of Seq. That way you can even have different implementations of sequences which can all be scanned by the same representation. I am not saying you should necessarily do this as efficient matching might also need some knowledge of Seq internals, but this is to illustrate what kind of things you should consider when designing classes.

Also, I managed to get down to a single vector, but I think I may have
more .dup's than needed - though removing any causes erroneous output:

http://pastie.org/1902844

That code still stores matching state in the Seq instance. And you have quite a few #dups which have object creation overhead. I think you should better work with two Array of Score instances and swap them at each sequence position. IMHO you do not even need to reinitialize Score instances because of your dynamic programming approach it is guaranteed that you access instances sequentially from index 0 on and need to access at most last[i - 1], last[i] and current[i - 1].

Kind regards

  robert

···

On 15.05.2011 11:16, Martin Hansen wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

WJ wrote in post #993576:

Martin Hansen wrote:

The below code is too slow for practical use. I need it to run at least
20 times faster. Perhaps that is possible with some C code? I have no
experience with writing Ruby extensions. What are the pitfalls? Which
part of the code should be ported? Any pointers to get me started?

Please give a clear description of the algorithm, and then
give some sample input and output.

Here is a working version of the code that can be profiled (though it
will take forever with 20M iterations):

http://pastie.org/1808127

The slow part according to profiler is:

Profiling is great, but you also need to write a benchmark in order to objectively measure the work you're doing. You might also want to write microbenchmarks against the bottlenecks you identify in your profiling work. Eg:

def match?(char1, char2)
   (EQUAL[char1.upcase.ord] & EQUAL[char2.upcase.ord]) != 0
end

That would be really easy to write tests and a benchmark against and then rewrite in C using RubyInline... without tests and a benchmark tho, you won't know that you've done it correctly and provides a measurable benefit.

···

On Apr 18, 2011, at 11:30 , Martin Hansen wrote:

def match?(char1, char2)
   (EQUAL[char1.upcase.ord] & EQUAL[char2.upcase.ord]) != 0
end

That would be really easy to write tests and a benchmark against and
then rewrite in C using RubyInline... without tests and a benchmark tho,
you won't know that you've done it correctly and provides a measurable
benefit.

They bottlenecks are the iteration over the sequence (while loop at line
68) and the vector_update (line 120). I am a bit surprised that match?
is that slow - I expect it to be almost instantaneous in C. I would like
to test that with a benchmark and Inline C.

And I just discovered RubyInline! Inline code is so much easier to deal
with that an external library IMHO. I have been messing a bit around
with "inline", but that turns out to be some old crappy gem!!! Its not
the first time I have missed the right gem :o( ...

However, I get a peculiar error when testing RubyInline with the first
of the examples:

http://pastie.org/1811057

I am on Ruby 1.9.2.

Cheers,

Martin

···

--
Posted via http://www.ruby-forum.com/\.

Frankly, I find your code has a design issue: it seems you mix data
and iteration in a single class. This is visible from how #match
works

def match(pattern, pos = 0, max_edit_distance = 0)
    @pattern = pattern
    @pos = pos
    @max_edit_distance = max_edit_distance
    @vector = vector_init

...

IMHO it would be better to separate representation of the sequence and
the matching process. The matcher then would only carry a reference
to the sequence and all the data it needs to do matching.

Also #vector_update creates a lot of objects and does so for each
position in the sequence. That's likely where you can improve things.

I am not sure what the matching algorithm is exactly. Can you summarize it?

Ah, and one thing: if you add lowercase entries

EQUAL['a'.ord] = BIT_A
EQUAL['t'.ord] = BIT_T
EQUAL['u'.ord] = BIT_T
EQUAL['c'.ord] = BIT_C
EQUAL['g'.ord] = BIT_G
EQUAL['m'.ord] = (BIT_A|BIT_C)
EQUAL['r'.ord] = (BIT_A|BIT_G)
EQUAL['w'.ord] = (BIT_A|BIT_T)
EQUAL['s'.ord] = (BIT_C|BIT_G)
EQUAL['y'.ord] = (BIT_C|BIT_T)
EQUAL['k'.ord] = (BIT_G|BIT_T)
EQUAL['b'.ord] = (BIT_C|BIT_G|BIT_T)
EQUAL['d'.ord] = (BIT_A|BIT_G|BIT_T)
EQUAL['h'.ord] = (BIT_A|BIT_C|BIT_T)
EQUAL['v'.ord] = (BIT_A|BIT_C|BIT_G)
EQUAL['n'.ord] = (BIT_A|BIT_C|BIT_G|BIT_T)

you can make matching simpler

def match?(char1, char2)
    (EQUAL[char1.ord] & EQUAL[char2.ord]) != 0
end

You might as well consider changing the Array into a Hash. Then you
can even get rid of the #ord call.

A different approach would just use regular expressions, e.g.

MAP = {
  'a' => 1,
  't' => 2,
  'w' => '[12]',
...
}

seq = Array.new(20) { %w{1 2 4 8}.sample }
pat = "acgw"
rx = Regexp.new(pat.chars.map {|c| MAP[c.downcase]}.join)

seq.scan pat do |m|
  p m
end

Cheers

robert

···

On Tue, Apr 19, 2011 at 12:30 PM, Martin Hansen <mail@maasha.dk> wrote:

def match?(char1, char2)
(EQUAL[char1.upcase.ord] & EQUAL[char2.upcase.ord]) != 0
end

That would be really easy to write tests and a benchmark against and
then rewrite in C using RubyInline... without tests and a benchmark tho,
you won't know that you've done it correctly and provides a measurable
benefit.

They bottlenecks are the iteration over the sequence (while loop at line
68) and the vector_update (line 120). I am a bit surprised that match?
is that slow - I expect it to be almost instantaneous in C. I would like
to test that with a benchmark and Inline C.

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

What exact version? what platform? how was it installed?

$ uname -a
Linux linux116.ctc.com 2.6.32-71.24.1.el6.x86_64 #1 SMP Fri Apr 8
01:07:23 CDT 2011 x86_64 x86_64 x86_64 GNU/Linux

$ ruby -v j.rb
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]
120

$ cat j.rb
require "inline"

class MyTest
  inline do |builder|
    builder.c "
      long factorial(int max) {
        int i=max, result=1;
        while (i >= 2) { result *= i--; }
        return result;
      }"
  end
end

t = MyTest.new()

puts factorial_5 = t.factorial(5)

···

On Tue, Apr 19, 2011 at 6:30 AM, Martin Hansen <mail@maasha.dk> wrote:

And I just discovered RubyInline! Inline code is so much easier to deal
with that an external library IMHO. I have been messing a bit around
with "inline", but that turns out to be some old crappy gem!!! Its not
the first time I have missed the right gem :o( ...

However, I get a peculiar error when testing RubyInline with the first
of the examples:

http://pastie.org/1811057

I am on Ruby 1.9.2.

IMHO it would be better to separate representation of the sequence and
the matching process. The matcher then would only carry a reference
to the sequence and all the data it needs to do matching.

I am not sure if I understand this. I have tried to copy the behavior of
String#match.

Also #vector_update creates a lot of objects and does so for each
position in the sequence. That's likely where you can improve things.

Yes, that is quite possible. I might be able to skip .dup on line 128
and 130. That will require some thinking and testing on my side.

I am not sure what the matching algorithm is exactly. Can you summarize
it?

Well, it is a dynamic programming algorithm to do fuzzy searches of
patterns in strings - allowing for custom matching rules (A==N, etc) and
a maximum edit distance. Inspired by the paper by Bruno Woltzenlogel
Paleo (page 197):

http://www.logic.at/people/bruno/Papers/2007-GATE-ESSLLI.pdf

A short example:

http://pastie.org/1811496

you can make matching simpler

def match?(char1, char2)
    (EQUAL[char1.ord] & EQUAL[char2.ord]) != 0
end

Yes, but that should not give any significant speed increase?

You might as well consider changing the Array into a Hash. Then you
can even get rid of the #ord call.

Actually, I started with a hash for this - and it was slightly faster.
However, I think this bit field is very elegant - and since I was
preparing for porting to C - I think this is the way to go!

Cheers

Martin

···

--
Posted via http://www.ruby-forum.com/\.

unknown wrote in post #993757:

I am on Ruby 1.9.2.

What exact version? what platform? how was it installed?

$ uname -a
Darwin mel.local 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29
15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386 i386

$ ruby -v j.rb
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.6.0]
<internal:lib/rubygems/custom_require>:29:in `require': undefined
class/module Digest::Base (ArgumentError)
        from <internal:lib/rubygems/custom_require>:29:in `require'
        from
/Users/maasha/maasha_install/lib/ruby/gems/1.9.1/gems/RubyInline-3.9.0/lib/inline.rb:51:in
`<top (required)>'
        from <internal:lib/rubygems/custom_require>:33:in `require'
        from <internal:lib/rubygems/custom_require>:33:in `rescue in
require'
        from <internal:lib/rubygems/custom_require>:29:in `require'
        from j.rb:1:in `<main>'

Ruby I compiled and installed myself. Inline was installed: gem install
RubyInline

:o(

Martin

···

On Tue, Apr 19, 2011 at 6:30 AM, Martin Hansen <mail@maasha.dk> wrote:

--
Posted via http://www.ruby-forum.com/\.

IMHO it would be better to separate representation of the sequence and
the matching process. The matcher then would only carry a reference
to the sequence and all the data it needs to do matching.

I am not sure if I understand this. I have tried to copy the behavior of
String#match.

Maybe on the interface, but you create side effects on the String (Seq

Also #vector_update creates a lot of objects and does so for each
position in the sequence. That's likely where you can improve things.

Yes, that is quite possible. I might be able to skip .dup on line 128
and 130. That will require some thinking and testing on my side.

I am not sure what the matching algorithm is exactly. Can you summarize
it?

Well, it is a dynamic programming algorithm to do fuzzy searches of
patterns in strings - allowing for custom matching rules (A==N, etc) and
a maximum edit distance. Inspired by the paper by Bruno Woltzenlogel
Paleo (page 197):

http://www.logic.at/people/bruno/Papers/2007-GATE-ESSLLI.pdf

A short example:

http://pastie.org/1811496

you can make matching simpler

def match?(char1, char2)
(EQUAL[char1.ord] & EQUAL[char2.ord]) != 0
end

Yes, but that should not give any significant speed increase?

You might as well consider changing the Array into a Hash. Then you
can even get rid of the #ord call.

Actually, I started with a hash for this - and it was slightly faster.
However, I think this bit field is very elegant - and since I was
preparing for porting to C - I think this is the way to go!

I did not say you should get rid of the bit field! Plus, if you are
in need for speed and if it was slow then you should get rid of it
regardless of elegance.

Cheers

robert

···

On Tue, Apr 19, 2011 at 3:13 PM, Martin Hansen <mail@maasha.dk> wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Sorry for the half message, thick fingers hit the wrong key accidentally.

IMHO it would be better to separate representation of the sequence and
the matching process. The matcher then would only carry a reference
to the sequence and all the data it needs to do matching.

I am not sure if I understand this. I have tried to copy the behavior of
String#match.

Maybe on the interface, but you create side effects on the String (Seq
in your case). This is neither a clean separation (makes classes
bigger and harder to understand) nor is it thread safe (e.g. if you
want to try to concurrently match several patterns against the same
sequence).

Btw, in your case I would probably implement #scan and derive
implement match like

def match(*args)
  scan(*args) do |match_data|
    return match_data
  end

  nil # no match
end

That's pretty easy and you can still replace this implementation with
something else if it should be necessary.

Also #vector_update creates a lot of objects and does so for each
position in the sequence. That's likely where you can improve things.

Yes, that is quite possible. I might be able to skip .dup on line 128
and 130. That will require some thinking and testing on my side.

You could at least switch to only two arrays which you use alternating
with relatively moderate effort. You might even get away with a
single array.

I am not sure what the matching algorithm is exactly. Can you summarize
it?

Well, it is a dynamic programming algorithm to do fuzzy searches of
patterns in strings - allowing for custom matching rules (A==N, etc) and
a maximum edit distance. Inspired by the paper by Bruno Woltzenlogel
Paleo (page 197):

http://www.logic.at/people/bruno/Papers/2007-GATE-ESSLLI.pdf

A short example:

http://pastie.org/1811496

Ah, interesting. Thanks!

you can make matching simpler

def match?(char1, char2)
(EQUAL[char1.ord] & EQUAL[char2.ord]) != 0
end

Yes, but that should not give any significant speed increase?

Why not? You get at least rid of a Ruby method call. Did you measure
differences?

You might as well consider changing the Array into a Hash. Then you
can even get rid of the #ord call.

Actually, I started with a hash for this - and it was slightly faster.
However, I think this bit field is very elegant - and since I was
preparing for porting to C - I think this is the way to go!

I did not say you should get rid of the bit field! Plus, if you are
in need for speed and if it was slow then you should get rid of it
regardless of elegance.

Kind regards

robert

···

On Tue, Apr 19, 2011 at 3:13 PM, Martin Hansen <mail@maasha.dk> wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Thanks; I'll try to give it a try to replicate this from home later.

···

On Tue, Apr 19, 2011 at 10:12 AM, Martin Hansen <mail@maasha.dk> wrote:

unknown wrote in post #993757:

On Tue, Apr 19, 2011 at 6:30 AM, Martin Hansen <mail@maasha.dk> wrote:

I am on Ruby 1.9.2.

What exact version? what platform? how was it installed?

$ uname -a
Darwin mel.local 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29
15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386 i386

$ ruby -v j.rb
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.6.0]
<internal:lib/rubygems/custom_require>:29:in `require': undefined
class/module Digest::Base (ArgumentError)
from <internal:lib/rubygems/custom_require>:29:in `require'
from
/Users/maasha/maasha_install/lib/ruby/gems/1.9.1/gems/RubyInline-3.9.0/lib/inline.rb:51:in
`<top (required)>'
from <internal:lib/rubygems/custom_require>:33:in `require'
from <internal:lib/rubygems/custom_require>:33:in `rescue in
require'
from <internal:lib/rubygems/custom_require>:29:in `require'
from j.rb:1:in `<main>'

Ruby I compiled and installed myself. Inline was installed: gem install
RubyInline

unknown wrote in post #993757:

I am on Ruby 1.9.2.

What exact version? what platform? how was it installed?

$ uname -a
Darwin mel.local 10.7.0 Darwin Kernel Version 10.7.0: Sat Jan 29
15:17:16 PST 2011; root:xnu-1504.9.37~1/RELEASE_I386 i386 i386

$ ruby -v j.rb
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.6.0]
<internal:lib/rubygems/custom_require>:29:in `require': undefined
class/module Digest::Base (ArgumentError)

$ uname -a
Darwin Mac-mini.local 10.6.0 Darwin Kernel Version 10.6.0: Wed Nov 10
18:13:17 PST 2010; root:xnu-1504.9.26~3/RELEASE_I386 i386

$ ruby -v j.rb
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.6.0]
120

ruby installed via homebrew

Ruby I compiled and installed myself.

Your Ruby was compiled under 10.7 or 10.6?

···

On Tue, Apr 19, 2011 at 10:12 AM, Martin Hansen <mail@maasha.dk> wrote:

On Tue, Apr 19, 2011 at 6:30 AM, Martin Hansen <mail@maasha.dk> wrote:

I hesitate to suggest this, given that if you've compiled and
installed Ruby yourself you're probably well aware of the various
possibilities. But: does the extension have to be in C and does it
have to be the Ruby you've compiled and installed? If not, and you're
having problems getting the C extensions you want to work with Ruby,
have you considered using JRuby and Java? I've found the integration
of Java with JRuby to be fairly easy: I have had some problems, but
that's due to my ignorance, not JRuby or Java, and I've usually found
a way round my problem quickly. And using JRuby with Java for the bits
that need to be fast has had a similar speed to compiled FreePascal.

···

On Tue, Apr 19, 2011 at 3:12 PM, Martin Hansen <mail@maasha.dk> wrote:

Ruby I compiled and installed myself. Inline was installed: gem install RubyInline

Your Ruby was compiled under 10.7 or 10.6?

I keep a log of the ruby compile. It looks like it was build under
10.6.0. The Mac OS here is 10.6.7 (there is no 10.7 out yet).

Is there a better place for this "bug report" than this thread?

Cheers

Martin

···

--
Posted via http://www.ruby-forum.com/\.

> Your Ruby was compiled under 10.7 or 10.6?

I keep a log of the ruby compile. It looks like it was build under
10.6.0. The Mac OS here is 10.6.7 (there is no 10.7 out yet).

Oh, okay. I wanted to check since your uname above said "10.7.0".

Is there a better place for this "bug report" than this thread?

Ruby Issue Tracking System: http://redmine.ruby-lang.org/

Can you run your test script with "ruby -d"?

···

Martin Hansen <mail@maasha.dk> wrote: