Search for a string in another string allowing mismatches

Hi all,

I'm sure this is a very common problem. But can't find a simple &
efficient way of solving it.

Basically, I want to find out if a string contains my search string as a
substring. However, a certain amount of mismatches has to be allowed.
Here's an example:

query string:
"acgt"

subject string
"acctaggg"

If no mismatch was allowed, there would be 0 hits.
If 1 mismatch was allowed, the query string would match "acct".
If 2 mismatches were allowed, the query string would match "acct" and
"aggg".

If possible, I would like to do this using regular expressions, as my
search string can also contain ambiguous characters. So a real search
pattern might look something like this: /[ac][c][cg][t][acgt][c][gt][t]/

Unfortunately, performance is also very important, as I will have to
perform thousands of searches in strings that contain several million
characters.

I'd be very grateful for any suggestions!

Cheers,
Janus

···

--
Posted via http://www.ruby-forum.com/.

Hi all,

I'm sure this is a very common problem. But can't find a simple &
efficient way of solving it.

Hmm. I wouldn't think this is common at all. I have dealt with a somewhat similar situation in the past in two ways:

1. Let the user do partial searches.

2. Using the 'soundex' tools in a database

Neither of those matches your situation, which appears to be scanning for near-matches in genetic material.

Unfortunately, performance is also very important, as I will have to
perform thousands of searches in strings that contain several million
characters.

This certainly isn't my field of expertise, but I don't expect a fuzzy-logic search and 'performance' to be particularly compatible. Hope I'm wrong!

If possible, I would like to do this using regular expressions, as my
search string can also contain ambiguous characters. So a real search
pattern might look something like this: /[ac][c][cg][t][acgt][c][gt][t]/

Gah.

What came to my mind was something like

  fuzzysearch("agct", 1)

where "1" is the number of mismatches allowed. The code in question would run the following searches:
  /.gct/
  /a.ct/
  /ag.t/
  /agc./

for fuzzysearch("agct",2) it would run
  /..ct/
  /.g.t/
  /.gc./
  /a..t/
  /a.c./
  /ag../

I have no idea what the overhead of wildcards are; if there's some ultra-simple ultra-fast string matching tool somewhere that you can call from Ruby or invoke through a command line, then you could take advantage of the tiny set of possible alternatives, and break the fuzzy search into 13 static searches:
  agct
  cgct
  ggct
  tgct
  aact
  acct
  atct
  .
  .
  .
  agca
  agcc
  agcg

···

On Sep 21, 2010, at 11:07 , Janus Bor wrote:

This does not use regular expressions.
It does not tell you which characters match. It tells you how many.

I don't know if it is useful or fast enough but maybe it will give you
some ideas.

str = "acctaggg"

(0..str.length-4).each do |y|
  p str.unpack("@#{y}a4")[0].split(//).zip("acgt".split(//)).map{|r|
r[0]<=>r[1]}.select{|f| f==0}.size
end

Harry

···

On Wed, Sep 22, 2010 at 3:07 AM, Janus Bor <janus@urban-youth.com> wrote:

Basically, I want to find out if a string contains my search string as a
substring. However, a certain amount of mismatches has to be allowed.
Here's an example:

query string:
"acgt"

subject string
"acctaggg"

If no mismatch was allowed, there would be 0 hits.
If 1 mismatch was allowed, the query string would match "acct".
If 2 mismatches were allowed, the query string would match "acct" and
"aggg".

A little modification of my previous post.
Again, (lack of) speed may be a problem.

class String
  def fuz(m,n) # (string to match, number of matches)
    a =
    (0..size-4).each do |y|
      t = self[y,4]
      if t.split(//).zip(m.split(//)).map{|r| r[0]<=>r[1]}.select{|f|
f==0}.size>=n
        a << t
      end
    end
    a
  end
end

str = "acctaggg"

p str.fuz("acgt",2) #> ["acct", "aggg"]

Harry

···

On Wed, Sep 22, 2010 at 9:47 PM, Harry Kakueki <list.push@gmail.com> wrote:

On Wed, Sep 22, 2010 at 3:07 AM, Janus Bor <janus@urban-youth.com> wrote:

Basically, I want to find out if a string contains my search string as a
substring. However, a certain amount of mismatches has to be allowed.
Here's an example:

query string:
"acgt"

subject string
"acctaggg"

If no mismatch was allowed, there would be 0 hits.
If 1 mismatch was allowed, the query string would match "acct".
If 2 mismatches were allowed, the query string would match "acct" and
"aggg".

This does not use regular expressions.
It does not tell you which characters match. It tells you how many.

I don't know if it is useful or fast enough but maybe it will give you
some ideas.

str = "acctaggg"

(0..str.length-4).each do |y|
p str.unpack("@#{y}a4")[0].split(//).zip("acgt".split(//)).map{|r|
r[0]<=>r[1]}.select{|f| f==0}.size
end

Harry