Regexp with gaps

egrasso · 14 July 2008 14:05

Hi, I need to find the positions of some substrings inside a long
string. For this I'm using a loop that calls str.index(pattern,
last_found_position + 1), so I find every position where the pattern
matches. The pattern is a string of 20 chars, different each time I
run the script. That worked perfectly. The problem is that now I need to
find all positions where at least 12 of the pattern's 20 chars match.
For example, for the pattern "aaaaaa", find the substrings "aaaaaa",
"aaabaa", "baaaaa", "ababaa", etc.
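
For reference, the exact-match loop described above might look roughly like the sketch below. The method name all_positions is made up for illustration; the only library call it relies on is String#index with a start offset.

```ruby
# Collect every 0-based start position at which pattern occurs in str,
# including overlapping occurrences.
def all_positions(str, pattern)
  positions = []
  pos = str.index(pattern)
  while pos
    positions << pos
    pos = str.index(pattern, pos + 1)   # resume one char past the last hit
  end
  positions
end

p all_positions("aaaaaaa", "aaa")   # prints [0, 1, 2, 3, 4]
```

This only finds exact occurrences, which is why it stops being enough once mismatches are allowed.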

First I thought that I could create all possible patterns (with \w)
and check them, but I realized that there would be a lot of different
patterns to check (more than a few hundred, I think).
Is there any way to do this without having to check a lot of
patterns?
Thanks

Phlip1 · 14 July 2008 14:15

> The problem is that now I need to
> find all positions where the pattern matches 12 or more chars.
> For example: For the pattern "aaaaaa", find substrings "aaaaaa",
> "aaabaa", "baaaaa", "ababaa", etc
>
> First I thought that I could create all possible patterns (with \w)

\w{12,}

Right?

Either that or \w{12}\w*

egrasso · 15 July 2008 03:18

Mmmmm... no. I think I didn't explain the idea very well. I'm writing a
script to find specific DNA sequences (binding sites) inside a large
DNA sequence (for those who don't know, DNA sequences are made of 4
different bases: A, T, C and G). The problem is that the binding sites don't
need to be 100% exact to work. For example, the binding site for a protein X
is "AAATTT", but the protein can also bind to the sequence "AAAGTT"
or "AACGTT" and work fine. I need to find all these sites, but the only data
I have is that "Protein X binds to AAATTT".
I finally solved the problem without using str.index or a regexp; basically,
I search manually:

(Note: variable names are in Spanish: buscarBS = find binding site,
patron = pattern, semejanza = minimum similarity (from 0 to 1), cadena = string,
respuesta = answer, largo = length, puntos = points)

def buscarBS(patron, semejanza=0.6, cadena=@secuencia)
  respuesta = ""
  largoc = cadena.length
  largop = patron.length
  i = 0   # integer indexes; only the score needs to be a float

  while i <= (largoc - largop)
    j = 0
    puntos = 0
    # mismatch budget: how many bases may differ before the window is rejected
    subpuntos = largop * (1 - semejanza)

    # >= 0 (not > 0) so a window that uses exactly its whole budget
    # is still scanned to the end instead of being cut short
    while (j < largop) and (subpuntos >= 0)
      if cadena[i + j] == patron[j]
        puntos += 1
      else
        subpuntos -= 1
      end
      j += 1
    end

    similitud = puntos.to_f / largop
    if similitud >= semejanza
      # "from: <start> to: <end> - similarity: <percent>" (1-based positions)
      respuesta += "desde: #{i + 1} hasta: #{i + j} - similitud: - #{similitud * 100}%\n"
    end
    i += 1
  end

  if respuesta == ""
    # "No similar sequence was found"
    respuesta = "No se encontro ninguna secuencia similar (similitud: #{semejanza} - #{patron})"
  else
    # "The following similarities were found"
    respuesta = "\nSe encontraron las siguientes similitudes:\n\n" + respuesta
  end
  return respuesta
end

I still need to polish and optimize the code, but it finds all possible
sites with at least a specified similarity and tells me how similar they
are. If anyone has another idea, needs more details about the code, or is
interested in bioinformatics with Ruby, tell me.
Thanks
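
A more compact take on the same sliding-window idea, as a sketch only: the method name similar_sites and its return format (pairs of 0-based start position and similarity) are assumptions made for this example, not part of the script above. It compares each window to the pattern position by position and keeps the windows that reach the minimum similarity.

```ruby
# Slide a window of pattern.length over sequence and keep every window
# whose fraction of matching positions is at least min_similarity.
# Returns an array of [start, similarity] pairs; start is 0-based.
def similar_sites(pattern, sequence, min_similarity = 0.6)
  len = pattern.length
  hits = []
  (0..sequence.length - len).each do |start|
    window = sequence[start, len]
    # count the positions where window and pattern hold the same base
    matches = (0...len).count { |k| window[k] == pattern[k] }
    similarity = matches.to_f / len
    hits << [start, similarity] if similarity >= min_similarity
  end
  hits
end

sites = similar_sites("AAATTT", "GGAAAGTTCC", 0.8)
sites.each { |start, sim| puts "start #{start}: #{(sim * 100).round}%" }
# prints: start 2: 83%
```

Returning data instead of a formatted string leaves the reporting to the caller. Every window is scanned in full here, so the early exit on the mismatch budget is traded for simplicity.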


Axel_Etzold · 15 July 2008 09:10

Hi,

You could use the McIlroy-Hunt longest common subsequence (LCS) algorithm,
which gives you the longest common subsequence of two sequences, and also
information of the type

'sequence AAATTT is transformed into AAAGTT by changing T to G at the fourth entry.'

You can find a Ruby gem implementation here: http://raa.ruby-lang.org/project/diff-lcs/
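
To illustrate what a longest common subsequence is, here is a minimal pure-Ruby sketch. It is a plain textbook dynamic-programming table, not the McIlroy-Hunt algorithm and not the diff-lcs gem's API; the method name lcs is chosen just for this example.

```ruby
# Return one longest common subsequence of strings a and b.
def lcs(a, b)
  # table[i][j] = LCS length of the first i chars of a and first j chars of b
  table = Array.new(a.length + 1) { Array.new(b.length + 1, 0) }
  a.each_char.with_index(1) do |ca, i|
    b.each_char.with_index(1) do |cb, j|
      table[i][j] =
        if ca == cb
          table[i - 1][j - 1] + 1
        else
          [table[i - 1][j], table[i][j - 1]].max
        end
    end
  end

  # Walk back from the bottom-right corner to rebuild one LCS.
  result = []
  i = a.length
  j = b.length
  while i > 0 && j > 0
    if a[i - 1] == b[j - 1]
      result.unshift(a[i - 1])
      i -= 1
      j -= 1
    elsif table[i - 1][j] >= table[i][j - 1]
      i -= 1
    else
      j -= 1
    end
  end
  result.join
end

puts lcs("AAATTT", "AAAGTT")   # prints AAATT
```

Note that a subsequence may skip characters in either string, so an LCS score tolerates insertions and deletions as well as substitutions. That is a looser notion than the position-by-position similarity discussed earlier in the thread.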

Best regards,

Axel

