Fastest way to search for substrings in strings

Hello,

I'm building a PoC DNA sequence alignment tool. I need to be able to search substrings in large strings quickly. What's the best to approach this? I''m thinking of using simply regular expressions. Is there any external library that deals with this? Other than the fact that the sample string contains the sub-string, I need to know the exact position of the substring (where it starts where it ends).

Thanks,

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

"As you set out for Ithaca, hope the voyage is a long one, full of adventure, full of discovery [...]" - C. P. Cavafy

I'm building a PoC DNA sequence alignment tool. I need to be able to search substrings in large strings quickly. What's the best to approach this? I''m thinking of using simply regular expressions. Is there any external library that deals with this?

There are some fast string search algorithms out there (Boyer-Moore
etc.) but I guess in this case regex is the fastest than implementing
any of those in Ruby. (May be different with a C extension though.)

IIRC there are some libs for bioinformatics. Maybe here:
https://rubygems.org/gems?letter=D&page=91
https://rubygems.org/search?utf8=✓&query=dna

Other than the fact that the sample string contains the sub-string, I need to know the exact position of the substring (where it starts where it ends).

Nitpick: if you know the start you automatically know the end because
you have the length of the substring already. :slight_smile:

Kind regards

robert

···

On Thu, Oct 9, 2014 at 4:42 PM, Panagiotis Atmatzidis <atma@convalesco.org> wrote:

--
[guy, jim].each {|him| remember.him do |as, often| as.you_can - without end}
http://blog.rubybestpractices.com/

Hi Panagiotis,

There are some options for your problem, depending on the details, some may
be better suited for your application than others. I do not know what the
acronym 'PoC' stand for, so if something special is implied that I've
overlooked, sorry.

DNA sequence alignment is a problem with many tools to solve it. However,
using ruby there are few general options, like using the BioRuby framework
for parsing output of alignment programs. Not sure this is the best option
for you, but maybe worth looking into.

You mentioned that you want to "... search substrings in large strings...",
this sounds like you want to perform fitting alignment [1]. If this is the
case, I'm not aware of many options, which is why I made a gem for this
purpose [2]. It should be reasonably fast, since it uses the C++ SeqAn
library [3] for the actual alignments. Please note, I made this gem for my
own purposes, and I don't believe anyone else has actually used it, so use
at your own risk.

If in fact you are doing something more specific, such as if you are trying
to map reads to a reference genome, I'd suggest using a proper read mapper,
like bowtie [4]. You can then parse the output in ruby. There may be some
gems to help with this, but it's simple enough to write one for yourself
(that's what I did). I just found another gem [5], but have never tried it.
This may also work for your purpose.

Hope this helps.

Cheers.

References

···

-------------------
[1] http://rosalind.info/glossary/fitting-alignment/
[2] bioseqalign | RubyGems.org | your community gem host
[3] http://www.seqan.de/
[4] Bowtie: An ultrafast, memory-efficient short read aligner
[5] bio-bwa | RubyGems.org | your community gem host

On Thu, Oct 9, 2014 at 7:42 AM, Panagiotis Atmatzidis <atma@convalesco.org> wrote:

Hello,

I'm building a PoC DNA sequence alignment tool. I need to be able to
search substrings in large strings quickly. What's the best to approach
this? I''m thinking of using simply regular expressions. Is there any
external library that deals with this? Other than the fact that the sample
string contains the sub-string, I need to know the exact position of the
substring (where it starts where it ends).

Thanks,

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

"As you set out for Ithaca, hope the voyage is a long one, full of
adventure, full of discovery [...]" - C. P. Cavafy

It really depends on the subject strings and the search substrings.

However, in general, with the very long subject strings, and substring that don’t have easily detectable delimiters, you can easily find yourself in exponential hell with regular expressions. You would have to exercise extra care to minimize the need for backtracking in your patterns. I find that this often yields long and hard-to-read regular expressions.

I might be worthwhile to look at a scanner/FSM generator. I personally like ragel[1], which outputs ruby code and is extremely fast. It has several options for fine tuning the output for code-size/performance tradeoffs.

It appears that PoC could also mean Point-of-Care in the context of DNA data apps.

Regards,
Ammar

[1] Ragel State Machine Compiler

···

On Oct 9, 2014, at 5:42 PM, Panagiotis Atmatzidis <atma@convalesco.org> wrote:

I'm building a PoC DNA sequence alignment tool. I need to be able to search substrings in large strings quickly. What's the best to approach this? I''m thinking of using simply regular expressions. Is there any external library that deals with this? Other than the fact that the sample string contains the sub-string, I need to know the exact position of the substring (where it starts where it ends).

I'm fairly new to Ruby but from a naive understanding what's wrong with String#index ?

gvim

···

On 09/10/2014 15:42, Panagiotis Atmatzidis wrote:

Hello,

I'm building a PoC DNA sequence alignment tool. I need to be able to search substrings in large strings quickly. What's the best to approach this? I''m thinking of using simply regular expressions. Is there any external library that deals with this? Other than the fact that the sample string contains the sub-string, I need to know the exact position of the substring (where it starts where it ends).

Thanks,

Panagiotis (atmosx) Atmatzidis

I do not know what the acronym 'PoC' stand for,

"Proof of Concept"

Peter

···

On Sun, Oct 19, 2014 at 7:03 AM, Stefano Bonissone <stefano.rb@gmail.com> wrote:

Before resorting something more complicated I would prove that the
simple thing is actually bad. Otherwise you'll end up complicating
things unnecessary. :slight_smile:

Kind regards

robert

···

On Sun, Oct 19, 2014 at 12:36 PM, Ammar Ali <ammarabuali@gmail.com> wrote:

It really depends on the subject strings and the search substrings.

However, in general, with the very long subject strings, and substring that don’t have easily detectable delimiters, you can easily find yourself in exponential hell with regular expressions. You would have to exercise extra care to minimize the need for backtracking in your patterns. I find that this often yields long and hard-to-read regular expressions.

I might be worthwhile to look at a scanner/FSM generator. I personally like ragel[1], which outputs ruby code and is extremely fast. It has several options for fine tuning the output for code-size/performance tradeoffs.

--
[guy, jim].each {|him| remember.him do |as, often| as.you_can - without end}
http://blog.rubybestpractices.com/

Ciao Stefano!

I get from the name that you’re Italian. My mother is from Sicily, my second "mother tongue" is Italian :slight_smile:

Hi Panagiotis,

There are some options for your problem, depending on the details, some may be better suited for your application than others. I do not know what the acronym 'PoC' stand for, so if something special is implied that I've overlooked, sorry.

As Peter said, stands for ‘Proof of Concept’.

DNA sequence alignment is a problem with many tools to solve it. However, using ruby there are few general options, like using the BioRuby framework for parsing output of alignment programs. Not sure this is the best option for you, but maybe worth looking into.

I’m creating a proof of concept Sinatra application to perform basic sequence alignment for my thesis (I’m about to get a degree in Pharmacy). I’m using the bioruby library for translation (DNA to protein). The tool will only deal with bacterias to avoid complications such as slicing and huge sequences (e.g. human genome).

You mentioned that you want to "... search substrings in large strings...", this sounds like you want to perform fitting alignment [1]. If this is the case, I'm not aware of many options, which is why I made a gem for this purpose [2]. It should be reasonably fast, since it uses the C++ SeqAn library [3] for the actual alignments. Please note, I made this gem for my own purposes, and I don't believe anyone else has actually used it, so use at your own risk.

Thanks for the pointers! I didn’t knew about the term ‘fitting alignment’ although I’ve read several papers about history of sequencing (from the 70s till today). That’s exactly what I need :slight_smile: Thanks for the the gem, didn’t knew about it! Will this gem work with protein sequences too? Say I’m translating a DNA sequence to proteins (find ORF’s on every strand with length > X, then compare findings with my sample?).

If in fact you are doing something more specific, such as if you are trying to map reads to a reference genome, I'd suggest using a proper read mapper, like bowtie [4]. You can then parse the output in ruby. There may be some gems to help with this, but it's simple enough to write one for yourself (that's what I did). I just found another gem [5], but have never tried it. This may also work for your purpose.

Hope this helps.

I don’t think I will need something as specific as bootie. Thanks for your help and gem!

Cheers.

References
-------------------
[1] ROSALIND | Glossary | Fitting alignment
[2] bioseqalign | RubyGems.org | your community gem host
[3] http://www.seqan.de/
[4] Bowtie: An ultrafast, memory-efficient short read aligner
[5] bio-bwa | RubyGems.org | your community gem host

Hello,

I'm building a PoC DNA sequence alignment tool. I need to be able to search substrings in large strings quickly. What's the best to approach this? I''m thinking of using simply regular expressions. Is there any external library that deals with this? Other than the fact that the sample string contains the sub-string, I need to know the exact position of the substring (where it starts where it ends).

Thanks,

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org <mailto:atma@convalesco.org>
URL: http://www.convalesco.org <http://www.convalesco.org/&gt;
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu <http://pgp.mit.edu/&gt; --recv-keys 1A7BFEC5

"As you set out for Ithaca, hope the voyage is a long one, full of adventure, full of discovery [...]" - C. P. Cavafy

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

"As you set out for Ithaca, hope the voyage is a long one, full of adventure, full of discovery [...]" - C. P. Cavafy

···

On 19 Oct 2014, at 08:03, Stefano Bonissone <stefano.rb@gmail.com> wrote:
On Thu, Oct 9, 2014 at 7:42 AM, Panagiotis Atmatzidis <atma@convalesco.org <mailto:atma@convalesco.org>> wrote:

Hello,

I'm building a PoC DNA sequence alignment tool. I need to be able to search substrings in large strings quickly. What's the best to approach this? I''m thinking of using simply regular expressions. Is there any external library that deals with this? Other than the fact that the sample string contains the sub-string, I need to know the exact position of the substring (where it starts where it ends).

It really depends on the subject strings and the search substrings.

However, in general, with the very long subject strings, and substring that don’t have easily detectable delimiters, you can easily find yourself in exponential hell with regular expressions. You would have to exercise extra care to minimize the need for backtracking in your patterns. I find that this often yields long and hard-to-read regular expressions.

I might be worthwhile to look at a scanner/FSM generator. I personally like ragel[1], which outputs ruby code and is extremely fast. It has several options for fine tuning the output for code-size/performance tradeoffs.

It appears that PoC could also mean Point-of-Care in the context of DNA data apps.

Regards,
Ammar

[1] http://www.colm.net/open-source/ragel/

Thanks for the hint, I’ll take a look at “ragel”!

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

"As you set out for Ithaca, hope the voyage is a long one, full of adventure, full of discovery [...]" - C. P. Cavafy

···

On 19 Oct 2014, at 13:36, Ammar Ali <ammarabuali@gmail.com> wrote:
On Oct 9, 2014, at 5:42 PM, Panagiotis Atmatzidis <atma@convalesco.org> wrote:

I completely agree. :slight_smile:

Not having seen what the data in question looks like, I just threw another idea out there. If a simpler approach with less external dependencies and moving parts does the job, then hallelujah!

Cheers,
Ammar

···

On Oct 19, 2014, at 9:06 PM, Robert Klemme <shortcutter@googlemail.com> wrote:

Before resorting something more complicated I would prove that the
simple thing is actually bad. Otherwise you'll end up complicating
things unnecessary. :slight_smile:

Ciao Stefano!

I get from the name that you’re Italian. My mother is from Sicily, my
second "mother tongue" is Italian :slight_smile:

:slight_smile:

Hi Panagiotis,

There are some options for your problem, depending on the details, some
may be better suited for your application than others. I do not know what
the acronym 'PoC' stand for, so if something special is implied that I've
overlooked, sorry.

As Peter said, stands for ‘Proof of Concept’.

DNA sequence alignment is a problem with many tools to solve it. However,
using ruby there are few general options, like using the BioRuby framework
for parsing output of alignment programs. Not sure this is the best option
for you, but maybe worth looking into.

I’m creating a proof of concept Sinatra application to perform basic
sequence alignment for my thesis (I’m about to get a degree in Pharmacy).
I’m using the bioruby library for translation (DNA to protein). The tool
will only deal with bacterias to avoid complications such as slicing and
huge sequences (e.g. human genome).

I see, nice.

You mentioned that you want to "... search substrings in large
strings...", this sounds like you want to perform fitting alignment [1]. If
this is the case, I'm not aware of many options, which is why I made a gem
for this purpose [2]. It should be reasonably fast, since it uses the C++
SeqAn library [3] for the actual alignments. Please note, I made this gem
for my own purposes, and I don't believe anyone else has actually used it,
so use at your own risk.

Thanks for the pointers! I didn’t knew about the term ‘fitting alignment’
although I’ve read several papers about history of sequencing (from the 70s
till today). That’s exactly what I need :slight_smile: Thanks for the the gem, didn’t
knew about it! Will this gem work with protein sequences too? Say I’m
translating a DNA sequence to proteins (find ORF’s on every strand with
length > X, then compare findings with my sample?).

Yes, the gem will work with protein sequences, you can specify

match/mismatch/indel penalties. Really though, for protein sequences you
should use a BLOSUM matrix or the like (http://en.wikipedia.org/wiki/BLOSUM\),
something that isn't currently supported. You should be able to modify it
to work fairly easily, using seqan.

If in fact you are doing something more specific, such as if you are
trying to map reads to a reference genome, I'd suggest using a proper read
mapper, like bowtie [4]. You can then parse the output in ruby. There may
be some gems to help with this, but it's simple enough to write one for
yourself (that's what I did). I just found another gem [5], but have never
tried it. This may also work for your purpose.

Hope this helps.

I don’t think I will need something as specific as bootie. Thanks for your
help and gem!

Ok.

Cheers.

···

On Sun, Oct 19, 2014 at 12:58 PM, Panagiotis Atmatzidis <atma@convalesco.org > wrote:

On 19 Oct 2014, at 08:03, Stefano Bonissone <stefano.rb@gmail.com> wrote:

Cheers.

References
-------------------
[1] ROSALIND | Glossary | Fitting alignment
[2] https://rubygems.org/gems/bioseqalign
[3] http://www.seqan.de/
[4] Bowtie: An ultrafast, memory-efficient short read aligner
[5] bio-bwa | RubyGems.org | your community gem host

On Thu, Oct 9, 2014 at 7:42 AM, Panagiotis Atmatzidis <atma@convalesco.org > > wrote:

Hello,

I'm building a PoC DNA sequence alignment tool. I need to be able to
search substrings in large strings quickly. What's the best to approach
this? I''m thinking of using simply regular expressions. Is there any
external library that deals with this? Other than the fact that the sample
string contains the sub-string, I need to know the exact position of the
substring (where it starts where it ends).

Thanks,

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

"As you set out for Ithaca, hope the voyage is a long one, full of
adventure, full of discovery [...]" - C. P. Cavafy

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

"As you set out for Ithaca, hope the voyage is a long one, full of
adventure, full of discovery [...]" - C. P. Cavafy