Fastest way to search for substrings in strings

Panagiotis_Atmatzidi · 9 October 2014 14:42

Hello,

I'm building a PoC DNA sequence alignment tool. I need to be able to search substrings in large strings quickly. What's the best to approach this? I''m thinking of using simply regular expressions. Is there any external library that deals with this? Other than the fact that the sample string contains the sub-string, I need to know the exact position of the substring (where it starts where it ends).

Thanks,

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

"As you set out for Ithaca, hope the voyage is a long one, full of adventure, full of discovery [...]" - C. P. Cavafy

Robert_K1 · 9 October 2014 19:00

I'm building a PoC DNA sequence alignment tool. I need to be able to search substrings in large strings quickly. What's the best to approach this? I''m thinking of using simply regular expressions. Is there any external library that deals with this?

There are some fast string search algorithms out there (Boyer-Moore
etc.) but I guess in this case regex is the fastest than implementing
any of those in Ruby. (May be different with a C extension though.)

IIRC there are some libs for bioinformatics. Maybe here:
https://rubygems.org/gems?letter=D&page=91
https://rubygems.org/search?utf8=✓&query=dna

Other than the fact that the sample string contains the sub-string, I need to know the exact position of the substring (where it starts where it ends).

Nitpick: if you know the start you automatically know the end because
you have the length of the substring already.

Kind regards

robert

···

On Thu, Oct 9, 2014 at 4:42 PM, Panagiotis Atmatzidis <atma@convalesco.org> wrote:

--
[guy, jim].each {|him| remember.him do |as, often| as.you_can - without end}
http://blog.rubybestpractices.com/

Stefano_Bonissone · 19 October 2014 05:03

Hi Panagiotis,

There are some options for your problem, depending on the details, some may
be better suited for your application than others. I do not know what the
acronym 'PoC' stand for, so if something special is implied that I've
overlooked, sorry.

DNA sequence alignment is a problem with many tools to solve it. However,
using ruby there are few general options, like using the BioRuby framework
for parsing output of alignment programs. Not sure this is the best option
for you, but maybe worth looking into.

You mentioned that you want to "... search substrings in large strings...",
this sounds like you want to perform fitting alignment [1]. If this is the
case, I'm not aware of many options, which is why I made a gem for this
purpose [2]. It should be reasonably fast, since it uses the C++ SeqAn
library [3] for the actual alignments. Please note, I made this gem for my
own purposes, and I don't believe anyone else has actually used it, so use
at your own risk.

If in fact you are doing something more specific, such as if you are trying
to map reads to a reference genome, I'd suggest using a proper read mapper,
like bowtie [4]. You can then parse the output in ruby. There may be some
gems to help with this, but it's simple enough to write one for yourself
(that's what I did). I just found another gem [5], but have never tried it.
This may also work for your purpose.

Hope this helps.

Cheers.

References

···

-------------------
[1] http://rosalind.info/glossary/fitting-alignment/
[2] bioseqalign | RubyGems.org | your community gem host
[3] http://www.seqan.de/
[4] Bowtie: An ultrafast, memory-efficient short read aligner
[5] bio-bwa | RubyGems.org | your community gem host

On Thu, Oct 9, 2014 at 7:42 AM, Panagiotis Atmatzidis <atma@convalesco.org> wrote:

Hello,

I'm building a PoC DNA sequence alignment tool. I need to be able to
search substrings in large strings quickly. What's the best to approach
this? I''m thinking of using simply regular expressions. Is there any
external library that deals with this? Other than the fact that the sample
string contains the sub-string, I need to know the exact position of the
substring (where it starts where it ends).

Thanks,

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

"As you set out for Ithaca, hope the voyage is a long one, full of
adventure, full of discovery [...]" - C. P. Cavafy

Ammar_Ali · 19 October 2014 10:36

It really depends on the subject strings and the search substrings.

However, in general, with the very long subject strings, and substring that don’t have easily detectable delimiters, you can easily find yourself in exponential hell with regular expressions. You would have to exercise extra care to minimize the need for backtracking in your patterns. I find that this often yields long and hard-to-read regular expressions.

I might be worthwhile to look at a scanner/FSM generator. I personally like ragel[1], which outputs ruby code and is extremely fast. It has several options for fine tuning the output for code-size/performance tradeoffs.

It appears that PoC could also mean Point-of-Care in the context of DNA data apps.

Regards,
Ammar

[1] Ragel State Machine Compiler

···

On Oct 9, 2014, at 5:42 PM, Panagiotis Atmatzidis <atma@convalesco.org> wrote:

I'm building a PoC DNA sequence alignment tool. I need to be able to search substrings in large strings quickly. What's the best to approach this? I''m thinking of using simply regular expressions. Is there any external library that deals with this? Other than the fact that the sample string contains the sub-string, I need to know the exact position of the substring (where it starts where it ends).

gvim · 19 October 2014 22:51

I'm fairly new to Ruby but from a naive understanding what's wrong with String#index ?

gvim

···

On 09/10/2014 15:42, Panagiotis Atmatzidis wrote:

Hello,

I'm building a PoC DNA sequence alignment tool. I need to be able to search substrings in large strings quickly. What's the best to approach this? I''m thinking of using simply regular expressions. Is there any external library that deals with this? Other than the fact that the sample string contains the sub-string, I need to know the exact position of the substring (where it starts where it ends).

Thanks,

Panagiotis (atmosx) Atmatzidis

Peter_Vandenabeele1 · 19 October 2014 08:45

I do not know what the acronym 'PoC' stand for,

"Proof of Concept"

Peter

···

On Sun, Oct 19, 2014 at 7:03 AM, Stefano Bonissone <stefano.rb@gmail.com> wrote:

Robert_K1 · 19 October 2014 18:06

Before resorting something more complicated I would prove that the
simple thing is actually bad. Otherwise you'll end up complicating
things unnecessary.

Kind regards

robert

···

On Sun, Oct 19, 2014 at 12:36 PM, Ammar Ali <ammarabuali@gmail.com> wrote:

It really depends on the subject strings and the search substrings.

However, in general, with the very long subject strings, and substring that don’t have easily detectable delimiters, you can easily find yourself in exponential hell with regular expressions. You would have to exercise extra care to minimize the need for backtracking in your patterns. I find that this often yields long and hard-to-read regular expressions.

I might be worthwhile to look at a scanner/FSM generator. I personally like ragel[1], which outputs ruby code and is extremely fast. It has several options for fine tuning the output for code-size/performance tradeoffs.

--
[guy, jim].each {|him| remember.him do |as, often| as.you_can - without end}
http://blog.rubybestpractices.com/

Panagiotis_Atmatzidi · 19 October 2014 19:58

Ciao Stefano!

I get from the name that you’re Italian. My mother is from Sicily, my second "mother tongue" is Italian

Hi Panagiotis,

There are some options for your problem, depending on the details, some may be better suited for your application than others. I do not know what the acronym 'PoC' stand for, so if something special is implied that I've overlooked, sorry.

As Peter said, stands for ‘Proof of Concept’.

DNA sequence alignment is a problem with many tools to solve it. However, using ruby there are few general options, like using the BioRuby framework for parsing output of alignment programs. Not sure this is the best option for you, but maybe worth looking into.

I’m creating a proof of concept Sinatra application to perform basic sequence alignment for my thesis (I’m about to get a degree in Pharmacy). I’m using the bioruby library for translation (DNA to protein). The tool will only deal with bacterias to avoid complications such as slicing and huge sequences (e.g. human genome).

You mentioned that you want to "... search substrings in large strings...", this sounds like you want to perform fitting alignment [1]. If this is the case, I'm not aware of many options, which is why I made a gem for this purpose [2]. It should be reasonably fast, since it uses the C++ SeqAn library [3] for the actual alignments. Please note, I made this gem for my own purposes, and I don't believe anyone else has actually used it, so use at your own risk.

Thanks for the pointers! I didn’t knew about the term ‘fitting alignment’ although I’ve read several papers about history of sequencing (from the 70s till today). That’s exactly what I need Thanks for the the gem, didn’t knew about it! Will this gem work with protein sequences too? Say I’m translating a DNA sequence to proteins (find ORF’s on every strand with length > X, then compare findings with my sample?).

If in fact you are doing something more specific, such as if you are trying to map reads to a reference genome, I'd suggest using a proper read mapper, like bowtie [4]. You can then parse the output in ruby. There may be some gems to help with this, but it's simple enough to write one for yourself (that's what I did). I just found another gem [5], but have never tried it. This may also work for your purpose.

Hope this helps.

I don’t think I will need something as specific as bootie. Thanks for your help and gem!

Cheers.

References
-------------------
[1] ROSALIND | Glossary | Fitting alignment
[2] bioseqalign | RubyGems.org | your community gem host
[3] http://www.seqan.de/
[4] Bowtie: An ultrafast, memory-efficient short read aligner
[5] bio-bwa | RubyGems.org | your community gem host

Hello,

I'm building a PoC DNA sequence alignment tool. I need to be able to search substrings in large strings quickly. What's the best to approach this? I''m thinking of using simply regular expressions. Is there any external library that deals with this? Other than the fact that the sample string contains the sub-string, I need to know the exact position of the substring (where it starts where it ends).

Thanks,

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org <mailto:atma@convalesco.org>
URL: http://www.convalesco.org <http://www.convalesco.org/>
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu <http://pgp.mit.edu/> --recv-keys 1A7BFEC5

"As you set out for Ithaca, hope the voyage is a long one, full of adventure, full of discovery [...]" - C. P. Cavafy

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

"As you set out for Ithaca, hope the voyage is a long one, full of adventure, full of discovery [...]" - C. P. Cavafy

···

On 19 Oct 2014, at 08:03, Stefano Bonissone <stefano.rb@gmail.com> wrote:
On Thu, Oct 9, 2014 at 7:42 AM, Panagiotis Atmatzidis <atma@convalesco.org <mailto:atma@convalesco.org>> wrote:

Panagiotis_Atmatzidi · 19 October 2014 20:00

Hello,

I'm building a PoC DNA sequence alignment tool. I need to be able to search substrings in large strings quickly. What's the best to approach this? I''m thinking of using simply regular expressions. Is there any external library that deals with this? Other than the fact that the sample string contains the sub-string, I need to know the exact position of the substring (where it starts where it ends).

It really depends on the subject strings and the search substrings.

However, in general, with the very long subject strings, and substring that don’t have easily detectable delimiters, you can easily find yourself in exponential hell with regular expressions. You would have to exercise extra care to minimize the need for backtracking in your patterns. I find that this often yields long and hard-to-read regular expressions.

I might be worthwhile to look at a scanner/FSM generator. I personally like ragel[1], which outputs ruby code and is extremely fast. It has several options for fine tuning the output for code-size/performance tradeoffs.

It appears that PoC could also mean Point-of-Care in the context of DNA data apps.

Regards,
Ammar

[1] http://www.colm.net/open-source/ragel/

Thanks for the hint, I’ll take a look at “ragel”!

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

"As you set out for Ithaca, hope the voyage is a long one, full of adventure, full of discovery [...]" - C. P. Cavafy

···

On 19 Oct 2014, at 13:36, Ammar Ali <ammarabuali@gmail.com> wrote:
On Oct 9, 2014, at 5:42 PM, Panagiotis Atmatzidis <atma@convalesco.org> wrote:

Ammar_Ali · 19 October 2014 19:57

I completely agree.

Not having seen what the data in question looks like, I just threw another idea out there. If a simpler approach with less external dependencies and moving parts does the job, then hallelujah!

Cheers,
Ammar

···

On Oct 19, 2014, at 9:06 PM, Robert Klemme <shortcutter@googlemail.com> wrote:

Before resorting something more complicated I would prove that the
simple thing is actually bad. Otherwise you'll end up complicating
things unnecessary.

Stefano_Bonissone · 24 October 2014 05:18

Ciao Stefano!

I get from the name that you’re Italian. My mother is from Sicily, my
second "mother tongue" is Italian

Hi Panagiotis,

There are some options for your problem, depending on the details, some
may be better suited for your application than others. I do not know what
the acronym 'PoC' stand for, so if something special is implied that I've
overlooked, sorry.

As Peter said, stands for ‘Proof of Concept’.

DNA sequence alignment is a problem with many tools to solve it. However,
using ruby there are few general options, like using the BioRuby framework
for parsing output of alignment programs. Not sure this is the best option
for you, but maybe worth looking into.

I’m creating a proof of concept Sinatra application to perform basic
sequence alignment for my thesis (I’m about to get a degree in Pharmacy).
I’m using the bioruby library for translation (DNA to protein). The tool
will only deal with bacterias to avoid complications such as slicing and
huge sequences (e.g. human genome).

I see, nice.

You mentioned that you want to "... search substrings in large
strings...", this sounds like you want to perform fitting alignment [1]. If
this is the case, I'm not aware of many options, which is why I made a gem
for this purpose [2]. It should be reasonably fast, since it uses the C++
SeqAn library [3] for the actual alignments. Please note, I made this gem
for my own purposes, and I don't believe anyone else has actually used it,
so use at your own risk.

Thanks for the pointers! I didn’t knew about the term ‘fitting alignment’
although I’ve read several papers about history of sequencing (from the 70s
till today). That’s exactly what I need Thanks for the the gem, didn’t
knew about it! Will this gem work with protein sequences too? Say I’m
translating a DNA sequence to proteins (find ORF’s on every strand with
length > X, then compare findings with my sample?).

Yes, the gem will work with protein sequences, you can specify

match/mismatch/indel penalties. Really though, for protein sequences you
should use a BLOSUM matrix or the like (http://en.wikipedia.org/wiki/BLOSUM\),
something that isn't currently supported. You should be able to modify it
to work fairly easily, using seqan.

If in fact you are doing something more specific, such as if you are
trying to map reads to a reference genome, I'd suggest using a proper read
mapper, like bowtie [4]. You can then parse the output in ruby. There may
be some gems to help with this, but it's simple enough to write one for
yourself (that's what I did). I just found another gem [5], but have never
tried it. This may also work for your purpose.

Hope this helps.

I don’t think I will need something as specific as bootie. Thanks for your
help and gem!

Ok.

Cheers.

···

On Sun, Oct 19, 2014 at 12:58 PM, Panagiotis Atmatzidis <atma@convalesco.org > wrote:

On 19 Oct 2014, at 08:03, Stefano Bonissone <stefano.rb@gmail.com> wrote:

Cheers.

References
-------------------
[1] ROSALIND | Glossary | Fitting alignment
[2] https://rubygems.org/gems/bioseqalign
[3] http://www.seqan.de/
[4] Bowtie: An ultrafast, memory-efficient short read aligner
[5] bio-bwa | RubyGems.org | your community gem host

On Thu, Oct 9, 2014 at 7:42 AM, Panagiotis Atmatzidis <atma@convalesco.org > > wrote:

Hello,

I'm building a PoC DNA sequence alignment tool. I need to be able to
search substrings in large strings quickly. What's the best to approach
this? I''m thinking of using simply regular expressions. Is there any
external library that deals with this? Other than the fact that the sample
string contains the sub-string, I need to know the exact position of the
substring (where it starts where it ends).

Thanks,

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

"As you set out for Ithaca, hope the voyage is a long one, full of
adventure, full of discovery [...]" - C. P. Cavafy

Panagiotis (atmosx) Atmatzidis

email: atma@convalesco.org
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

"As you set out for Ithaca, hope the voyage is a long one, full of
adventure, full of discovery [...]" - C. P. Cavafy

Topic		Replies	Views
Finding positions in a string ruby-talk	7	122	24 July 2007
Regexp with gaps ruby-talk	3	77	15 July 2008
Substring ruby-talk	3	257	13 March 2017
Start of string? ruby-talk	6	73	6 November 2007
Identify and extract positions from a string - how to? ruby-talk	11	112	19 July 2007

Fastest way to search for substrings in strings

Related topics