[QUIZ] Longest Repeated Substring (#153)

I agree on GPL3.

How much whitespace followed your GPL2 result? I ended up with
" 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA\n"
which is 57 characters. Maybe just a difference in the text files we
used? I used 'ruby-1.8.6/GPL'.

···

On Jan 18, 3:54 pm, tho_mica_l <micat...@gmail.com> wrote:

> First testcase:
> "your banana my banana" should give " banana"

For the GNU GENERAL PUBLIC LICENSE Version 2, June 1991, I get:

"customarily used for software interchange; or, "... some more
whitespace

For the GPL3 I get:

") Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by "...

If somebody can verify this.

I think this really starts to make fun when running the script over
Mark Twains entire work (from Gutenberg) or something similar.

i would think that

'aaaaaa'.longest_repeating_substring #=> 'aaaaa'

the quiz did not say that the two strings could not overlap.

is this correct james?

a @ http://codeforpeople.com/

···

On Jan 18, 2008, at 2:00 PM, yermej wrote:

On Jan 18, 2:38 pm, Ken Bloom <kbl...@gmail.com> wrote:

First testcase:
"your banana my banana" should give " banana"

--Ken

--
Ken (Chanoch) Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.http://www.iit.edu/~kbloom1/

And a second:
"aaaaaa" should give "aaa"

Right?

--
share your knowledge. it's a way to achieve immortality.
h.h. the 14th dalai lama

I think you're right in that it's wrong. :wink:

If you submit the string "ababab" to your program, it comes back with
"aba" as the longest non-overlapping substring, but the answer should
be "ab". When you compare the consecutive sorted suffixes "abab" and
"ababab", you allow strings up to 3 ("ababab".size / 2) to be used,
but in fact, they overlap in all but 2 characters.

I'll post my solution in a reply, which is very similar to your except
in the overlap prevention code, which, I have to admit, is pretty
ugly. And I'm not even convinced that I got it right!

Eric

···

On Jan 20, 12:04 pm, Dave Thomas <d...@pragprog.com> wrote:

I wouldn't be surprised if the idea of searching only 1/2 of the
second string to prevent overlaps is wrong.. :slight_smile:

Hi Dave,

Thanks for the brilliant idea--this time I really feel like I've
learned something new! :slight_smile:

I even took my time to implement it in C (80 lines) and I must admit
that ruby performs very well in this case: my C version is only about
6 times faster than your ruby...

However, with the algorithm presented, I don't see how one can obtain
the correct result for the "aaaa" case. Lets see; the suffix list
would be:

aaaa
aaa
aa
a

now, sorted:

a
aa
aaa
aaaa

And the catch is--in order to obtain the correct result, that is "aa",
one needs to observe "aa" and "aaaa", but these are not adjacent. In
this case two pairs of adjacent suffixes ("aa" and "aaa", "aaa" and
"aaaa") might give "aa", but they do overlap.

So do we need a special-case treatment here?

···

On Jan 20, 7:04 pm, Dave Thomas <d...@pragprog.com> wrote:

My hack at the substring problem (nice one, James) is based on suffix
trees.

Suffix trees and suffix arrays and a great tool for substring
operations. The idea is to create a list of all the possible suffixes
in the original string. For "banana" this would be

banana
anana
nana
ana
na
a

Because the list contains an entry starting at every position in the
string, we know that all possible substrings of the original string
must occur at the start of one of these list elements. The substring
"nan", for example, occurs at the start of the 3rd entry.

Now sort them

a
ana
anana
banana
na
nana

Now we know that all substrings that start with the same sequence of
characters must occur at the start of items in the list that are
adjacent. Searching through to list to find these is a linear operation.

--
Cheers,
Alex

It should, otherwise "banana" would give "ana" rather than "an".

Or even better it would be the same string.

···

--
Radosław Bułat

http://radarek.jogger.pl - mój blog

Hopefully both. :slight_smile:

James Edward Gray II

···

On Jan 18, 2008, at 3:10 PM, Radosław Bułat wrote:

On Jan 18, 2008 10:00 PM, yermej <yermej@gmail.com> wrote:

On Jan 18, 2:38 pm, Ken Bloom <kbl...@gmail.com> wrote:

And a second:
"aaaaaa" should give "aaa"

Right?

It should, otherwise "banana" would give "ana" rather than "an".

My question is (I'm not familiar with RubyQuiz too much :)): episode
focus on algorithm (speed) or source code (readable)?

Normally this is not very important all aspects can be of interest but
please be aware that this time James has explicitly asked for
solutions that are reasonable fast for longer input.
Great to have you join in BTW. The most important thing is to
participate of course...

Cheers
Robert

···

2008/1/18 Radosław Bułat <radek.bulat@gmail.com>:

On Jan 18, 2008 10:00 PM, yermej <yermej@gmail.com> wrote:
> On Jan 18, 2:38 pm, Ken Bloom <kbl...@gmail.com> wrote:

> And a second:
> "aaaaaa" should give "aaa"
>
> Right?

It should, otherwise "banana" would give "ana" rather than "an".

My question is (I'm not familiar with RubyQuiz too much :)): episode
focus on algorithm (speed) or source code (readable)?

--
http://ruby-smalltalk.blogspot.com/

---
Whereof one cannot speak, thereof one must be silent.
Ludwig Wittgenstein

" 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA\n"

It seems there is another version of this document around (at least on
my harddisk) in which the line in the header reads " 59 Temple Place,
Suite 330, Boston, MA 02111 USA\n" instead.

Thanks for verifying.

Thomas.

Actually the spec says "Repeated substrings may not overlap." so the correct answer for "aaaaaa" should be "aaa".

/dh

···

On 18 Jan 2008, at 23:05, ara.t.howard wrote:

And a second:
"aaaaaa" should give "aaa"

Right?

i would think that

'aaaaaa'.longest_repeating_substring #=> 'aaaaa'

the quiz did not say that the two strings could not overlap.

is this correct james?

Actually, it did. Here's the relevant line from the quiz:

   "Repeated substrings may not overlap."

James Edward Gray II

···

On Jan 18, 2008, at 5:05 PM, ara.t.howard wrote:

On Jan 18, 2008, at 2:00 PM, yermej wrote:

On Jan 18, 2:38 pm, Ken Bloom <kbl...@gmail.com> wrote:

First testcase:
"your banana my banana" should give " banana"

--Ken

--
Ken (Chanoch) Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.http://www.iit.edu/~kbloom1/

And a second:
"aaaaaa" should give "aaa"

Right?

i would think that

'aaaaaa'.longest_repeating_substring #=> 'aaaaa'

the quiz did not say that the two strings could not overlap.

is this correct james?

'aaaaaa'.longest_repeating_substring #=> 'aaaaa'

the quiz did not say that the two strings could not overlap.

is this correct james?

No, it isn't. Look at example from James:

Example:

       $ echo banana | ruby longest_repeated_substring.rb
       an

       OR

       $ echo banana | ruby longest_repeated_substring.rb
       na

Following your reasoning 'banana' would give 'ana', rather than 'an' (or 'na').

···

--
Radosław Bułat

http://radarek.jogger.pl - mój blog

> I wouldn't be surprised if the idea of searching only 1/2
> of the second string to prevent overlaps is wrong.. :slight_smile:

I think you're right in that it's wrong. :wink:

...snip

I'll post my solution in a reply, which is very similar to your
except in the overlap prevention code, which, I have to admit, is
pretty ugly. And I'm not even convinced that I got it right!

Dave's code can be corrected by realizing that since all suffix
strings end at the same place (the end of the string), then of the two
adjacent strings being tested, one is a suffix of the other.

This means that to detect overlap, the following test can be used:

  if prefix.length + s1.length > s2.length then
    # Overlap
  end

where "prefix" is the current prefix being checked in the two adjacent
suffix strings.

Here is a picture. Pretend in the "ababab" case, we are checking the
adjacent strings "abab" and "ababab". Since one is a suffix of the
other, they can be lined up as they appeared in the original string
(in your mind):

   abab
ababab

Now, the prefix being checked might be "aba". It matches both strings,
but if you check "aba".length + s1.length (7), it's too long to fit in
s2.length (6). In other words, they line up like this:

  ababab # s2
    abab # s1
  aba # prefix, lined up with s2
    ^
    `---- # overlap exists because the prefix
          # as lined up with s2 overlaps with s1
          # when s1 is lined up with s2 as they
          # appear in the original string. In other
          # words, the "aba" in s2 goes past the
          # beginning of the "aba" in s1.

Adding this test (instead of the s2.length / 2 test) and also testing
adjacent strings that start with the prefix currently being searched
(to find later matches if earlier ones overlap) would correct Dave's
solution and shouldn't be much more complicated.

-JJ

···

On Jan 20, 2:23 pm, "Eric I." <rubytrain...@gmail.com> wrote:

On Jan 20, 12:04 pm, Dave Thomas <d...@pragprog.com> wrote:

Hi Alex,

Have the same problem with you :slight_smile:

I think we should understand the 'adjacent' as 'not overlapped adjacent'
here. Each time we take two adjacent suffix, we check whether they
overlapped first - if so, make s2 point to next suffix, until we found a not
overlapped one.

For aaaa: on the step of comparing aa and aaa, we found they overlapped, so
we keep s1 = aa, but make s2 = aaaa, the one next to aaa.

Thanks
Jan

···

On Jan 22, 2008 4:10 PM, Alex Shulgin <alex.shulgin@gmail.com> wrote:

On Jan 20, 7:04 pm, Dave Thomas <d...@pragprog.com> wrote:
> My hack at the substring problem (nice one, James) is based on suffix
> trees.
>
> Suffix trees and suffix arrays and a great tool for substring
> operations. The idea is to create a list of all the possible suffixes
> in the original string. For "banana" this would be
>
> banana
> anana
> nana
> ana
> na
> a
>
> Because the list contains an entry starting at every position in the
> string, we know that all possible substrings of the original string
> must occur at the start of one of these list elements. The substring
> "nan", for example, occurs at the start of the 3rd entry.
>
> Now sort them
>
> a
> ana
> anana
> banana
> na
> nana
>
> Now we know that all substrings that start with the same sequence of
> characters must occur at the start of items in the list that are
> adjacent. Searching through to list to find these is a linear operation.

Hi Dave,

Thanks for the brilliant idea--this time I really feel like I've
learned something new! :slight_smile:

I even took my time to implement it in C (80 lines) and I must admit
that ruby performs very well in this case: my C version is only about
6 times faster than your ruby...

However, with the algorithm presented, I don't see how one can obtain
the correct result for the "aaaa" case. Lets see; the suffix list
would be:

aaaa
aaa
aa
a

now, sorted:

a
aa
aaa
aaaa

And the catch is--in order to obtain the correct result, that is "aa",
one needs to observe "aa" and "aaaa", but these are not adjacent. In
this case two pairs of adjacent suffixes ("aa" and "aaa", "aaa" and
"aaaa") might give "aa", but they do overlap.

So do we need a special-case treatment here?

--
Cheers,
Alex

--
Fly on wheels.

> My question is (I'm not familiar with RubyQuiz too much :)): episode
> focus on algorithm (speed) or source code (readable)?

Hopefully both. :slight_smile:

How long input string I could expect?

···

--
Radosław Bułat

http://radarek.jogger.pl - mój blog

sorry. missed that...

cheers.

a @ http://drawohara.com/

···

On Jan 18, 2008, at 5:20 PM, James Gray wrote:

  "Repeated substrings may not overlap."

--
sleep is the best meditation.
h.h. the 14th dalai lama

My solution offers speed nor readability; I went for a one-liner (if you
discount the require):

···

2008/1/18, James Gray <james@grayproductions.net>:

On Jan 18, 2008, at 3:10 PM, Radosław Bułat wrote:

> On Jan 18, 2008 10:00 PM, yermej <yermej@gmail.com> wrote:
>
> My question is (I'm not familiar with RubyQuiz too much :)): episode
> focus on algorithm (speed) or source code (readable)?

Hopefully both. :slight_smile:

James Edward Gray II

> > I wouldn't be surprised if the idea of searching only 1/2
> > of the second string to prevent overlaps is wrong.. :slight_smile:

> I think you're right in that it's wrong. :wink:

...snip

> I'll post my solution in a reply, which is very similar to your
> except in the overlap prevention code, which, I have to admit, is
> pretty ugly. And I'm not even convinced that I got it right!

Dave's code can be corrected by realizing that since all suffix
strings end at the same place (the end of the string), then of the two
adjacent strings being tested, one is a suffix of the other.

This means that to detect overlap, the following test can be used:

if prefix.length + s1.length > s2.length then
# Overlap
end

where "prefix" is the current prefix being checked in the two adjacent
suffix strings.

...

Adding this test (instead of the s2.length / 2 test) and also testing
adjacent strings that start with the prefix currently being searched
(to find later matches if earlier ones overlap) would correct Dave's
solution and shouldn't be much more complicated.

That's very close to what I do, but I go in the opposite direction. I
(try to) track the size window of how many previous sorted suffixes
must be checked against the last in the sequence.

So if I'm comparing substrings a and b in [..., a, b, ...], and if the
amount of non-overlap is smaller than the longest found substring so
far, then I enlarge the window. So now when I'm focusing on c in
[..., a, b, c, ...] I'll compare it to both a and b.

There needs to be a way to shrink the window as well, so it doesn't
grow monotonically, and so if the a and c don't have the same prefix
which exceeds the length of the longest found substring so far, the
window shrinks.

Part of my complexity comes from trying to detect cases where I can
avoid the full comparison early, and so every time I try to do the
avoid (by using "next" to start the next iteration), I must also do
some window management.

Maybe re-orienting the code in your way would simplify things. If you
have the time, I'd be very curious to see what you would come up with.

Eric

···

On Jan 20, 5:18 pm, JJ <jjnoa...@gmail.com> wrote:

On Jan 20, 2:23 pm, "Eric I." <rubytrain...@gmail.com> wrote:
> On Jan 20, 12:04 pm, Dave Thomas <d...@pragprog.com> wrote:

[Note: parts of this message were removed to make it a legal post.]

Hi Alex,

Have the same problem with you :slight_smile:

I think we should understand the 'adjacent' as 'not overlapped adjacent'
here. Each time we take two adjacent suffix, we check whether they
overlapped first - if so, make s2 point to next suffix, until we found a not
overlapped one.

Hm... but every pair of suffixes _are_ overlapped just 'cause all of
them last to the end of the text, so we cannot 'check first'.
Anyway...

For aaaa: on the step of comparing aa and aaa, we found they overlapped, so
we keep s1 = aa, but make s2 = aaaa, the one next to aaa.

This might be an appropriate hack--I'll try it when I have a
minute. :slight_smile:

···

On Jan 22, 10:51 am, Jan <jan.h....@gmail.com> wrote:

--
Cheers,
Alex

Anybody found a good long text to work with yet? The text of the U.S. Constitution perhaps? Tristram Shandy?

I would say that you should handle the biggest text you possibly can. :slight_smile:

James Edward Gray II

···

On Jan 18, 2008, at 3:42 PM, Radosław Bułat wrote:

My question is (I'm not familiar with RubyQuiz too much :)): episode
focus on algorithm (speed) or source code (readable)?

Hopefully both. :slight_smile:

How long input string I could expect?

(Oops, pushed the wrong button.)

$ cat rq153.rb
#!/usr/bin/ruby -w

require 'enumerator'

p(
  ARGV.
  join( ' ').
  reverse.
  to_enum( :each_byte).
  inject( ){ |a, b| a << (a.last || "") + b.chr}.
  map{ |a| a.reverse}.
  inject( ){ |a, b| a << b.match( /^(.+).*\1/).to_a.pop }.
  flatten.
  compact.
  sort_by{ |a| a.size}.
  last
)

$ ./rq153.rb banana
"an"
$ ./rq153.rb aaabaaa
"aaa"
$ ./rq153.rb aaaaaa
"aaa"
$ ./rq153.rb ambidextrous
nil

Don't even think of running it on a reasonably-sized text, lest you want
your PC to grind to a halt.
Morale: ruby code *can* be ugly if you put your mind to it.

Regards,
Raf

···

2008/1/20, Raf Coremans <rrafje@gmail.com>:

2008/1/18, James Gray <james@grayproductions.net>:
>
> On Jan 18, 2008, at 3:10 PM, Radosław Bułat wrote:
>
> > On Jan 18, 2008 10:00 PM, yermej <yermej@gmail.com> wrote:
> >
> > My question is (I'm not familiar with RubyQuiz too much :)): episode
> > focus on algorithm (speed) or source code (readable)?
>
> Hopefully both. :slight_smile:
>
> James Edward Gray II
>

My solution offers speed nor readability; I went for a one-liner (if you
discount the require):