[QUIZ] Longest Repeated Substring (#153)

James_Edward_Gray_II · 18 January 2008 13:23

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.

···

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This week's Ruby Quiz is to write a script that finds the longest repeated
substring in a given text.

Your program will be passed some text on STDIN and is expected to print the
longest repeated substring within that text to STDOUT.

Repeated substrings may not overlap. If more than one substring is repeated
with the same length, you may print any of them. If there is no repeated
substring, the result is an empty string (print nothing).

Example:

  $ echo banana | ruby longest_repeated_substring.rb
  an

  OR

  $ echo banana | ruby longest_repeated_substring.rb
  na

Make sure your code runs efficiently when passed a large text.

James_Edward_Gray_II · 18 January 2008 14:30

Posting the strings you find in a given text and/or timings is not a spoiler.

James Edward Gray II

···

On Jan 18, 2008, at 7:23 AM, Ruby Quiz wrote:

This week's Ruby Quiz is to write a script that finds the longest repeated
substring in a given text.

Make sure your code runs efficiently when passed a large text.

Denis_Hennessy · 18 January 2008 15:49

Should white space and punctuation be treated as part of the substring, or are they to be ignored?

/dh

···

On 18 Jan 2008, at 13:23, Ruby Quiz wrote:

This week's Ruby Quiz is to write a script that finds the longest repeated
substring in a given text.

Your program will be passed some text on STDIN and is expected to print the
longest repeated substring within that text to STDOUT.

Repeated substrings may not overlap. If more than one substring is repeated
with the same length, you may print any of them. If there is no repeated
substring, the result is an empty string (print nothing).

Example:

  $ echo banana | ruby longest_repeated_substring.rb
  an

  OR

  $ echo banana | ruby longest_repeated_substring.rb
  na

Make sure your code runs efficiently when passed a large text.

Dave_Mail_Archive · 18 January 2008 16:37

Must they be adjacent, or does "aaabaaa" contain the repeating substring "aaa"?

Dave

···

On Jan 18, 2008, at 7:23 AM, Ruby Quiz wrote:

Repeated substrings may not overlap. If more than one substring is repeated
with the same length, you may print any of them. If there is no repeated
substring, the result is an empty string (print nothing).

Ken_Bloom · 18 January 2008 20:40

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz
until 48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps
everyone on Ruby Talk follow the discussion. Please reply to the
original quiz message, if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

···

On Fri, 18 Jan 2008 08:23:40 -0500, Ruby Quiz wrote:
=-=-=-=-=

This week's Ruby Quiz is to write a script that finds the longest
repeated substring in a given text.

Your program will be passed some text on STDIN and is expected to print
the longest repeated substring within that text to STDOUT.

Repeated substrings may not overlap. If more than one substring is
repeated with the same length, you may print any of them. If there is
no repeated substring, the result is an empty string (print nothing).

Example:

  $ echo banana | ruby longest_repeated_substring.rb an

  OR

  $ echo banana | ruby longest_repeated_substring.rb na

Make sure your code runs efficiently when passed a large text.

First testcase:
"your banana my banana" should give " banana"

--Ken

--
Ken (Chanoch) Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/

Denis_Hennessy · 20 January 2008 14:13

Here's my solution to the quiz. It's has O(n) behaviour and takes about 1 minute on the illiad and 5 minutes on war and peace on my 1.8GHz linux box (much longer on my powerbook).

It works by constructing a map from every letter in the text to an array of its locations. It then iterates, increasing each string (which sometimes creates splits) and removing strings which don't have at least 2 locations. Thus, for each length n, the algorithm only has to deal with the patterns which already matched with length n-1. This is easiest to see by running it with the verbose option:

$ echo banana | ruby -v find_repeats.rb
ruby 1.8.6 (2007-09-23 patchlevel 110) [powerpc-darwin9.0.0]
Initial: {"a"=>[1, 3, 5], "b"=>[0], "n"=>[2, 4], "\n"=>[6]}
Filter (len=1): {"a"=>[1, 3, 5], "n"=>[2, 4]}
Grow (len=2): {"an"=>[1, 3], "na"=>[2, 4], "a\n"=>[5]}
Filter (len=2): {"an"=>[1, 3], "na"=>[2, 4]}
Grow (len=3): {"na\n"=>[4], "nan"=>[2], "ana"=>[1, 3]}
Filter (len=3): {}
an

Cheers,
Denis

find_repeats.rb:

#!/usr/bin/env ruby

text = ARGF.read
size = text.size

# Build a map from each (1-character) string to a list of its positions
roots = {}
size.times do |o|
   s = text[o,1]
   if roots.has_key? s
     roots[s] << o
   else
     roots[s] = [o]
   end
end

puts "Initial: #{roots.inspect}" if $VERBOSE
len = 1
first = nil
while true do
   # Remove entries which don't have at least 2 non-overlapping occurances
   roots.delete_if do |s,offsets|
     count = 0
     last = nil
     offsets.each do |o|
       next if last && last+len > o
       last = o
       count += 1
     end
     count < 2
   end
   puts "Filter (len=#{len}): #{roots.inspect}" if $VERBOSE
   break if roots.size == 0
   first = roots[roots.keys[0]][0]

   # Increase len by 1 and replace each existing root with the set of longer roots
   len += 1
   new_roots = {}
   roots.each do |s,offsets|
     offsets.each do |o|
       next if o > size - len
       s = text[o,len]
       if new_roots.has_key? s
         new_roots[s] << o
       else
         new_roots[s] = [o]
       end
     end
   end
   roots = new_roots
   puts "Grow (len=#{len}): #{roots.inspect}" if $VERBOSE
end

exit if first == nil

puts text[first,len-1]

Justin_Ethier · 20 January 2008 14:30

Here is my solution. It is easy to follow but breaks down on larger inputs:

# Find first non-overlapping repeated substring contained in the input
string.
# Search space is smaller for longer substrings, so search for longest ones
first.
# Returns - Longest repeated substring, or nil if none
def longest_repeated_substring(input)
len = input.size / 2 # Max size is half total length, since strings cannot
overlap

  while len > 0
    # Find all substrings of given length
    sub_strings = {}
    for i in 0...input.size-len
      sub_str = input[i..i+len]

      if not sub_strings.has_key?(sub_str)
        sub_strings[sub_str] = i+len # Add to list, track end pos for
overlaps
      elsif sub_strings[sub_str] < i
        return sub_str # First non-overlapping match ties for longest
      end
    end

len -= 1
end

nil
end

puts longest_repeated_substring(ARGV[0])

Thanks,

Justin

Guillaume_Carbonneau · 20 January 2008 18:10

Here is my solution using Suffix arrays

class SuffixArray
  attr_accessor :suffix_array
  def initialize(the_string)
    @the_string = the_string
    @suffix_array = Array.new
    #build the suffixes
    last_index = the_string.length-1
    (0..last_index).each do |i|
      the_suffix = the_string[i..last_index]
      the_position = i
      # << is the append (or push) operator for arrays in Ruby
      @suffix_array << { :suffix=>the_suffix, :position=>the_position }
    end

    #sort the suffix array
    @suffix_array.sort! { |a,b| a[:suffix] <=> b[:suffix] }
  end

end

text = STDIN.read

highest_count = 0
longest_string = ""
sa = SuffixArray.new(text)
sa.suffix_array.each_with_index do |s,i|
  j = 1
  if sa.suffix_array[i+1]
    while sa.suffix_array[i][:suffix][0,j] ==
sa.suffix_array[i+1][:suffix][0,j]
      if j > highest_count
        highest_count = j
        longest_string = sa.suffix_array[i][:suffix][0,j]
      end
      j += 1
    end
  end

end
p longest_string

···

---------------------

I get the following times

ILLIAD :
$ time cat homer-illiad.txt | ruby suffix.rb
" who are\nperishing and coming to a bad end. We will, however, since you
so\nbid us, refrain from actual fighting, but we will make
serviceable\nsuggestions to the Argives"

real 1m15.566s
user 1m14.810s
sys 0m0.410s

WAR AND PEACE
$ time cat wrnpc12.txt | ruby suffix.rb
"\r\n\r\nThe Project Gutenberg Literary Archive Foundation has been "

real 4m55.145s
user 4m49.979s
sys 0m1.951s

–
View this message in context: http://www.nabble.com/-QUIZ--Longest-Repeated-Substring-(-153)-tp14953800p14984651.html
Sent from the ruby-talk mailing list archive at Nabble.com.

Dido_Sevilla · 21 January 2008 08:01

I'm guessing that someone wants to break a Vigenere cipher. The
longest repeated substring in text enciphered with a Vigenere cipher
has a close relationship to the length of the key, and once the key
length is determined, the cipher is halfway broken. At least that's
one application for this sort of algorithm...

···

On Jan 18, 2008 9:23 PM, Ruby Quiz <james@grayproductions.net> wrote:

Your program will be passed some text on STDIN and is expected to print the
longest repeated substring within that text to STDOUT.

--
普通じゃないのが当然なら答える私は何ができる？
普通でも普通じゃなくて感じるまま感じることだけをするよ！
http://stormwyrm.blogspot.com

ThoML · 21 January 2008 19:34

Hi,

With some delay, here is my solution. I started on Friday with my
take
on making suffix trees respect overlaps but then stopped. I have to
admit that it was Eric's solution that made take another look on
this.
The current version is about 3 times slower than Eric's submission.
It
passes all my test cases though. It also takes the idea from (IIRC)
Dave
Thomas's submission to calculate the original start position on the
basis of the string's length.

BTW it seems to me that ruby19 (cygwin, win xp) uses a __lot__ more
memory for this.

Regards,
Thomas.

#!/usr/bin/env ruby

class String
   def longest_substring_st4
        suffixes = Array.new(size) {|i| self[i..size]}
        suffixes.sort!
        common = ''
        comlen = 0
        suffixes.each_with_index do |curr, index|
            next if index == 0
            curridx = size - curr.size
            pindex = index - 1
            pindex.downto(pindex - comlen) do |i|
                pred = suffixes[i]
                psize = pred.size
                predidx = size - psize
                maxlen = [(predidx - curridx).abs, psize].min - 1
                next if maxlen < comlen
                prefix = pred[0 .. comlen]
                break if prefix != curr[0..comlen]
                (comlen + 1 .. maxlen).each do |i|
                    p = pred[i]
                    c = curr[i]
                    if p == c
                        prefix << p
                    else
                        break
                    end
                end
                common = prefix
                comlen = common.size
                break if comlen <= maxlen
            end
        end
        return common
    end
end

if __FILE__ == $0
    if ARGV.empty?
        puts String.new(STDIN.read).longest_substring_st4
    else
        ARGV.each {|f| puts
String.new(File.read(f)).longest_substring_st4}
    end
end

SERGEY_VOLKOV · 22 January 2008 10:20

Suffix tree is brilliant method, but has been invented long time ago,
besides, others submitted 'suffix tree' solution already.
It was challenge!

Please take a look at my solution (attached),
in my experiments it's faster than Eric's code.

BRs
Sergey Volkov

Q153_SerVo.rb (3.47 KB)

···

--------------------------------------------------------------
# Times
[vsv@gx Q153]$ ruby -v
ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-linux]

### Eric's code
[vsv@gx Q153]$ time ruby Q153_EricI.rb <homer-illiad.txt
" who are
perishing and coming to a bad end. We will, however, since you so
bid us, refrain from actual fighting, but we will make serviceable
suggestions to the Argives" (168 characters)

real 0m14.263s
user 0m14.185s
sys 0m0.076s

### My solution
[vsv@gx Q153]$ time ruby Q153_SerVo.rb <homer-illiad.txt
who are
perishing and coming to a bad end. We will, however, since you so
bid us, refrain from actual fighting, but we will make serviceable
suggestions to the Argives
[168]

real 0m7.676s
user 0m7.456s
sys 0m0.220s

### War and Peace
[vsv@gx Q153]$ time ruby Q153_SerVo.rb <wrnpc12.txt

The Project Gutenberg Literary Archive Foundation has been
[63]

real 0m30.511s
user 0m29.546s
sys 0m0.424s

### Bonus feature:
### With linebreaks removed from paragraph body
[vsv@gx Q153]$ time ruby Q153_SerVo.rb -t <homer-illiad.txt
a messenger from Jove, who, though he be not near, yet takes thought for you and pities you. He bids you get the Achaeans instantly under arms, for you shall take Troy. There are no longer divided counsels among the gods; Juno has brought them over to her own mind, and woe betides the Trojans at the hands of Jove. Remember this
[330]

real 0m14.672s
user 0m14.129s
sys 0m0.544s
------------------------------------------------------------

Urs_Meyer · 22 January 2008 12:58

Hi there,

thought first of two loops, one that moves the start position
of the candidate substring, and another that makes it longer.
It turned out that one loop that combines both is sufficient.

Without considering searching (text.index) the loop is O(n).
However, the index method is O(n). Thus, worst case (for my
program) is O(n^2).

Can one do it in O(n*log(n))?

···

--

#!/usr/bin/ruby
#
# ruby quiz 153 - longest repeated substring
#
# print the first one found if several such substrings exist
#
# Urs Meyer, 2008-01-22

--

# read text from STDIN, convert to lower case, as you like
text = STDIN.read.tr("\nA-Z"," a-z")

# start position, determines the remaining search space
start = 0
longest_substring = ""

# search while remaining search space is at least twice the
# the size of the currently longest substring

while (2 * longest_substring.size) < (text.length - start)

    # generate substring to search for with size is one bigger
    # than longest found so far
    substring = text[start...(start+longest_substring.size+1)]

# search for it
i = text.index(substring, start+substring.size)

    if i.nil?
        # nothing found, advance start position
        start += 1
    else
        # found a longer one, record it
        longest_substring = substring
    end
end

puts longest_substring

--

$ echo banana | ruby lrs.rb
an
$ echo aa | ruby lrs.rb
a
$ echo aaa | ruby lrs.rb
a
$ echo aaaa | ruby lrs.rb
aa
$

some timings...

using random characters:
$ ruby -e "puts Array.new(100000){('a'[0]+rand(26)).chr}.join" | ( time ruby lrs.rb > /dev/null )
21.015u 0.046s 0:21.39 98.4% 0+0k 0+0io 1636pf+0w

with just a bunch of 'a's the job is easier:
$ ruby -e "puts 'a'*100000" | ( time ruby lrs.rb > /dev/null )
3.390u 0.015s 0:03.50 97.1% 0+0k 0+0io 3705pf+0w

lets see whether the result is correct:
$ ruby -e "puts 'a'*100000" | ruby lrs.rb | wc -c
50001

oops, but
$ echo "a" | wc -c
2

o.k. we are fine

if you are curious, I ran it a few times on 100000 random chars:
lnxmfqu
nvpban
ibhwilj
eggfur
mbljqx
gzvxhw
...

~~
Regards
Urs

--
GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS.
Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail

James_Edward_Gray_II · 18 January 2008 15:53

I vote that we treat all characters equally.

James Edward Gray II

···

On Jan 18, 2008, at 9:49 AM, Denis Hennessy wrote:

On 18 Jan 2008, at 13:23, Ruby Quiz wrote:

This week's Ruby Quiz is to write a script that finds the longest repeated
substring in a given text.

Your program will be passed some text on STDIN and is expected to print the
longest repeated substring within that text to STDOUT.

Repeated substrings may not overlap. If more than one substring is repeated
with the same length, you may print any of them. If there is no repeated
substring, the result is an empty string (print nothing).

Example:

  $ echo banana | ruby longest_repeated_substring.rb
  an

  OR

  $ echo banana | ruby longest_repeated_substring.rb
  na

Make sure your code runs efficiently when passed a large text.

Should white space and punctuation be treated as part of the substring, or are they to be ignored?

James_Edward_Gray_II · 18 January 2008 16:43

They do not have to be adjacent. aaa would be a valid answer for aaabaaa.

James Edward Gray II

···

On Jan 18, 2008, at 10:37 AM, Dave Thomas wrote:

On Jan 18, 2008, at 7:23 AM, Ruby Quiz wrote:

Repeated substrings may not overlap. If more than one substring is repeated
with the same length, you may print any of them. If there is no repeated
substring, the result is an empty string (print nothing).

Must they be adjacent, or does "aaabaaa" contain the repeating substring "aaa"?

yermej · 18 January 2008 21:00

And a second:
"aaaaaa" should give "aaa"

Right?

···

On Jan 18, 2:38 pm, Ken Bloom <kbl...@gmail.com> wrote:

First testcase:
"your banana my banana" should give " banana"

--Ken

--
Ken (Chanoch) Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.http://www.iit.edu/~kbloom1/

ThoML · 18 January 2008 22:06

First testcase:
"your banana my banana" should give " banana"

For the GNU GENERAL PUBLIC LICENSE Version 2, June 1991, I get:

"customarily used for software interchange; or, "... some more
whitespace

For the GPL3 I get:

") Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by "...

If somebody can verify this.

I think this really starts to make fun when running the script over
Mark Twains entire work (from Gutenberg) or something similar.

Dave_Mail_Archive · 20 January 2008 17:04

My hack at the substring problem (nice one, James) is based on suffix trees.

Suffix trees and suffix arrays and a great tool for substring operations. The idea is to create a list of all the possible suffixes in the original string. For "banana" this would be

banana
anana
nana
ana
na
a

Because the list contains an entry starting at every position in the string, we know that all possible substrings of the original string must occur at the start of one of these list elements. The substring "nan", for example, occurs at the start of the 3rd entry.

Now sort them

a
ana
anana
banana
na
nana

Now we know that all substrings that start with the same sequence of characters must occur at the start of items in the list that are adjacent. Searching through to list to find these is a linear operation.

A suffix array is basically this data structure. For efficiency, though, it doesn't store the actual strings. Instead it stores the offset of the start of the string (for our example, this would be [6, 4, 2, 1, 5, 3]). You'll find a whole bunch of stuff on using suffix arrays for searching and string matching, particularly in the area of DNA sequencing.

It turns out that in Ruby, you don't always have to go to the trouble of constructing the suffix array. When you take a substring in Ruby, it doesn't copy the string data. Instead, it just constructs a new string object (just a few words of memory) that references into the original string. Only when either string is modified does the string content get copied. This means the code for constructing the suffix list is both simple and relatively efficient:

   def initialize(text)
     @suffix_list = []
     len = text.length
     len.times { |i| @suffix_list << text[i, len] } # the ,len] is a hack...
     @suffix_list.sort!
   end

The only ugly part is the use of [i, len] to create the substring. Technically it should be 1..len, but that constructs a new, unnecessary range object. Using 'len' as the final index does the same thing, because the substring stops at the end of the input string, but it can be misleading to read.

The code to scan the suffix list looks at each adjacent pair, and sees if it can find a longer matching substring at the start of each that it has previously found. Because James disallowed overlapping substrings, you have to be careful to look at no more that 1/2 of the longest string. The basic loop looks like this:

s1 = @suffix_list[0]

1.upto(@suffix_list.size - 1) do |i|

s2 = @suffix_list[i]
max_possible = s2.length / 2 # stop them overlapping

       while # quick sanity check - saves doing the more expensive substring if it fails
              s1[max_length_so_far] == s2[max_length_so_far] &&
              # stop strings from overlapping
              max_length_so_far < max_possible &&
              # brute force comparison
              s1[0,max_plus_one] == s2[0,max_plus_one]

         max_length_so_far = max_plus_one
         max_plus_one += 1
         found_at = i
       end
       s1 = s2
     end

The while loop has three conditions. The last one is the money earner: it looks for a match at the start of the two strings which is one longer that the longest match found so far. If it finds it, it records this as the new longest match, and then tries for a even longer one with the current pair.

The second condition on the while loop stops us considering more than 1/2 the second string. Because our strings are sorted, if thereis overlap, the second string will always be the one that contains the overlapping segment of the first.

The first condition is just an optimization: it stops us doing the more expensive substring comparison if the last characters of the two substrings we're comparing don't match.

So, put it all together, and we have

# - - - - - - - - - - - - -

class CommonSeq

   def initialize(text)
     @suffix_list = []
     len = text.length
     len.times { |i| @suffix_list << text[i, len] } # the ,len] is a hack...
     @suffix_list.sort!
   end

   def find_substrings
     max_length_so_far = 0
     max_plus_one = 1 # save a little math in the loop
     found_at = nil

# Look at all adjacent pairs of suffices.
s1 = @suffix_list[0]

1.upto(@suffix_list.size - 1) do |i|

s2 = @suffix_list[i]
max_possible = s2.length / 2 # stop them overlapping

       while # quick sanity check - saves doing the more expensive substring if it fails
              s1[max_length_so_far] == s2[max_length_so_far] &&
              # stop strings from overlapping
              max_length_so_far < max_possible &&
              # brute force comparison
              s1[0,max_plus_one] == s2[0,max_plus_one]

         max_length_so_far = max_plus_one
         max_plus_one += 1
         found_at = i
       end
       s1 = s2
     end

     if found_at
       suffix = @suffix_list[found_at]
       [max_length_so_far, suffix[0, max_length_so_far]]
     else
       nil
     end
   end
end

if __FILE__ == $0
seq = CommonSeq.new(STDIN.read.chomp)
puts seq.find_substrings
end

# - - - - - - - - - - - - -

Run times are shown for the Illiad and War and Peace:

   dave[tmp/seq] time ruby seq.rb <homer.txt
   168
    who are
   perishing and coming to a bad end. We will, however, since you so
   bid us, refrain from actual fighting, but we will make serviceable
   suggestions to the Argives
   ruby seq.rb < homer.txt 2.82s user 0.05s system 99% cpu 2.872 total

dave[tmp/seq] time ruby seq.rb <war.txt
63

The Project Gutenberg Literary Archive Foundation has been
ruby seq.rb < war.txt 12.49s user 0.17s system 99% cpu 12.737 total

Here's the somewhat basic set of tests I used

# - - - - - - - - - - - - -
require 'seq'
require 'test/unit'

class CommonSeq
attr_reader :suffix_list
end

class TestSeq < Test::Unit::TestCase

   def test_basic_suffix_creation
     cs = CommonSeq.new("banana")
     assert_equal(%w{a ana anana banana na nana}, cs.suffix_list)
   end

   def test_empty
     cs = CommonSeq.new("")
     assert_nil cs.find_substrings
   end

   def test_length_one
     cs = CommonSeq.new("a")
     assert_nil cs.find_substrings
   end

   def test_length_two_no_match
     cs = CommonSeq.new("ab")
     assert_nil cs.find_substrings
   end

   def test_length_two_with_match
     cs = CommonSeq.new("aa")
     assert_equal [ 1, "a"], cs.find_substrings
   end

   def test_length_three_no_match
     cs = CommonSeq.new("abc")
     assert_nil cs.find_substrings
   end

   def test_length_three_adjacent_match
     cs = CommonSeq.new("aab")
     assert_equal [ 1, "a"], cs.find_substrings
   end

   def test_length_three_separated_match
     cs = CommonSeq.new("aba")
     assert_equal [ 1, "a"], cs.find_substrings
   end

   def test_does_not_find_overlapping_match_length_one
     cs = CommonSeq.new("aaa")
     assert_equal [ 1, "a"], cs.find_substrings
   end

   def test_does_not_find_overlapping_match_length_three
     cs = CommonSeq.new("aaaa")
     assert_equal [ 2, "aa"], cs.find_substrings
   end

   def test_does_not_find_overlapping_match_length_two
     cs = CommonSeq.new("ababa")
     assert_equal [ 2, "ab"], cs.find_substrings
   end

   def test_banana
     cs = CommonSeq.new("banana")
     assert_equal [ 2, "an"], cs.find_substrings
   end
end
# - - - - - - - - - - - - -

I wouldn't be surprised if the idea of searching only 1/2 of the second string to prevent overlaps is wrong..

Dave

Adam_Shelly · 21 January 2008 22:07

I had the idea to use suffix trees too, but I literally built a tree.
Each node holds a start index and length for a suffix.
As I add more substrings, I count the number of matching characters
with any previous suffixes, and record the max as needed (after
handling overlap).

Unfortunately, this turns out to be slower that the ones that just
sort all the strings,
and runs out of memory on the Illiad...

-Adam

···

---

class String
  #helper - counts number of matching characters.
  def matchlen other
    i=0
    other.each_byte{|b|
      break if b!=self[i]
      i+=1
    }
    i
  end
end

class Node
  def initialize start,len,tail=nil
    @i=start
    @l=len
    @h=tail ? tail : {}
  end

  def insert idx,len,matched=0
    match = @h[$STR[idx]]
    #add or expand child
    if match
      match.expand(idx,len,matched)
    else
      @h[$STR[idx]]=Node.new(idx,len)
    end
  end

  def expand idx,len,matched
    #count matching characters
    matchlen = $STR[@i,@l].matchlen($STR[idx,len])

    updateMax(idx-matched, matchlen+matched) if matchlen+matched > $max
    if matchlen < @l
      #split if partial match
      split_at(matchlen, idx,len)
    else
      #else add remainder of unmatched characters
      insert(idx+@l, len-@l, matchlen+matched)
    end
  end

  def split_at point, idx,len
    #one child contains the remainder of the original string(s)
    newchild = Node.new(@i+point,@l-point, @h)
    @h={}
    @h[$STR[@i+point]]=newchild
    #the other child has the remainder of the new str
    @h[$STR[idx+point]]=Node.new(idx+point, len-point)
    @l=point #shorten our length
  end

  def updateMax idx,matchlen
    #if our string ends past the beginining of the match,
    # discount the overlap
    overlap = @i+@l-idx-1
    matchlen-=overlap if overlap > 0
    if matchlen>$max
      $max = matchlen
      $maxpt = idx
    end
  end
end

$STR=ARGF.read
$max=0
$maxpt=0
slen = $STR.length
half= (slen/2.0).ceil
@root=Node.new(0,0)

#we don't have to look at any substrings longer than half the input
(0..half).each{|i|
  @root.insert(i,half)
}
#and the ones that start in the second half of the input are shorter still
(half+1..slen).each{|i|
  len = slen-i
  break if len<$max
  @root.insert(i,len)
}

puts $STR[$maxpt,$max].inspect
puts "\n#{$max} chars"

Philip_Gatt · 18 January 2008 16:50

I hope I get some time to take a shot at this one. It looks like fun.

···

On Jan 18, 2008, at 8:43 AM, James Gray wrote:

On Jan 18, 2008, at 10:37 AM, Dave Thomas wrote:

On Jan 18, 2008, at 7:23 AM, Ruby Quiz wrote:

Repeated substrings may not overlap. If more than one substring is repeated
with the same length, you may print any of them. If there is no repeated
substring, the result is an empty string (print nothing).

Must they be adjacent, or does "aaabaaa" contain the repeating substring "aaa"?

They do not have to be adjacent. aaa would be a valid answer for aaabaaa.

James Edward Gray II

Radoslaw_Bulat · 18 January 2008 21:10

It should, otherwise "banana" would give "ana" rather than "an".

My question is (I'm not familiar with RubyQuiz too much :)): episode
focus on algorithm (speed) or source code (readable)?

···

On Jan 18, 2008 10:00 PM, yermej <yermej@gmail.com> wrote:

On Jan 18, 2:38 pm, Ken Bloom <kbl...@gmail.com> wrote:

And a second:
"aaaaaa" should give "aaa"

Right?

--
Radosław Bułat

http://radarek.jogger.pl - mój blog

Topic		Replies	Views
Longest Repeated Substring (#153) ruby-talk	5	90	22 January 2008
Please Forward: Ruby Quiz Submission ruby-talk	0	117	20 January 2008
[QUIZ] Longest Repeated Substring (#153) ruby-talk	5	117	24 January 2008
Please Forward: Ruby Quiz Submission ruby-talk	0	119	20 January 2008
[SUMMARY] Longest Repeated Substring (#153) ruby-talk	0	131	24 January 2008

[QUIZ] Longest Repeated Substring (#153)

Related topics