Longest Repeated Substring (#153)

# A solution to RubyQuiz #153.

···

#
# Finds the longest, non-overlapping repeated substring in its input.
#
# See http://www.rubyquiz.com/quiz153.html for details.
#
# The latest version of this solution can also be found at
# http://learnruby.com/examples/ruby-quiz-153.shtml .

# When run, the input can be on the command line, come from standard
# input, or come from a file:
#
# ruby lrs.rb banana
# ruby lrs.rb "Madam I'm Adam."
# ruby lrs.rb -f homer-illiad.txt
# cat homer-illiad.txt | ruby lrs.rb

# The basic technique used by this solution is to create an array of
# all suffixes of the data. So if the input were "banana", the array
# would contain ["banana", "anana", "nana", "ana", "na", "a"]. Then
# we sort this array, so it would now contain ["a", "ana", "anana",
# "banana", "na", "nana"]. Finally we can compare neighboring entries
# in the array to see if they share a long enough prefix to beat the
# current best.

# Extra care must be taken if the substrings are not allowed to
# overlap. Consider the input "ananana"; the longest non-overlapping
# substring is "ana". The array of sorted suffixes of is ["a", "ana",
# "anana", "ananana", "na", "nana", "nanana"]. The 2nd and 3rd items
# can only have a match of "an" because the "ana" would overlap, and
# the same is true with the 3rd and 4th items. However by comparing
# the 2nd and 4th items we can get the desired result of "ana". So
# under certain circumstances we have to compare an item with more
# than just its immediate predecessor.

# This program seems to run reasonably fast. It should run in O(n *
# log n) time in most cases, assuming that Array's sort method
# provides that performance. Due to the rare cases when the program
# cannot just compare an item and its immediate predecessor, there may
# be some strange cases where it requires O(n ** 2). Because Ruby
# allows a computed substring to share the data with the original
# string (until one of the strings is altered, i.e., "copy on write"),
# the memory used is linear to the input size.

# returns the maximum of the two parameters
def max(a, b)
  a >= b ? a : b
end

# Return the longest common prefix between two strings. If max is
# specified then the longest common prefix cannot exceed it
def longest_common_prefix(s1, s2, max = nil)
  l1, l2 = s1.size, s2.size
  min = l1 < l2 ? l1 : l2
  min = min < max ? min : max if max
  min.times do |i|
    return s1.slice(0, i) if s1[i] != s2[i]
  end
  return s1.slice(0, min)
end

# Returns the longest repeated substring in a given string.
def longest_repeated_substring(string)
  size = string.length

  # put every possible suffix into an array
  suffixes = Array.new(size)
  size.times do |i|
    suffixes[i] = string.slice(i, size)
  end

  # sort the array of suffixes, so common substrings (i.e., prefixes
  # of suffixes) will be found in neighboring elements of the array
  suffixes.sort!

  best = ""
  at_least_size = 1 # the size to meet or exceed to be the new best
  distance = nil
  neighbors_to_check = 1

  # compare pairs of consecutive suffixes and see how much initial
  # commonality there is
  # (size - 1).times do |i|
  (1...size).each do |i|
    # p [i, neighbors_to_check]
    s1 = suffixes[i]

    # generally we will only need to compare the ith item and the one
    # preceding it; however if we were in a position to reject a long
    # enough common substring due to overlap issues, then we may have
    # to compare an ith item with additional preceding items;
    # neighbors_to_check tracks how many neighbors we need to check
    neighbors_to_check.downto(1) do |neighbor|
      s2 = suffixes[i - neighbor]

      # make sure that these to suffixes further apart than the size
      # of the current best; we don't explicitly track the index of
      # these suffixes, but since all suffixes go to the end of the
      # initial string, the size can be used as a proxy
      distance = (s1.size - s2.size).abs
      if distance < at_least_size
        if s1.size >= at_least_size &&
            s2.size >= at_least_size &&
            s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
          neighbors_to_check = max(neighbors_to_check, neighbor + 1)
        else
          neighbors_to_check = neighbor
        end
        next
      end

      # if neighboring suffixes don't at least match as far as the
best,
      # no need to check more carefully
      unless s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
        neighbors_to_check = neighbor
        next
      end

      # get the longest common prefix that's no larger than distance,
      # since at that point the substrings overlap
      best = longest_common_prefix(s1, s2, distance)
      at_least_size = best.size + 1
      if best.size == distance
        neighbors_to_check = max(neighbors_to_check, neighbor + 1)
      else
        neighbors_to_check = neighbor
      end
    end
  end

  best
end

if $0 == __FILE__
  string = nil
  if ARGV[0] == "-f"
    open(ARGV[1]) do |f|
      string = f.read
    end
  elsif ARGV.size == 0
    string = STDIN.read
  elsif ARGV[0] =~ /^-/ || ARGV.size > 1
    STDERR.puts "usage:"
    STDERR.puts " #{$0} (note: input comes from standard input)"
    STDERR.puts " #{$0} string"
    STDERR.puts " #{$0} -f filename"
    exit
  else
    string = ARGV[0]
  end

  result = longest_repeated_substring(string)
  puts result && "\"#{result}\" (#{result.length} characters)" ||
"none"
end

Oh, and the run times of my solution for _The Illiad_ and _War and
Peace_ data sets are 9.179s and 43.875s respectively. This is running
on a 2.33GHz processor.

Eric

Nice code. From the "Lost World" by Sir Arthur Conan Doyle, I get...

"ve had the escape of your life, young fellah my lad"

Todd

···

On Jan 20, 2008 1:24 PM, Eric I. <rubytraining@gmail.com> wrote:

# A solution to RubyQuiz #153.
#
# Finds the longest, non-overlapping repeated substring in its input.
#
# See Ruby Quiz - Longest Repeated Substring (#153) for details.
#
# The latest version of this solution can also be found at
# http://learnruby.com/examples/ruby-quiz-153.shtml .

# When run, the input can be on the command line, come from standard
# input, or come from a file:
#
# ruby lrs.rb banana
# ruby lrs.rb "Madam I'm Adam."
# ruby lrs.rb -f homer-illiad.txt
# cat homer-illiad.txt | ruby lrs.rb

# The basic technique used by this solution is to create an array of
# all suffixes of the data. So if the input were "banana", the array
# would contain ["banana", "anana", "nana", "ana", "na", "a"]. Then
# we sort this array, so it would now contain ["a", "ana", "anana",
# "banana", "na", "nana"]. Finally we can compare neighboring entries
# in the array to see if they share a long enough prefix to beat the
# current best.

# Extra care must be taken if the substrings are not allowed to
# overlap. Consider the input "ananana"; the longest non-overlapping
# substring is "ana". The array of sorted suffixes of is ["a", "ana",
# "anana", "ananana", "na", "nana", "nanana"]. The 2nd and 3rd items
# can only have a match of "an" because the "ana" would overlap, and
# the same is true with the 3rd and 4th items. However by comparing
# the 2nd and 4th items we can get the desired result of "ana". So
# under certain circumstances we have to compare an item with more
# than just its immediate predecessor.

# This program seems to run reasonably fast. It should run in O(n *
# log n) time in most cases, assuming that Array's sort method
# provides that performance. Due to the rare cases when the program
# cannot just compare an item and its immediate predecessor, there may
# be some strange cases where it requires O(n ** 2). Because Ruby
# allows a computed substring to share the data with the original
# string (until one of the strings is altered, i.e., "copy on write"),
# the memory used is linear to the input size.

# returns the maximum of the two parameters
def max(a, b)
  a >= b ? a : b
end

# Return the longest common prefix between two strings. If max is
# specified then the longest common prefix cannot exceed it
def longest_common_prefix(s1, s2, max = nil)
  l1, l2 = s1.size, s2.size
  min = l1 < l2 ? l1 : l2
  min = min < max ? min : max if max
  min.times do |i|
    return s1.slice(0, i) if s1[i] != s2[i]
  end
  return s1.slice(0, min)
end

# Returns the longest repeated substring in a given string.
def longest_repeated_substring(string)
  size = string.length

  # put every possible suffix into an array
  suffixes = Array.new(size)
  size.times do |i|
    suffixes[i] = string.slice(i, size)
  end

  # sort the array of suffixes, so common substrings (i.e., prefixes
  # of suffixes) will be found in neighboring elements of the array
  suffixes.sort!

  best = ""
  at_least_size = 1 # the size to meet or exceed to be the new best
  distance = nil
  neighbors_to_check = 1

  # compare pairs of consecutive suffixes and see how much initial
  # commonality there is
  # (size - 1).times do |i|
  (1...size).each do |i|
    # p [i, neighbors_to_check]
    s1 = suffixes[i]

    # generally we will only need to compare the ith item and the one
    # preceding it; however if we were in a position to reject a long
    # enough common substring due to overlap issues, then we may have
    # to compare an ith item with additional preceding items;
    # neighbors_to_check tracks how many neighbors we need to check
    neighbors_to_check.downto(1) do |neighbor|
      s2 = suffixes[i - neighbor]

      # make sure that these to suffixes further apart than the size
      # of the current best; we don't explicitly track the index of
      # these suffixes, but since all suffixes go to the end of the
      # initial string, the size can be used as a proxy
      distance = (s1.size - s2.size).abs
      if distance < at_least_size
        if s1.size >= at_least_size &&
            s2.size >= at_least_size &&
            s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
          neighbors_to_check = max(neighbors_to_check, neighbor + 1)
        else
          neighbors_to_check = neighbor
        end
        next
      end

      # if neighboring suffixes don't at least match as far as the
best,
      # no need to check more carefully
      unless s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
        neighbors_to_check = neighbor
        next
      end

      # get the longest common prefix that's no larger than distance,
      # since at that point the substrings overlap
      best = longest_common_prefix(s1, s2, distance)
      at_least_size = best.size + 1
      if best.size == distance
        neighbors_to_check = max(neighbors_to_check, neighbor + 1)
      else
        neighbors_to_check = neighbor
      end
    end
  end

  best
end

if $0 == __FILE__
  string = nil
  if ARGV[0] == "-f"
    open(ARGV[1]) do |f|
      string = f.read
    end
  elsif ARGV.size == 0
    string = STDIN.read
  elsif ARGV[0] =~ /^-/ || ARGV.size > 1
    STDERR.puts "usage:"
    STDERR.puts " #{$0} (note: input comes from standard input)"
    STDERR.puts " #{$0} string"
    STDERR.puts " #{$0} -f filename"
    exit
  else
    string = ARGV[0]
  end

  result = longest_repeated_substring(string)
  puts result && "\"#{result}\" (#{result.length} characters)" ||
"none"
end

How many memory do you guys use when running the code? I run of of memory
when read illiad ... I have 1G ddr.

Thanks,
Jan

···

On Jan 21, 2008 10:00 PM, Todd Benson <caduceass@gmail.com> wrote:

On Jan 20, 2008 1:24 PM, Eric I. <rubytraining@gmail.com> wrote:
> # A solution to RubyQuiz #153.
> #
> # Finds the longest, non-overlapping repeated substring in its input.
> #
> # See Ruby Quiz - Longest Repeated Substring (#153) for details.
> #
> # The latest version of this solution can also be found at
> # http://learnruby.com/examples/ruby-quiz-153.shtml .
>
> # When run, the input can be on the command line, come from standard
> # input, or come from a file:
> #
> # ruby lrs.rb banana
> # ruby lrs.rb "Madam I'm Adam."
> # ruby lrs.rb -f homer-illiad.txt
> # cat homer-illiad.txt | ruby lrs.rb
>
> # The basic technique used by this solution is to create an array of
> # all suffixes of the data. So if the input were "banana", the array
> # would contain ["banana", "anana", "nana", "ana", "na", "a"]. Then
> # we sort this array, so it would now contain ["a", "ana", "anana",
> # "banana", "na", "nana"]. Finally we can compare neighboring entries
> # in the array to see if they share a long enough prefix to beat the
> # current best.
>
> # Extra care must be taken if the substrings are not allowed to
> # overlap. Consider the input "ananana"; the longest non-overlapping
> # substring is "ana". The array of sorted suffixes of is ["a", "ana",
> # "anana", "ananana", "na", "nana", "nanana"]. The 2nd and 3rd items
> # can only have a match of "an" because the "ana" would overlap, and
> # the same is true with the 3rd and 4th items. However by comparing
> # the 2nd and 4th items we can get the desired result of "ana". So
> # under certain circumstances we have to compare an item with more
> # than just its immediate predecessor.
>
> # This program seems to run reasonably fast. It should run in O(n *
> # log n) time in most cases, assuming that Array's sort method
> # provides that performance. Due to the rare cases when the program
> # cannot just compare an item and its immediate predecessor, there may
> # be some strange cases where it requires O(n ** 2). Because Ruby
> # allows a computed substring to share the data with the original
> # string (until one of the strings is altered, i.e., "copy on write"),
> # the memory used is linear to the input size.
>
>
> # returns the maximum of the two parameters
> def max(a, b)
> a >= b ? a : b
> end
>
>
> # Return the longest common prefix between two strings. If max is
> # specified then the longest common prefix cannot exceed it
> def longest_common_prefix(s1, s2, max = nil)
> l1, l2 = s1.size, s2.size
> min = l1 < l2 ? l1 : l2
> min = min < max ? min : max if max
> min.times do |i|
> return s1.slice(0, i) if s1[i] != s2[i]
> end
> return s1.slice(0, min)
> end
>
>
> # Returns the longest repeated substring in a given string.
> def longest_repeated_substring(string)
> size = string.length
>
> # put every possible suffix into an array
> suffixes = Array.new(size)
> size.times do |i|
> suffixes[i] = string.slice(i, size)
> end
>
> # sort the array of suffixes, so common substrings (i.e., prefixes
> # of suffixes) will be found in neighboring elements of the array
> suffixes.sort!
>
> best = ""
> at_least_size = 1 # the size to meet or exceed to be the new best
> distance = nil
> neighbors_to_check = 1
>
> # compare pairs of consecutive suffixes and see how much initial
> # commonality there is
> # (size - 1).times do |i|
> (1...size).each do |i|
> # p [i, neighbors_to_check]
> s1 = suffixes[i]
>
> # generally we will only need to compare the ith item and the one
> # preceding it; however if we were in a position to reject a long
> # enough common substring due to overlap issues, then we may have
> # to compare an ith item with additional preceding items;
> # neighbors_to_check tracks how many neighbors we need to check
> neighbors_to_check.downto(1) do |neighbor|
> s2 = suffixes[i - neighbor]
>
> # make sure that these to suffixes further apart than the size
> # of the current best; we don't explicitly track the index of
> # these suffixes, but since all suffixes go to the end of the
> # initial string, the size can be used as a proxy
> distance = (s1.size - s2.size).abs
> if distance < at_least_size
> if s1.size >= at_least_size &&
> s2.size >= at_least_size &&
> s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
> neighbors_to_check = max(neighbors_to_check, neighbor + 1)
> else
> neighbors_to_check = neighbor
> end
> next
> end
>
> # if neighboring suffixes don't at least match as far as the
> best,
> # no need to check more carefully
> unless s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
> neighbors_to_check = neighbor
> next
> end
>
> # get the longest common prefix that's no larger than distance,
> # since at that point the substrings overlap
> best = longest_common_prefix(s1, s2, distance)
> at_least_size = best.size + 1
> if best.size == distance
> neighbors_to_check = max(neighbors_to_check, neighbor + 1)
> else
> neighbors_to_check = neighbor
> end
> end
> end
>
> best
> end
>
>
> if $0 == __FILE__
> string = nil
> if ARGV[0] == "-f"
> open(ARGV[1]) do |f|
> string = f.read
> end
> elsif ARGV.size == 0
> string = STDIN.read
> elsif ARGV[0] =~ /^-/ || ARGV.size > 1
> STDERR.puts "usage:"
> STDERR.puts " #{$0} (note: input comes from standard input)"
> STDERR.puts " #{$0} string"
> STDERR.puts " #{$0} -f filename"
> exit
> else
> string = ARGV[0]
> end
>
> result = longest_repeated_substring(string)
> puts result && "\"#{result}\" (#{result.length} characters)" ||
> "none"
> end

Nice code. From the "Lost World" by Sir Arthur Conan Doyle, I get...

"ve had the escape of your life, young fellah my lad"

Todd

--
Fly on wheels.

Hi Jan,

I have 2G. I'm surprised that 1G would be a problem, though. I can
run my code on the War and Peace text, which is 4 times larger than
The Illiad.

The problem with pasting code here is that there can be line break
issues depending on the reader you're using. I'm wondering whether a
bad line break broke something. One way around this potential issue
might be to pull the code from:

    http://learnruby.com/examples/ruby-quiz-153.shtml

Eric

···

On Jan 21, 9:28 am, Jan <jan.h....@gmail.com> wrote:

[Note: parts of this message were removed to make it a legal post.]

How many memory do you guys use when running the code? I run of of memory
when read illiad ... I have 1G ddr.

I noticed a huge difference in memory use on whether I run the code
with ruby 1.8 or 1.9.

With ruby 1.8, the suffix-array-based/-like solutions require about
120mb for the Iliad. With 1.9 ... I get an "failed to allocate memory
(NoMemoryError)" error too (virtual memory being disabled, 1GB).

Thomas.