[QUIZ] TumbleDRYer (#53)

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

···

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Hugh Sasse

The Pragmatic Programmers (and others) recommend that you remove repetition from
code. They call this principle DRY: Don't Repeat Yourself. Benefits include
not having to track down multiple places to make a change affecting the whole
system, prevention of inconsistency, as well as reduced debugging time.
Sometimes code ends up being repetitive. Take this MySQL example from:

  http://manuals.rubyonrails.com/read/chapter/48

  CREATE TABLE `authors` (
    `id` int(11) NOT NULL auto_increment,
    `firstname` varchar(50) NOT NULL default '',
    `name` varchar(50) NOT NULL default '',
    `nickname` varchar(50) NOT NULL default '',
    `contact` varchar(50) NOT NULL default '',
    `password` varchar(50) NOT NULL default '',
    `description` text NOT NULL,
    PRIMARY KEY (`id`)
  ) TYPE=MyISAM AUTO_INCREMENT=3 ;
  
  CREATE TABLE `categories` (
    `id` int(11) NOT NULL auto_increment,
    `name` varchar(20) NOT NULL default '',
    `description` varchar(70) NOT NULL default '',
    PRIMARY KEY (`id`)
  ) TYPE=MyISAM AUTO_INCREMENT=3 ;
  
  CREATE TABLE `categories_documents` (
    `category_id` int(11) NOT NULL default '0',
    `document_id` int(11) NOT NULL default '0',
  ) TYPE=MyISAM ;
  
  CREATE TABLE `documents` (
    `id` int(11) NOT NULL auto_increment,
    `title` varchar(50) NOT NULL default '',
    `description` text NOT NULL,
    `author_id` int(11) NOT NULL default '0',
    `date` date NOT NULL default '0000-00-00',
    `filename` varchar(50) NOT NULL default '',
    PRIMARY KEY (`id`),
    KEY `document` (`title`)
  ) TYPE=MyISAM AUTO_INCREMENT=14 ;

clearly there's a lot of repetition in there. It would be nice if it were
possible to eliminate it. This quiz is about writing a program which we'll call
TumbleDRYer, which, given repetitive input, will create ruby code to generate
that input, such that the generating program has much less repetition in it.
One way might be to generate a program like this:

  # Undoes the work of TumbleDRYer :slight_smile:
  class WashingMachine
    def initialize
      @create = 'CREATE TABLE'
      @type = 'TYPE=MyISAM'
      @varchar_50 = "varchar(50) NOT NULL default '',"
      @varchar_20 = "varchar(20) NOT NULL default '',"
      @int_11 = "int(11) NOT NULL auto_increment,"
      @int_11_0 = "int(11) NOT NULL default '0',"
      @pk = "PRIMARY KEY (`id`)"
      @auto = "AUTO_INCREMENT="
    end
   
    def output
      print <<EOF
     #{@create} `authors` (
       `id` #{@int_11}
       `firstname` #{@varchar_50}
       `name` #{@varchar_50}
       `nickname` #{@varchar_50}
       `contact` #{@varchar_50}
       `password` #{@varchar_50}
       `description` text NOT NULL,
       #{@pk}
     ) #{@type} #{@auto}3 ;

     #{@create} `categories` (
       `id` #{@int_11}
       `name` #{@varchar_20}
       `description` varchar(70) NOT NULL default '',
       #{@pk}
     ) #{@type} #{@auto}3 ;

     #{@create} `categories_documents` (
       `category_id` #{@int_11_0}
       `document_id` #{@int_11_0}
     ) #{@type} ;

     #{@create} `documents` (
       `id` #{@int_11}
       `title` #{@varchar_50}
       `description` text NOT NULL,
       `author_id` #{@int_11_0}
       `date` date NOT NULL default '0000-00-00',
       `filename` #{@varchar_50}
       #{@pk},
       KEY `document` (`title`)
     ) #{@type} #{@auto}14 ;
EOF
    end
  end
   
  WashingMachine.new.output

or something of the kind. Or it might be worth using ERB.

Obviously the more human readable to the resulting program is the better: the
point is to produce something that simplifies maintenance, rather than data
compression. Techniques from data compression would seem to be useful, however.
Building up a dictionary of character groups and their frequencies might be
useful, and various strategies are possible for how to fragment the string into
repeated chunks, such as: the longest repeated substring; splitting on some kind
of boundary; or techniques from Ziv-Lempel coding to create the groups.

[snip]

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Hugh Sasse

[snip]

        CREATE TABLE `documents` (
          `id` int(11) NOT NULL auto_increment,
          `title` varchar(50) NOT NULL default '',
          `description` text NOT NULL,
          `author_id` int(11) NOT NULL default '0',
          `date` date NOT NULL default '0000-00-00',
          `filename` varchar(50) NOT NULL default '',
          PRIMARY KEY (`id`),
          KEY `document` (`title`)
        ) TYPE=MyISAM AUTO_INCREMENT=14 ;

This looks like a fun one to me, as it's one of the main things I use
Ruby for. Looking forward to seeing what others do with it -- and
hoping to clean up an example of my own.

Since I had to rtfm [1] for a couple simple SQL questions (yeah, my
SQL is probably rustier than most), thought I'd save others the time.

q) Why doesn't the 'description' field have a 'default'?
a) "BLOB and TEXT columns cannot be assigned a default value."

q) How should you treat columns that allow NULLs (none of the examples do)?
a) "If neither NULL nor NOT NULL is specified, the column is treated
as though NULL had been specified."

[1] http://dev.mysql.com/doc/refman/5.0/en/create-table.html

···

On 10/28/05, Ruby Quiz <james@grayproductions.net> wrote:

--
Bill Guindon (aka aGorilla)

I have a beige g3 that apple will not support past 10.2. What version of ruby is the latest version that will run on a 10.2 system and where can I get it. 1.8.3 does not compile. Is there anything else I would need to run ruby?

Thanks,
Donald

My approach was to look for repeated "phrases" made up of one or more "words" within a line. Running my solution against the sample input data produces the following output:

class WashingMachine
   def initialize
     @id = '`id` int(11) NOT NULL auto_increment,'
     @va = 'varchar(50) NOT NULL default \'\','
     @in = 'int(11) NOT NULL default \'0\','
     @no = 'NOT NULL default'
     @ty = ') TYPE=MyISAM'
     @de = '`description`'
     @cr = 'CREATE TABLE'
     @pr = 'PRIMARY KEY'
   end
   def output
print <<EOF
#{@cr} `authors` (
   #{@id}
   `firstname` #{@va}
   `name` #{@va}
   `nickname` #{@va}
   `contact` #{@va}
   `password` #{@va}
   #{@de} text NOT NULL,
   #{@pr} (`id`)
#{@ty} AUTO_INCREMENT=3 ;

#{@cr} `categories` (
   #{@id}
   `name` varchar(20) #{@no} '',
   #{@de} varchar(70) #{@no} '',
   #{@pr} (`id`)

#{@ty} AUTO_INCREMENT=3 ;
   #{@cr} `categories_documents` (
   `category_id` #{@in}
   `document_id` #{@in}
#{@ty} ;

#{@cr} `documents` (
   #{@id}
   `title` #{@va}
   #{@de} text NOT NULL,
   `author_id` #{@in}
   `date` date #{@no} '0000-00-00',
   `filename` #{@va}
   #{@pr} (`id`),
   KEY `document` (`title`)
#{@ty} AUTO_INCREMENT=14 ;
EOF
   end
end

Here's my solution:

require 'strscan'
require 'abbrev'

class TumbleDRYer

   # minimum length of a phrase to consider
   MIN_PHRASE = 10

   # minimum times a phrase must occur to consider
   MIN_OCCUR = 3

   # minimum length for abbreviation
   MIN_ABBR = 2

   def initialize(string)
     @input = string
   end

   def dry

     # this will accumulate a list of repeated phrases to condense
     phrases = Array.new

     # this will receive the abbreviation for each phrase
     abbr = Hash.new

     lines = @input.to_a
     loop do

       # process the input data by lines. we find "phrases" by
       # first finding the start and end of each "word" in the line,
       # and then combining those words into longer phrases. for
       # each phrase, we count the number of times it occurs in the
       # total input.
       phr = Hash.new
       lines.each do |line|
         s = StringScanner.new(line)
         words = Array.new
         loop do
           s.scan_until(/(?=\S)/) or break
           beg = s.pos
           s.scan(/\S+/)
           words << [ beg, s.pos ]
         end
         require 'pp'

         # combine words to make 'phrases'
         combos(words)

         # accumulate phrases, counting their occurences.
         # skip phrases that are too short.
         words.each do |from, to|
           p = line[from, to - from]
           next unless p.length >= MIN_PHRASE
           phr[p] ||= 0
           phr[p] += 1
         end
       end

       # get the longest phrase that occurs the most times
       longest = phr.sort_by { |k,v| -(k.length * 1000 + v)
         }.find { |k,v| v >= MIN_OCCUR } or break
       phrase, occurs = longest

       # save the phrase, and then blank it out of the input data
       # so we can search for more phrases
       phrases << phrase
       lines.each { |line| line.gsub!(phrase, ' ' * phrase.length) }

     end

     # now we have all the phrases we want to replace.
     # find unique abbreviations for each phrase.
     temp = Hash.new
     phrases.each do |phrase|
       key = phrase.scan(/\w+/).flatten.to_s.downcase
       key = '_' + key unless key =~ /^[_a-zA-Z]/
       key += '_' while temp.has_key? key
       temp[key] = phrase
     end
     temp.keys.abbrev.sort.each do |s, key|
       phrase = temp[key]
       abbr[phrase] = s if abbr[phrase].nil? ||
         abbr[phrase].length < MIN_ABBR
     end

     # generate the output class
     puts "class WashingMachine"
     puts " def initialize"
     phrases.each do |phrase|
       puts ' @' + abbr[phrase] + " = '" +
         phrase.gsub("'", "\\\\'") + "'"
       @input.gsub!(phrase, '#{@' + abbr[phrase] + '}')
     end
     puts " end\n"
     puts " def output\nprint <<EOF"
     puts @input
     puts "EOF\n end\n"
     puts "end"

   end

   private

   def combos(arr, max = arr.size - 1, i = 0)
     (i+1..max).each do |j|
       arr << [ arr[i][0], arr[j][1] ]
     end
     combos(arr, max, i + 1) if i < max - 1
   end

end

TumbleDRYer.new(ARGF.read).dry

I decided to cheat, here is my solution:

bschroed@oerfi:~/svn/projekte/tumblezip$ ./tumblezip.rb tumblezip.rb >
tumblezip.tumbled.rb
bschroed@oerfi:~/svn/projekte/tumblezip$ cat tumblezip.tumbled.rb
require "zlib"
puts Zlib::Inflate.inflate(DATA.read)
__END__
xSVÔ/-.ÒOÊÌÓ/*Mªäâ*J-,Í,JUP¯ÊÉLRçâ*(-)VPÕ
+4u¸@@U,³²òÌKËI,IÕËÐ▒.!zE(c))0åJññ(r)~.ññJ:
FT­á▒äîÑÈ1}/
bschroed@oerfi:~/svn/projekte/tumblezip$ ruby tumblezip.tumbled.rb
#!/usr/bin/ruby

require 'zlib'

puts %(require "zlib"),
     %(puts Zlib::Inflate.inflate(DATA.read)),
     "__END__",
     Zlib::Deflate.deflate(ARGF.read)

It definitely fails on human readability, though compression and dry
is quite good :wink:

cheers,

Brian

···

--
http://ruby.brian-schroeder.de/

Stringed instrument chords: http://chordlist.brian-schroeder.de/

[snip]

> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>
> by Hugh Sasse

[snip]

> CREATE TABLE `documents` (
> `id` int(11) NOT NULL auto_increment,
> `title` varchar(50) NOT NULL default '',
> `description` text NOT NULL,
> `author_id` int(11) NOT NULL default '0',
> `date` date NOT NULL default '0000-00-00',
> `filename` varchar(50) NOT NULL default '',
> PRIMARY KEY (`id`),
> KEY `document` (`title`)
> ) TYPE=MyISAM AUTO_INCREMENT=14 ;

This looks like a fun one to me, as it's one of the main things I use
Ruby for. Looking forward to seeing what others do with it -- and
hoping to clean up an example of my own.

Since I had to rtfm [1] for a couple simple SQL questions (yeah, my
SQL is probably rustier than most), thought I'd save others the time.

q) Why doesn't the 'description' field have a 'default'?
a) "BLOB and TEXT columns cannot be assigned a default value."

q) How should you treat columns that allow NULLs (none of the examples do)?
a) "If neither NULL nor NOT NULL is specified, the column is treated
as though NULL had been specified."

[1] http://dev.mysql.com/doc/refman/5.0/en/create-table.html

nm, I seriously misread the challenge. either way, looking forward to
it, and might take a stab at it.

···

On 10/28/05, Bill Guindon <agorilla@gmail.com> wrote:

On 10/28/05, Ruby Quiz <james@grayproductions.net> wrote:

--
Bill Guindon (aka aGorilla)

--
Bill Guindon (aka aGorilla)

Bill Guindon wrote:

Since I had to rtfm [1] for a couple simple SQL questions (yeah, my
SQL is probably rustier than most), thought I'd save others the time.

q) Why doesn't the 'description' field have a 'default'?
a) "BLOB and TEXT columns cannot be assigned a default value."

I'm pretty sure that's DBMS-specific (MySQL), and I'm pretty sure SQLite
(for example) allows this. I'm actually not sure about other major DBMSes.

q) How should you treat columns that allow NULLs (none of the examples
do)?
a) "If neither NULL nor NOT NULL is specified, the column is treated
as though NULL had been specified."

That's standard SQL. Most schema-DDLs include optional columns; I like to
omit both the NULL and the NOT.

While we're looking at the MySQL DDL:

q) What's with the backticks (`)?
a) That's MySQL's way of quoting column names. They're unnecessary as long
as the name is alphabetic and not a keyword. I don't know of any other DBMS
that uses backticks for this.

q) What's with the TYPE=?
a) Again, a MySQL feature - MySQL supports different storage engines (and
here was I thinking a DBMS was supposed to manage storage...) and this
specifies which to use. MyISAM is the crippled default one which doesn't
support referential integrity.

q) Why doesn't auto_increment work with other databases?
a) Because defining an auto-increment field is not specified in SQL-92. They
can all do it, though.

q) Why are data types, "default" and "auto_increment" in lowercase when
other keywords are uppercase?
a) I don't know.

For what it's worth, MySQL 5 is a lot better than MySQL 4 (now with in-line
views, I believe), but I'd still recommend Postgresql or really anything
else over it.

Cheers,
Dave

Of course it does not. Look at the patches included in the fink
package for 10.3.
I guess the package would build on 10.2 as well. But I cannot try myself.
Specifically I am not sure the environ patch would work on 10.2 as well.

hth

Michal Suchanek

···

On 10/31/05, Donald Huebschman <dhuebsch@mac.com> wrote:

I have a beige g3 that apple will not support past 10.2. What version
of ruby is the latest version that will run on a 10.2 system and
where can I get it. 1.8.3 does not compile. Is there anything else I
would need to run ruby?

Interesting solution. I've overlooked a lot of the functionality of
strscan, so that is worth knowing about. The approach for
generating mnemonics was nice as well, as well as the reverse sort
by negation.
        Thank you,
        Hugh

I decided to cheat, here is my solution:

        [...]

xSVÔ/-.ÒOÊÌÓ/*Mªäâ*J-,Í,JUP¯ÊÉLRçâ*(-)VPÕ
+4u¸@@U,³²òÌKËI,IÕËÐ▒.!zE(c))0åJññ(r)~.ññJ:
FT­á▒äîÑÈ1}/
bschroed@oerfi:~/svn/projekte/tumblezip$ ruby tumblezip.tumbled.rb
#!/usr/bin/ruby

require 'zlib'

puts %(require "zlib"),
     %(puts Zlib::Inflate.inflate(DATA.read)),
     "__END__",
     Zlib::Deflate.deflate(ARGF.read)

It definitely fails on human readability, though compression and dry

:slight_smile: Well, as the reason for this idea was clearly an aid to code
maintenance, I think I'd fail this, but you'd have to get marks for
subversion, in both the subversIVE and svn senses of the word :slight_smile:

is quite good :wink:

cheers,

Brian

        Thank you,
        Hugh

···

On Wed, 2 Nov 2005, Brian Schröder wrote: