[QUIZ] Text Munger (#76): A solution

First the solution and then the comments:

     1  #!/usr/bin/env ruby
     2  require 'unicode'

     3  class String
     4    Diacritic = Regexp.new("[\xcc\x80-\xcd\xaf]",nil,'u')
     5    Specials = "\xc3\x86\xc3\x90\xc3\x98\xc3\x9e\xc3\x9f\xc3\xa6\xc3\xb0\xc3\xb8\xc3\xbe"
     6    Letter = Regexp.new("[A-Za-z#{Specials}](?:#{Diacritic}*)",nil,'u')
     7    Word = Regexp.new("(#{Letter})(#{Letter}+)(?=#{Letter})",nil,'u')

     8    def scramble
     9      Unicode.compose(Unicode.decompose(self).gsub(Word) {
    10        m = $~
    11        m[1] + m[2].scan(Letter).sort_by{rand}.join})
    12    end
    13  end

    14  if __FILE__ == $0
    15    while gets
    16      puts $_.chomp.scramble
    17    end
    18  end

First of all, we want scramble to be able to handle accented
characters. For this, line 2 requires the unicode package (available
as a gem) for its normalization functions, which decompose an
accented character into a plain Latin letter followed by a diacritic.

The letters in ISO Latin-1 that cannot be decomposed into a plain Latin
letter + diacritic are: Thorn, Eth, AE, stroked O, sharp S. The
corresponding 9 forms (all but the last one come in both small and
capital versions) must be treated as a "special case" in line 5.
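
A quick way to check this claim, again assuming
String#unicode_normalize as a substitute for the unicode gem; the
decomposable? helper is a hypothetical name used only here.

```ruby
# The nine special forms from line 5, written as code point escapes:
# AE, Eth, stroked O, Thorn (capital), sharp s, then the four small forms.
SPECIALS = "\u00c6\u00d0\u00d8\u00de\u00df\u00e6\u00f0\u00f8\u00fe"

# Hypothetical helper: true if canonical decomposition changes the character.
def decomposable?(char)
  char.unicode_normalize(:nfd) != char
end

# None of the special letters should be reported as decomposable.
SPECIALS.each_char do |ch|
  puts "#{ch} decomposable: #{decomposable?(ch)}"
end

# For comparison, an accented letter does decompose.
puts "e-acute decomposable: #{decomposable?("\u00e9")}"
```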

The regular expression in line 6 identifies a possibly accented letter.
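
As an illustration, the following sketch rewrites that pattern as a
regexp literal with code point escapes instead of raw bytes and applies
it to a decomposed string. LETTER and letters_of are names chosen for
this example, and String#unicode_normalize is assumed in place of
Unicode.decompose.

```ruby
# Base letter (ASCII or one of the nine specials) followed by any number
# of combining diacritics from U+0300..U+036F.
LETTER = /[A-Za-z\u00c6\u00d0\u00d8\u00de\u00df\u00e6\u00f0\u00f8\u00fe][\u0300-\u036f]*/

# Hypothetical helper: split a string into possibly accented letters.
def letters_of(text)
  text.unicode_normalize(:nfd).scan(LETTER)
end

letters = letters_of("na\u00efve")

# Each element is one letter together with its diacritics, so the
# i with diaeresis stays a single unit of two code points.
letters.each { |l| puts "#{l} (#{l.codepoints.length} code point(s))" }
```

Keeping the diacritics attached to their base letter is what allows the
letters to be shuffled without accents jumping to a different letter.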

If Ruby had zero-width positive look-behind assertions, i.e.,
Perl's /(?<=pattern)/, the inner letters of a word could be matched as

Word = Regexp.new("(?<=#{Letter})(#{Letter}+)(?=#{Letter})",nil,'u')

Unfortunately, Ruby doesn't have a (?<=...), so we are forced to capture
the first character with the regular expression in line 7, and we have
to remember to put it back unchanged (m[1] in line 11).
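
The capture-and-put-back idea can be sketched on its own, restricted to
ASCII letters for brevity. WORD and scramble_ascii are hypothetical
names, and Array#shuffle is used here instead of sort_by{rand}.

```ruby
# First letter in group 1, inner letters in group 2; the look-ahead keeps
# the last letter out of the match so it stays in place.
WORD = /([A-Za-z])([A-Za-z]+)(?=[A-Za-z])/

# Hypothetical ASCII-only variant of String#scramble.
def scramble_ascii(text)
  text.gsub(WORD) do
    m = Regexp.last_match               # same object as $~
    m[1] + m[2].chars.shuffle.join      # put the first letter back unchanged
  end
end

puts scramble_ascii("scrambled words are still readable")
```

Words of one or two letters never match WORD, so they pass through
untouched, and a three-letter word has only one inner letter to shuffle.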

I might have written lines 10 and 11 together as

$1 + $2.scan(Letter).sort_by{rand}.join})

but you don't have to be a C programmer to understand that using
global variables together with functions that alter them within two
sequence points is a bad idea (See
http://www.parashift.com/c++-faq-lite/misc-technical-issues.html#faq-39.16
).

> I might have written lines 10 and 11 together as

> $1 + $2.scan(Letter).sort_by{rand}.join})

> but you don't have to be a C programmer to understand that using
> global variables together with functions that alter them within two
> sequence points is a bad idea

$~ is not a global variable: it's a local and thread-local variable (like
$_)

$1 ($2, ...) refers to the first (second, ...) substring matched.

Guy Decoux
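
The point about $~ being local can be checked with a small sketch; the
method name match_elsewhere is made up for this illustration.

```ruby
# A method that performs its own match. It gets its own $~, so the
# caller's match data is not touched.
def match_elsewhere
  "xyz" =~ /(y)/
  $1
end

"abc" =~ /(b)/
inner = match_elsewhere

puts inner   # the callee saw its own match
puts $1      # the caller's $1 is still its own first group
```

Note that a block is not a new method body: a match performed inside
the gsub block itself does update the $~ seen by the rest of that block.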

You are absolutely right, of course.

I hope that my mistake did not distract anybody from the point I was
trying to make.

  Stefano


On 23/04/06, ts <decoux@moulon.inra.fr> wrote:

> I might have written lines 10 and 11 together as

> $1 + $2.scan(Letter).sort_by{rand}.join})

> but you don't have to be a C programmer to understand that using
> global variables together with functions that alter them within two
> sequence points is a bad idea

$~ is not a global variable: it's a local and thread-local variable (like
$_)

$1 ($2, ...) refers to the first (second, ...) substring matched.

Guy Decoux