This is not an overly difficult problem. Here's a small and easy-to-follow
solution by Gordon Thiesfeld:
class String
  def munge
    split(/\b/).munge_each.join
  end
end
class Array
  def munge_each
    map { |word| word.split(//).munge_word }
  end
  def munge_word
    first,last,middle = shift, pop,scramble
    "#{first}#{middle}#{last}"
  end
  def scramble
    sort_by{rand}
  end
end
if __FILE__ == $PROGRAM_NAME
  begin
    puts File.open(ARGV[0], 'r').read.munge
  rescue
    puts "Usage: text_munge.rb file"
  end
end
The flow here is simple: bust up the document into words, munge all words, and
stitch it back together. Munging a word is just separating it into characters
and rearranging everything but the first and last character.
Probably the trickiest line in the whole deal is the first and only line in
munge(). It breaks the passed document on word boundaries, which fall at every
place a word begins or ends. Thus, given the sentence:
Here is a simple sentence, for testin' scripts.
Gordon's code will break the document into this Array:
[ "Here", " ", "is", " ", "a", " ", "simple", " ", "sentence", ", ",
"for", " ", "testin", "' ", "scripts", ".\n" ]
It's important to remember that this is the Regular Expression definition of
"words", which includes digit characters and the underscore. That's not a
perfect match for the quiz task, but it was a popular choice nonetheless.
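To see the tokenizing step in isolation, here is a small sketch that runs the
same split() on the sample sentence. It is only an illustration of the
boundary rule, not part of Gordon's script, and the second input string is
made up to show how digits and underscores are treated.

```ruby
# Splitting on \b keeps both the words and the text between them.
sentence = "Here is a simple sentence, for testin' scripts.\n"
pieces   = sentence.split(/\b/)

p pieces
# Nothing is lost: joining the pieces gives back the original text.
p pieces.join == sentence

# Digits and underscores count as word characters, so these stay glued
# to the letters around them.
p "foo_bar 42nd".split(/\b/)
```

Because every character lands in exactly one piece, the final join() can
rebuild the document without tracking whitespace or punctuation separately.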
Now, I did say *all* words are scrambled and that is what I meant. The pieces
between the real words pass through munge_word() as well, so a run of four or
more punctuation characters is treated as a word and its middle characters get
scrambled. In practice, this is rare enough to be a minor issue.
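The whole flow, including the punctuation quirk, can be sketched with plain
methods instead of additions to String and Array. This is a rough restatement
for a newer Ruby (1.9 or later is assumed, for the splat assignment), not
Gordon's code; the method names and the sample text are only for illustration.

```ruby
# Scramble everything except the first and last character of one piece.
def munge_word(piece)
  chars = piece.chars
  return piece if chars.size < 4 # a middle of 0 or 1 characters cannot change
  first, *middle, last = chars
  "#{first}#{middle.shuffle.join}#{last}"
end

# Split on word boundaries, munge every piece, and stitch the text back up.
def munge(text)
  text.split(/\b/).map { |piece| munge_word(piece) }.join
end

# The run "...?! " is a piece of its own, so its middle gets shuffled too.
puts munge("Wait...?! Scrambling punctuation runs too")
```

Skipping pieces that contain no word characters would avoid the quirk, at the
cost of one more test per piece.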
I made a bit of a fuss about multi-byte characters during the discussion, which
some people did try to satisfy. It's only fair I add detail here.
There are many multi-byte character encodings, but I will focus on just the
UTF-8 encoding, because I am way out of my league with anything else. If you
are unfamiliar with Unicode encodings, this article is a pretty good general
introduction:
http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html
The Ruby specifics are harder to come by, sadly.
Basically, Ruby's Unicode support (UTF-8 encoding only) comes through regular
expressions (using matches or methods like split()). They can be made character
aware (instead of byte aware) by properly setting $KCODE. Here's an example:
$ cat byte_string.rb
#!/usr/local/bin/ruby -w
"résumé".split("").each { |chr| p chr }
$ ruby byte_string.rb
"r"
"\303"
"\251"
"s"
"u"
"m"
"\303"
"\251"
$ cat utf8_string.rb
#!/usr/local/bin/ruby -w
$KCODE = "UTF8"
"résumé".split("").each { |chr| p chr }
$ ruby utf8_string.rb
"r"
"é"
"s"
"u"
"m"
"é"
Notice that when I didn't set $KCODE, each two-byte letter was split into its
separate bytes. However, when I told Ruby to be Unicode aware, the bytes stayed
together.
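The transcripts above come from a Ruby that relies on $KCODE. As a hedged aside
for readers on Ruby 1.9 or later, where $KCODE is no longer effective and every
String carries its own encoding, the same byte versus character distinction can
be sketched as follows (Ruby 2.0 or later is assumed, for the UTF-8 source
default and for String#b):

```ruby
word = "résumé"

# Eight bytes, because each é takes two bytes in UTF-8.
p word.bytes.size
# Six characters when the String is treated as UTF-8.
p word.chars.size

# String#b returns a copy marked as raw bytes. Splitting that copy cuts each
# é in two, much like the first transcript.
p word.b.chars.size
```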
That should give you enough background to tell the solutions that can handle
multi-byte text from the ones that can't, giving you more examples to look at.
Here's a multi-byte aware solution from Ross Bamford (-Ku is a shortcut for
$KCODE = "UTF8"):
#!/usr/local/bin/ruby -Ku
$stdout << ARGF.read.gsub(/\B((?![\d_])\w{2,})\B/) do |w|
  $&.split(//).sort_by { rand }
end
That's mainly just a more compact version of Gordon's script. This time though,
we are interested in the results of running it. Watch the é hop around as I
run it a few times:
$ ruby Ross\ Bamford/scramble.rb test_document.txt
Actheatd is my rsuémé.
$ ruby Ross\ Bamford/scramble.rb test_document.txt
Aaectthd is my rmséué.
$ ruby Ross\ Bamford/scramble.rb test_document.txt
Aatcethd is my rémsué.
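To see what Ross's pattern actually selects, here is a sketch that reuses the
regular expression but joins the shuffled characters explicitly, since Ruby 1.9
and later no longer turn an Array into a joined String. The constant name, the
method name, and the sample text are assumptions made for this illustration,
and the input sticks to ASCII so the result doesn't depend on how a given Ruby
version treats accented letters in \w.

```ruby
# \B at both ends means the match can neither start at the first character of
# a word nor end at its last, so only the middle is selected. The lookahead
# stops a match from starting on a digit or an underscore.
MIDDLE = /\B((?![\d_])\w{2,})\B/

def scramble(text)
  text.gsub(MIDDLE) { |middle| middle.chars.shuffle.join }
end

puts scramble("Attached is my cover letter.")
```

Two-letter words such as "is" have no middle of two or more characters, so the
pattern leaves them alone.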
Gordon's solution is not multi-byte aware out of the box. Watch how things
change with that:
$ ruby Gordon\ Thiesfeld/scramble.rb test_document.txt
Achttead is my résumé.
$ ruby Gordon\ Thiesfeld/scramble.rb test_document.txt
Aehttacd is my résumé.
$ ruby Gordon\ Thiesfeld/scramble.rb test_document.txt
Aheatctd is my résum?.?
In order to make sense of that, you need to see how the code found the words in
that line:
["Attached", " ", "is", " ", "my", " ", "r", "\303\251", "sum", "\303\251.\n"]
See how the last é is lumped in with the end punctuation? That makes the group
of characters long enough to scramble. Once scrambled, the two bytes of the é
are no longer adjacent, so they become junk characters my terminal doesn't know
how to display.
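On a newer Ruby the same byte-wise split can be reproduced on purpose by
marking the text as raw bytes first. This is a sketch that assumes Ruby 2.0 or
later (for String#b) and that the sample line matches the test document; it is
not how the original run was produced.

```ruby
line = "Attached is my résumé.\n"

# With the text marked as raw bytes, only ASCII letters, digits, and the
# underscore count as word characters, so the bytes of each é fall outside
# the words.
pieces = line.b.split(/\b/)

p pieces.size       # ten pieces, as in the Array above
p pieces.last.bytes # the two bytes of the last é, then "." and "\n"
```

The last piece is four bytes long, so its two middle bytes are eligible for
shuffling, and that is what separates the halves of the é.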
The good news is, we can magically fix Gordon's script:
$ ruby -Ku Gordon\ Thiesfeld/scramble.rb test_document.txt
Aatcehtd is my réumsé.
$ ruby -Ku Gordon\ Thiesfeld/scramble.rb test_document.txt
Athetcad is my rmuésé.
$ ruby -Ku Gordon\ Thiesfeld/scramble.rb test_document.txt
Atcthead is my rmséué.
We probably can't fix all the solutions like this though. It depends on how
they separated the word into letters.
The downside of going multi-byte is that it gets harder to recognize just the
word characters, without the digits and underscores. Filtering out punctuation
is a lot harder when we expand to such a vast definition of characters. I'm not
aware of a good Ruby solution for that issue yet. (Please enlighten me if you
are!)
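For what it's worth, Ruby 1.9 and later accept Unicode property classes such as
\p{L} (any letter) in regular expressions, which offers one possible way around
this. The following is a rough sketch under that assumption, not something from
the quiz discussion; the constant name, method name, and sample text are
invented for the example.

```ruby
# Runs of four or more letters; digits, underscores, and punctuation never
# match \p{L}, so they are left exactly where they are.
LETTER_RUN = /\p{L}{4,}/

def munge_letters(text)
  text.gsub(LETTER_RUN) do |word|
    first, *middle, last = word.chars
    "#{first}#{middle.shuffle.join}#{last}"
  end
end

puts munge_letters("Attached is my résumé, version_2 of 42.")
```

Shorter runs are skipped because a word needs at least two middle letters
before scrambling can change anything.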
My thanks to Matthew for another great quiz and to all who gave it a shot.
Tomorrow we will build a simple tool for those of you showing off your code in
an IRC channel...