[SUMMARY] Text Munger (#76)

This is not an overly difficult problem. Here's a small and easy-to-follow
solution by Gordon Thiesfeld:

  class String
  
    def munge
      split(/\b/).munge_each.join
    end
  
  end
  
  class Array
  
    def munge_each
      map { |word| word.split(//).munge_word }
    end
  
    def munge_word
      first,last,middle = shift, pop,scramble
      "#{first}#{middle}#{last}"
    end
  
    def scramble
      sort_by{rand}
    end
  
  end
  
  if __FILE__ == $PROGRAM_NAME
  
    begin
      puts File.open(ARGV[0], 'r').read.munge
    rescue
      puts "Usage: text_munge.rb file"
    end
  
  end

The flow here is simple: bust up the document into words, munge all words, and
stitch it back together. Munging a word is just separating it into characters
and rearranging everything but the first and last character.

Probably the trickiest line in the whole deal is the first and only line in
munge(). It breaks the document on word boundaries, which fall at every place a
word begins or ends. Thus, given the sentence:

  Here is a simple sentence, for testin' scripts.

Gordon's code will break the document into this Array:

  [ "Here", " ", "is", " ", "a", " ", "simple", " ", "sentence", ", ",
    "for", " ", "testin", "' ", "scripts", ".\n" ]
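To check that breakdown yourself, here is a minimal sketch that runs the same split on the sample sentence. The variable names are mine, not Gordon's:

```ruby
# Split the sample sentence on word boundaries, as munge() does.
sentence = "Here is a simple sentence, for testin' scripts.\n"
tokens   = sentence.split(/\b/)

p tokens

# Nothing is lost in the split: joining the pieces restores the input,
# which is why munge() can simply join() after scrambling.
puts tokens.join == sentence
```

Note that the whitespace and punctuation between words come back as Array elements of their own.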

It's important to remember that this is the Regular Expression definition of
"words", including digit characters and the underscore. That's not a perfect
match for the quiz task, but was a popular choice nonetheless.

Now, I did say *all* words are scrambled and that is what I meant. A run of
four or more punctuation characters is a word, and the middle punctuation would
be scrambled. In practice, this is rare enough to be a minor issue.
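As a rough illustration, the following sketch lists the non-word pieces of a line that are long enough to be visibly scrambled. The sample line and the name at_risk are assumptions for this example only:

```ruby
# Find the pieces between words that have at least two "middle" characters.
line   = %q{"Are you kiddin'?!" he exclaimed}
tokens = line.split(/\b/)

# A token with no word characters and a length of four or more has a
# middle section of two or more characters that the shuffle can reorder.
at_risk = tokens.select { |token| token !~ /\w/ && token.length >= 4 }

p at_risk
```

Here the only candidate is the run of punctuation plus the trailing space after "kiddin".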

I made a bit of a fuss about multi-byte characters during the discussion, which
some people did try to satisfy. It's only fair I add detail here.

There are many multi-byte character encodings, but I will focus on just UTF-8,
because I am way out of my league with anything else. If you are
unfamiliar with Unicode encodings, this article is a pretty good general
introduction:

  http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html

The Ruby specifics are harder to come by, sadly.

Basically, Ruby's Unicode support (UTF-8 only) comes through regular
expressions, either in matches or in methods like split(). They can be made
character aware (instead of byte aware) by properly setting $KCODE. Here's an
example:

  $ cat byte_string.rb
  #!/usr/local/bin/ruby -w
  
  "résumé".split("").each { |chr| p chr }
  $ ruby byte_string.rb
  "r"
  "\303"
  "\251"
  "s"
  "u"
  "m"
  "\303"
  "\251"
  $ cat utf8_string.rb
  #!/usr/local/bin/ruby -w
  
  $KCODE = "UTF8"
  
  "résumé".split("").each { |chr| p chr }
  $ ruby utf8_string.rb
  "r"
  "é"
  "s"
  "u"
  "m"
  "é"

Notice that when I didn't set $KCODE, each two-byte é is split into separate
bytes. However, when I tell Ruby to be Unicode aware, the bytes stay together.
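For readers on Ruby 1.9 or later, where $KCODE no longer has any effect and each String carries its own encoding, the same byte versus character distinction can be sketched like this. This is an illustration under that assumption, not part of the original scripts:

```ruby
# "résumé" written with escapes so the precomposed é (U+00E9) is explicit.
word = "r\u00E9sum\u00E9"

# The byte view matches the first run above: each é is the pair 303, 251.
octal_bytes = word.bytes.map { |byte| format("%o", byte) }

# The character view matches the second run: each é is one element.
chars = word.chars

p octal_bytes
p chars
```

The byte view has eight entries and the character view has six, just like the two runs above.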

That should be enough background to tell the solutions that can handle
multi-byte characters from the ones that can't, giving you more examples to look
at. Here's a multi-byte aware solution from Ross Bamford (-Ku is a shortcut for
$KCODE = "UTF8"):

  #!/usr/local/bin/ruby -Ku
  $stdout << ARGF.read.gsub(/\B((?![\d_])\w{2,})\B/) do |w|
    $&.split(//).sort_by { rand }
  end

That's mainly just a more compact version of Gordon's script. This time though,
we are interested in the results of running it. Watch the é hop around as I
run it a few times:

  $ ruby Ross\ Bamford/scramble.rb test_document.txt
  Actheatd is my rsuémé.
  $ ruby Ross\ Bamford/scramble.rb test_document.txt
  Aaectthd is my rmséué.
  $ ruby Ross\ Bamford/scramble.rb test_document.txt
  Aatcethd is my rémsué.

Gordon's solution is not multi-byte aware out of the box. Watch how things
change with that:

  $ ruby Gordon\ Thiesfeld/scramble.rb test_document.txt
  Achttead is my résumé.
  $ ruby Gordon\ Thiesfeld/scramble.rb test_document.txt
  Aehttacd is my résumé.
  $ ruby Gordon\ Thiesfeld/scramble.rb test_document.txt
  Aheatctd is my résum?.?

In order to make sense of that, you need to see how the code found the words in
that line:

  ["Attached", " ", "is", " ", "my", " ", "r", "\303\251", "sum", "\303\251.\n"]

See how the last é is lumped in with the end punctuation? That makes the group
four bytes long, which is enough to scramble. When the shuffle separates the two
bytes of the é, they become junk characters my terminal doesn't know how to
display.
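To make the damage concrete, here is a small sketch of one unlucky shuffle of that last group, done byte by byte. It assumes Ruby 1.9 or later for valid_encoding? and force_encoding, and the fixed swap stands in for the random shuffle:

```ruby
# The last group from the Array above: é, a period, and a newline.
chunk = "\u00E9.\n"
bytes = chunk.bytes              # [0xC3, 0xA9, 0x2E, 0x0A]

# Keep the first and last bytes in place and swap the two middle bytes,
# which is one of the results a byte-level shuffle can produce.
swapped   = [bytes[0], bytes[2], bytes[1], bytes[3]]
scrambled = swapped.pack("C*").force_encoding("UTF-8")

p chunk.valid_encoding?          # the untouched group is valid UTF-8
p scrambled.valid_encoding?      # the swapped group is not
```

With the period wedged between the two halves of the é, neither half is a legal UTF-8 sequence on its own, so the terminal has nothing sensible to draw.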

The good news is, we can magically fix Gordon's script:

  $ ruby -Ku Gordon\ Thiesfeld/scramble.rb test_document.txt
  Aatcehtd is my réumsé.
  $ ruby -Ku Gordon\ Thiesfeld/scramble.rb test_document.txt
  Athetcad is my rmuésé.
  $ ruby -Ku Gordon\ Thiesfeld/scramble.rb test_document.txt
  Atcthead is my rmséué.

We probably can't fix all the solutions like this though. It depends on how
they separated the word into letters.

The downside of working with characters instead of bytes is that it gets harder
to recognize word characters without also accepting digits and underscores.
Filtering out punctuation is a lot harder
when we expand to such a vast definition of characters. I'm not aware of a
good Ruby solution for that issue yet. (Please enlighten me if you are!)
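One possible approach, available only in later Ruby versions (1.9 and up) and so not an option for the solutions discussed here, is the Unicode letter property \p{L}. The sketch below is an illustration under that assumption. The sample text and variable names are mine:

```ruby
# Sample text with accented letters, digits, an underscore, and punctuation.
text = "My r\u00E9sum\u00E9_42 is na\u00EFve, isn't it?"

# \p{L} matches letters only, so digits, underscores, and punctuation
# never end up inside a "word".
letters_only = text.scan(/\p{L}+/)

# Scramble the middle of every run of four or more letters.
munged = text.gsub(/\p{L}{4,}/) do |word|
  word[0] + word[1..-2].chars.shuffle.join + word[-1]
end

p letters_only
puts munged
```

Only "résumé" and "naïve" are long enough to change. The "_42" and all punctuation stay exactly where they were.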

My thanks to Matthew for another great quiz and to all who gave it a shot.

Tomorrow we will build a simple tool for those of you showing off your code in
an IRC channel...

Hi --

On Thu, 27 Apr 2006, Ruby Quiz wrote:

> It's important to remember that this is the Regular Expression definition of
> "words", including digit characters and the underscore. That's not a perfect
> match for the quiz task, but was a popular choice nonetheless.
>
> Now, I did say *all* words are scrambled and that is what I meant. A run of
> four or more punctuation characters is a word, and the middle punctuation would
> be scrambled. In practice, this is rare enough to be a minor issue.

"Are you kiddin'?!" he exclaimed :-)

I thought some of the solutions addressed these problems, didn't they,
with [^\W\d_] and such?

David

--
David A. Black (dblack@wobblini.net)
Ruby Power and Light, LLC (http://www.rubypowerandlight.com)

"Ruby for Rails" PDF now on sale! Ruby for Rails
Paper version coming in early May!

On Apr 27, 2006, at 7:41 AM, dblack@wobblini.net wrote:

> "Are you kiddin'?!" he exclaimed :-)
>
> I thought some of the solutions addressed these problems, didn't they,
> with [^\W\d_] and such?

I meant that scrambling long runs of punctuation didn't really seem to be a problem in actual usage.

Yes, some did correctly find the right characters to scramble. Worse, I seem to have completely overlooked this gem I was just informed about off-list:

On Apr 27, 2006, at 7:36 AM, Stefano Taschini wrote:

> On 27/04/06, Ruby Quiz <james@grayproductions.net> wrote:
>
>> Filtering out punctuation is a lot harder when we expand to such a vast
>> definition of characters. I'm not aware of a good Ruby solution for that
>> issue yet. (Please enlighten me if you are!)
>
> Actually, my solution [1] does exactly that, with the regexp
> String::Letter acting on a string that has been put into D-normal
> form.
>
> Ciao
> Stefano
>
> [1] http://www.ruby-talk.org/cgi-bin/scat.rb/ruby/ruby-talk/189926

My apologies to those who didn't receive the proper credit. :(

James Edward Gray II

Hi --

On Thu, 27 Apr 2006, James Edward Gray II wrote:

> I meant that scrambling long runs of punctuation didn't really seem to be a
> problem in actual usage.

I guess "Are you kiddin'?!" is a bit of a stretch -- though possible --
but consider:

   "I'm just not sure...." he said.

David


You bring up some good points.

James Edward Gray II


This is a jewel, James, but just tell me: what will they become then?

Robert


--
Two things are infinite: the universe and human stupidity; as far as the
universe is concerned, I have not acquired absolute certainty.

- Albert Einstein

If I understood the question correctly, David is trying to show that it is not too rare to have lengthy punctuation, which will be seen as words to scramble:

   "Are you kiddin<<<'?!" >>>he said.
   "I'm just not sure<<<..." >>>he said.

The first and last characters of those would be anchored, but the middle punctuation might move:

   "Are you kiddin<<<'"!? >>>he said.
   "I'm just not sure<<<.".. >>>he said.

Hope that makes sense.

James Edward Gray II


My apologies to James and the list,

I was just referring to what seemed a funny pun to me:

because there was

"I'm just not sure...." he said.

  you brought up some good points

I was concerned about the "...." p o i n t s
and bringing them up would give us, well I dunno, maybe "!!!!!" ?
Oh boy, I thought that was funny, but it seems I was the only one :(

Sorry for the noise
Robert
