[SOLUTION] Quoted Printable (#23)

I am a ruby newbie, so be kind. I wrote the code myself, but blatantly
stole Dave Burt's test cases - thank you. I also found one test case
that breaks my code (and Dave's) that I am not sure what the correct
answer is, but I know mine is wrong:

Consider:
"===
                 \n"
which will cause a new space to be found at the end of a string - is
it the case that all space at the end of the line is encoded
(increasing size rather needlessly), but simplifying this case? Either
way, I am too tired and have other important stuff to do so I will let
it go.

Please feel free to let me know where I did not do things the "Ruby
way" as I am primarily a C++ and Perl guy, but very interested in
getting better at Ruby.

Thanks
pth


#
# == Synopsis
#
# Ruby Quiz #23
#
# The quoted printable encoding is used primarily in email, though it has
# recently seen some use in XML areas as well. The encoding is simple to
# translate to and from.
#
# This week's quiz is to build a filter that handles quoted printable
# translation.
#
# Your script should be a standard Unix filter, reading from files listed on
# the command-line or STDIN and writing to STDOUT. In normal operation, the
# script should encode all text read in the quoted printable format. However,
# your script should also support a -d command-line option and when present,
# text should be decoded from quoted printable instead. Finally, your script
# should understand a -x command-line option and when given, it should encode
# <, > and & for use with XML.
#
# == Usage
#
# ruby quiz23.rb [-d | --decode ] [ -x | --xml ]
#
# == Author
# Patrick Hurley, Cornell-Mayo Assoc
#
# == Copyright
# Copyright (c) 2005 Cornell-Mayo Assoc
# Licensed under the same terms as Ruby.
#

require 'optparse'
require 'rdoc/usage'

module QuotedPrintable
  MAX_LINE_PRINTABLE_ENCODE_LENGTH = 76

  def from_qp
    result = self.gsub(/=\r\n/, "")
    result.gsub!(/\r\n/m, $/)
    result.gsub!(/=([\dA-F]{2})/) { $1.hex.chr }
    result
  end

  def to_qp(handle_xml = false)
    char_mask = if (handle_xml)
                  /[^!-%,-;=?-~\s]/
                else
                  /[^!-<>-~\s]/
                end

    # encode the non-space characters
    result = self.gsub(char_mask) { |ch| "=%02X" % ch[0] }
    # encode the last space character at end of line
    result.gsub!(/(\s)(?=#{$/})/o) { |ch| "=%02X" % ch[0] }

    lines = result.scan(/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/);
    lines.join("=\n").gsub(/#{$/}/m, "\r\n")
  end

  def QuotedPrintable.encode
    STDOUT.binmode
    while (line = gets) do
      print line.to_qp
    end
  end

  def QuotedPrintable.decode
    STDIN.binmode
    while (line = gets) do
      # I am a ruby newbie, and I could
      # not get gets to get the \r\n pairs
      # no matter how I set $/ - any pointers?
      line = line.chomp + "\r\n"
      print line.from_qp
    end
  end

end

class String
  include QuotedPrintable
end

if __FILE__ == $0

  opts = OptionParser.new
  opts.on("-h", "--help") { RDoc::usage; }
  opts.on("-d", "--decode") { $decode = true }
  opts.on("-x", "--xml") { $handle_xml = true }

  opts.parse!(ARGV) rescue RDoc::usage('usage')

  if ($decode)
    QuotedPrintable.decode()
  else
    QuotedPrintable.encode()
  end
end

"Patrick Hurley" <phurley@gmail.com> submitted:

> I am a ruby newbie, so be kind. I wrote the code myself, but blatantly
> stole Dave Burt's test cases - thank you. I also found one test case

Quiz tests are for sharing - I think that's established. In any case, you're
welcome to them.

> that breaks my code (and Dave's) that I am not sure what the correct
> answer is, but I know mine is wrong:
>
> Consider:
> "===
>                 \n"
> which will cause a new space to be found at the end of a string - is
> it the case that all space at the end of the line is encoded
> (increasing size rather needlessly), but simplifying this case? Either
> way, I am too tired and have other important stuff to do so I will let
> it go.

I see no problem. I've added that test case, and both our solutions
pass.

http://www.dave.burt.id.au/ruby/test-quoted-printable.rb

> Please feel free to let me know where I did not do things the "Ruby
> way" as I am primarily a C++ and Perl guy, but very interested in
> getting better at Ruby.
> ...
>                  /[^!-<>-~\s]/

Bug: "\f" doesn't get escaped (it's part of /\s/). Probably "\r" as well;
that's harder to test on windows.
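
A quick way to see this, as a minimal sketch: `OLD_MASK` is the character class from the first post, while `FIXED_MASK` is a hypothetical ASCII-only alternative (not Patrick's actual fix) that lists tab, newline and space explicitly instead of relying on `\s`. The `escaped?` helper is made up for illustration.

```ruby
# Character class from the original to_qp: whatever it matches gets escaped.
OLD_MASK = /[^!-<>-~\s]/

# Hypothetical alternative: only tab, newline, space and printable ASCII
# other than "=" are left as literals.
FIXED_MASK = /[^\t\n !-<>-~]/

# Returns true when the given mask would escape the character.
def escaped?(mask, ch)
  !(ch =~ mask).nil?
end

["\f", "\r", "\v", "=", "a"].each do |ch|
  puts format("%-5s old=%-5s fixed=%s",
              ch.inspect, escaped?(OLD_MASK, ch), escaped?(FIXED_MASK, ch))
end
```

Because `\s` in a Ruby regexp covers form feed, carriage return and vertical tab as well as space, tab and newline, the old mask lets all of them through unescaped.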

I see no other problems. Your optparse is better (i.e. shorter) than mine
:). Your
(/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/)
makes you look like a Perl 5 junkie, though. Also, you use global
variables - we rubyists shun these: use locals.
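
One way to drop the globals, as a rough sketch: collect the flags in a local hash and hand it back to the caller. The `parse_options` helper and its key names are made up for illustration and are not part of anyone's posted solution.

```ruby
require "optparse"

# Parses the quiz's two switches into a local hash instead of $globals.
def parse_options(argv)
  options = { decode: false, xml: false }
  OptionParser.new do |opts|
    opts.on("-d", "--decode") { options[:decode] = true }
    opts.on("-x", "--xml")    { options[:xml] = true }
  end.parse!(argv)
  options
end

p parse_options(["-d"])
```

The blocks passed to `opts.on` are closures, so they can update the local `options` hash directly.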

Cheers,
Dave

(from Patrick's solution--for those who missed it)

     while (line = gets) do
       # I am a ruby newbie, and I could
       # not get gets to get the \r\n pairs
       # no matter how I set $/ - any pointers?
       ...

James Edward Gray II

Hi Florian,

As always, I'm amazed by your concise code. But your solution seems to be
failing a bunch of my tests (and not just by chopping lines early, which is
allowed):

encoding:
- escapes mid-line whitespace
- escapes '~'
- allows too-long lines (my tests saw up to 104 characters on a line)
- allows unescaped whitespace on the end of a line (as long as it's preceded
by escaped whitespace)
decoding:
- doesn't ignore trailing literal whitespace

Cheers,
Dave


"Florian Gross" <flgr@ccan.de> wrote:

Matthew Moss wrote:

Here is my partial solution for the Quoted Printable quiz. I'm still
pretty new to Ruby, so it took me a while to get what you see here. I
think the only thing I didn't get to adding was line length checks.

And here's mine as well. Sorry for being late -- I coded this up on
Friday and forgot about it until today.

It ought to handle everything correctly (including proper wrapping of
lines that end in encoded characters) and it does most of the work with
a few simple regular expressions.

Thanks for the kind response.

When I said the test case failed, I meant that the actual output from
encoding the line has trailing space at the end of a line. We both
escape trailing spaces before we break lines - if the line breaking
then leaves a space at the end of a line, is that not an issue? (The
continuation = might mean that it is not.)

Yup, there was an issue with masks. I fixed that and removed the globals
(my Perl just throwing in a $ when in doubt :) ). There was also a bug
in the command line driver, which I have fixed. The patched code
follows.

(/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/)
makes you look like a Perl 5 junkie,

I did this to allow the use of a gsub, which is much faster than the
looping solution. The lookaheads and general ugliness handle the
special cases. I probably should use /x and space it out and comment,
but when I am in the regexp zone, I know what I am typing <grin>.


#
# == Synopsis
#
# Ruby Quiz #23
#
# The quoted printable encoding is used primarily in email, though it has
# recently seen some use in XML areas as well. The encoding is simple to
# translate to and from.
#
# This week's quiz is to build a filter that handles quoted printable
# translation.
#
# Your script should be a standard Unix filter, reading from files listed on
# the command-line or STDIN and writing to STDOUT. In normal operation, the
# script should encode all text read in the quoted printable format. However,
# your script should also support a -d command-line option and when present,
# text should be decoded from quoted printable instead. Finally, your script
# should understand a -x command-line option and when given, it should encode
# <, > and & for use with XML.
#
# == Usage
#
# ruby quiz23.rb [-d | --decode ] [ -x | --xml ]
#
# == Author
# Patrick Hurley, Cornell-Mayo Assoc
#
# == Copyright
# Copyright (c) 2005 Cornell-Mayo Assoc
# Licensed under the same terms as Ruby.
#

require 'optparse'
require 'rdoc/usage'

module QuotedPrintable
  MAX_LINE_PRINTABLE_ENCODE_LENGTH = 76

  def from_qp
    result = self.gsub(/=\r\n/, "")
    result.gsub!(/\r\n/m, $/)
    result.gsub!(/=([\dA-F]{2})/) { $1.hex.chr }
    result
  end

  def to_qp(handle_xml = false)
    char_mask = if (handle_xml)
                  /[\x00-\x08\x0b-\x1f\x7f-\xff=<>&]/
                else
                  /[\x00-\x08\x0b-\x1f\x7f-\xff=]/
                end

    # encode the non-space characters
    result = self.gsub(char_mask) { |ch| "=%02X" % ch[0] }
    # encode the last space character at end of line
    result.gsub!(/(\s)(?=#{$/})/o) { |ch| "=%02X" % ch[0] }

    lines = result.scan(/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/);
    lines.join("=\n").gsub(/#{$/}/m, "\r\n")
  end

  def QuotedPrintable.encode(handle_xml=false)
    STDOUT.binmode
    while (line = gets) do
      print line.to_qp(handle_xml)
    end
  end

  def QuotedPrintable.decode
    STDIN.binmode
    while (line = gets) do
      # I am a ruby newbie, and I could
      # not get gets to get the \r\n pairs
      # no matter how I set $/ - any pointers?
      line = line.chomp + "\r\n"
      print line.from_qp
    end
  end

end

class String
  include QuotedPrintable
end

if __FILE__ == $0

  decode = false
  handle_xml = false
  opts = OptionParser.new
  opts.on("-h", "--help") { RDoc::usage; }
  opts.on("-d", "--decode") { decode = true }
  opts.on("-x", "--xml") { handle_xml = true }

  opts.parse!(ARGV) rescue RDoc::usage('usage')

  if (decode)
    QuotedPrintable.decode()
  else
    QuotedPrintable.encode(handle_xml)
  end
end

Dave Burt wrote:

> Hi Florian,

Moin Dave.

> As always, I'm amazed by your concise code. But your solution seems to be failing a bunch of my tests (and not just by chopping lines early, which is allowed):

Thanks, I'll have a look.

> encoding:
> - escapes mid-line whitespace

I'm not sure I get this. Am I incorrectly escaping mid-line whitespace or am I incorrectly not escaping it? And what is mid-line whitespace?

> - escapes '~'

Heh, classic off-by-one. Easily fixed by changing the Regexp. See source below.

> - allows too-long lines (my tests saw up to 104 characters on a line)

Any hints on when this is happening? I can't see why and when this would happen.

> - allows unescaped whitespace on the end of a line (as long as it's preceded by escaped whitespace)

Fixed. See code below.

> decoding:
> - doesn't ignore trailing literal whitespace

Well, I don't think that's much of an issue as I'm not sure when trailing whitespace would be prepended to lines, but I've fixed it anyway.

Here's the new code:

  def encode(text, also_encode = "")
    text.gsub(/[\t ](?:[\v\t ]|$)|[=\x00-\x08\x0B-\x1F\x7F-\xFF#{also_encode}]/) do |char|
      char[0 ... -1] + "=%02X" % char[-1]
    end.gsub(/^(.{75})(.{2,})$/) do |match|
      base, continuation = $1, $2
      continuation = base.slice!(/=(.{0,2})\Z/).to_s + continuation
      base + "=\n" + continuation
    end.gsub("\n", "\r\n")
  end

  def decode(text, allow_lowercase = false)
    encoded_re = Regexp.new("=([0-9A-F]{2})", allow_lowercase ? "i" : "")
    text.gsub("\r\n", "\n").gsub("=\n", "").gsub(encoded_re) do
      $1.to_i(16).chr
    end
  end

I'll repost the full source when I've sorted out that other problem as well.
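
For cross-checking hand-rolled encoders and decoders like these, Ruby's built-in `Array#pack("M")` and `String#unpack("M")` implement quoted-printable and can serve as a reference point. A small sketch, assuming plain ASCII input; the `qp_roundtrip` helper is hypothetical. The built-in knows nothing about the quiz's -x option, so it is only useful for the plain encoding.

```ruby
# Encodes text with the built-in quoted-printable packer, then decodes it
# again, returning both strings so they can be compared with other solutions.
def qp_roundtrip(text)
  encoded = [text].pack("M")
  decoded = encoded.unpack("M").first
  [encoded, decoded]
end

encoded, decoded = qp_roundtrip("a=b & c\n")
puts encoded
puts decoded == "a=b & c\n"
```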

"Patrick Hurley" <phurley@gmail.com> continued:

Thanks for the kind response.

When I said the test case failed, I meant that the actual output from
encoding the line has trailing space at the end of a line. We both
escape trailing spaces before we break lines - if the line breaking
then leaves a space at the end of a line, is that not an issue? (The
continuation = might mean that it is not.)

From the RFC (2045, section 6.7):
          Any TAB (HT) or SPACE characters
          on an encoded line MUST thus be followed on that line
          by a printable character. In particular, an "=" at the
          end of an encoded line, indicating a soft line break
          (see rule #5) may follow one or more TAB (HT) or SPACE
          characters.

So it's all good - unescaped tabs and spaces are fine as long as it's got a printable non-whitespace character after it, and "=" is fine for that.
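
As a rough sketch of that rule, the check below reports whether any line of an encoded text ends in a literal tab or space. The helper name `no_trailing_whitespace?` is made up for illustration.

```ruby
# Returns true when no encoded line ends in a literal tab or space.
# A soft line break ("=" as the last character) counts as printable.
def no_trailing_whitespace?(encoded)
  encoded.each_line.none? { |line| line.chomp =~ /[\t ]\z/ }
end

puts no_trailing_whitespace?("=3D=3D=3D     =\r\n   =20\r\n")
puts no_trailing_whitespace?("abc \r\n")
```

The first input has unescaped spaces, but each run is followed by "=" or "=20" on the same line, so it satisfies the rule; the second does not.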

          ... Therefore, when decoding a Quoted-Printable
          body, any trailing white space on a line must be
          deleted, as it will necessarily have been added by
          intermediate transport agents.

There's something I think we've all forgotten to do -- strip trailing unescaped whitespace. I've added the following test:

  def test_decode_strip_trailing_space
    assert_equal(
      "The following whitespace must be ignored: \r\n".from_quoted_printable,
      "The following whitespace must be ignored:\n")
  end

And the following line to decode_string:
      result.gsub!(/[\t ]+(?=\r\n|$)/, '')
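
Put together, a decoder that applies this stripping step first might look like the following sketch. `decode_qp` is a hypothetical standalone helper, not the `decode_string` from my solution, and it assumes uppercase hex escapes.

```ruby
# Decodes quoted-printable text, working on a binary copy of the input.
def decode_qp(text)
  text.b
      .gsub(/[\t ]+(?=\r?\n|\z)/, "")           # drop whitespace added in transit
      .gsub(/=\r?\n/, "")                       # remove soft line breaks
      .gsub(/\r\n/, "\n")                       # normalise hard line breaks
      .gsub(/=([0-9A-F]{2})/) { $1.hex.chr }    # turn =XX back into one byte
end

p decode_qp("The following whitespace must be ignored: \r\n")
p decode_qp("a=3Db=\r\nc=20\r\n")
```

Stripping has to happen before the escapes are decoded, otherwise a legitimate "=20" at the end of a line would be turned into a space and then thrown away.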

Yup, there was an issue with masks. I fixed that and removed the globals
(my Perl just throwing in a $ when in doubt :) ). There was also a bug
in the command line driver, which I have fixed. The patched code
follows.

(/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/)
makes you look like a Perl 5 junkie,

I did this to allow the use of a gsub, which is much faster than the
looping solution. The lookaheads and general ugliness handle the
special cases. I probably should use /x and space it out and comment,
but when I am in the regexp zone, I know what I am typing <grin>.

Write-only? No, I'm not in a fantastic position to comment, mine is not that much shorter.

...
def QuotedPrintable.decode
   STDIN.binmode
   while (line = gets) do
     # I am a ruby newbie, and I could
     # not get gets to get the \r\n pairs
     # no matter how I set $/ - any pointers?

C:\WINDOWS>ruby
STDIN.binmode
gets.each_byte do |b| puts b end
^Z

13
10

Seems to work for me - that output says I wouldn't need the following line

     line = line.chomp + "\r\n"
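
The same check can be scripted without a console, as a sketch: write a CRLF-terminated line to a temporary file and read it back in binary mode. On Windows, text mode translates "\r\n" to "\n" on reading, while binmode leaves the bytes alone; on Unix-like systems there is no translation either way. The `first_line_bytes` helper is hypothetical.

```ruby
require "tempfile"

# Writes data to a temporary file in binary mode and returns the bytes of
# the first line as read back by gets.
def first_line_bytes(data)
  Tempfile.create("qp") do |f|
    f.binmode
    f.write(data)
    f.rewind
    f.gets.bytes
  end
end

p first_line_bytes("\r\n")
```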

Cheers,
Dave

"Florian Gross" <flgr@ccan.de> responded

Dave Burt wrote:

Hi Florian,

Moin Dave.

As always, I'm amazed by your concise code. But your solution seems to be
failing a bunch of my tests (and not just by chopping lines early, which
is allowed):

Thanks, I'll have a look.

encoding:
- escapes mid-line whitespace

I'm not sure I get this. Am I incorrectly escaping mid-line whitespace or
am I incorrectly not escaping it? And what is mid-line whitespace?

Tabs and spaces that are followed by something printable on the same line
should not be escaped; see the following:

  5) Failure:
test_encode_12(TC_QuotedPrintable) [(eval):2]:
<"=3D=3D=3D
=\r\n =20\r\n"> expected but was
<"=3D=3D=3D=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
=\r\n=20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20 =20
=20 =20 =20 =20 =20 =20 =20 =20\r\n">.

- escapes '~'

Heh, classic off-by-one. Easily fixed by changing the Regexp. See source
below.

Too easy :)

- allows too-long lines (my tests saw up to 104 characters on a line)

Any hints on when this is happening? I can't see why and when this would
happen.

test_encode_12 also demonstrates this. I fixed it by changing
/[\t ](?:[\v\t ]|$)../ to /[\t ]$../.
This (obviously) fixes the mid-line whitespace as well.
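
To illustrate the narrower pattern, here is a sketch that escapes only a tab or space sitting directly before a line end (or the end of the string). It assumes "\n" line endings at this stage and uses `ord`, which is a modern-Ruby spelling rather than what the code in this thread used. The helper name is made up.

```ruby
# Escapes a tab or space only when it is the last character on its line.
def escape_trailing_whitespace(text)
  text.gsub(/[\t ]$/) { |ch| format("=%02X", ch.ord) }
end

p escape_trailing_whitespace("a b \nc\t")
p escape_trailing_whitespace("a  \n")
```

When a line ends in several spaces, only the last one is escaped; the others are then followed by a printable "=20" and may stay literal.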

- allows unescaped whitespace on the end of a line (as long as it's
preceded by escaped whitespace)

Fixed. See code below.

decoding:
- doesn't ignore trailing literal whitespace

Well, I don't think that's much of an issue as I'm not sure when trailing
whitespace would be prepended to lines, but I've fixed it anyway.

It's not mentioned in the quiz question, although you can infer that it is
illegal from the quiz question. The idea is that if there is trailing
whitespace, it has been added in transit and should be removed (it's not
actually part of the data that was encoded).

Also, this, on line 10: "char[0 ... -1] + ...", seems redundant - with char
as a one-character match, it's an empty string.
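
A small sketch of why the slice is a no-op for single-character matches but mattered for two-character ones. The `escape_last` helper is hypothetical, and `ord` stands in for the Ruby 1.8 idiom `char[-1]`, which returned an integer at the time.

```ruby
# Keeps everything but the last character and escapes the last one.
def escape_last(match)
  match[0...-1] + format("=%02X", match[-1].ord)
end

p " "[0...-1]          # the prefix of a one-character match
p escape_last(" ")     # one-character match
p escape_last("\t ")   # two-character match, as in the earlier pattern
```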

Here's the new code:

  <snip>

I'll repost the full source when I've sorted out that other problem as
well.

Cheers,
Dave

Thanks for the update on the RFC, guess I should have just read that myself.

Well, I don't want to "litter" the newsgroup, but I hate to have
incorrect code out there with my name on it. So, if you want, follow the
link (http://hurleyhome.com/~patrick/quiz23.rb) to see the fixed code.
Also of note is the now commented (just for Dave) regexp for parsing
long lines, for the curious:

    lines = result.scan(/
      # Match one of the three following cases
      (?:
        # This will match the special case of an escape that would generally
        # have split across line boundaries
        (?: [^\n]{74}(?==[\dA-F]{2}) ) |
        # This will match the case of a line of text that does not need to split
        (?: [^\n]{0,76}(?=\n) ) |
        # This will match the case of a line of text that needs to be
        # split without special adjustment
        (?: [^\n]{1,75}(?!\n{2}) )
      )
      # Match zero or more newlines
      (?-x:#{$/}*)/x);
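
For readers who would like to see the intent without the three-way alternation, here is a different and simpler technique (not the regexp above): take chunks of at most 75 characters, refuse to end a chunk inside an "=XX" escape, and join the chunks with soft line breaks. It assumes a single, already-escaped line without its newline; `soft_wrap` is a hypothetical name.

```ruby
# Splits one escaped line into pieces that fit the 76-character limit.
def soft_wrap(line, limit = 76)
  # Leave room for the trailing "=" that marks a soft line break, and use
  # lookbehinds so a chunk never ends with "=" or "=X" (half an escape).
  chunk = /.{1,#{limit - 1}}(?<!=)(?<!=[0-9A-F])/
  line.scan(chunk).join("=\n")
end

puts soft_wrap("a" + "=3D" * 30)
```

This is more conservative than necessary for the final piece, which could use all 76 characters, but it keeps the pattern short.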

pth


On Wed, 16 Mar 2005 05:40:15 +0900, Dave Burt <dave@burt.id.au> wrote:

"Patrick Hurley" <phurley@gmail.com> continued:
> Thanks for the kind response.
>
> When I said the test case failed, I meant that the actual output from
> encoding the line has trailing space at the end of a line. We both
> escape trailing spaces before we break lines - if the line breaking
> then leaves a space at the end of a line, is that not an issue? (The
> continuation = might mean that it is not.)

From the RFC (2045, section 6.7):
          Any TAB (HT) or SPACE characters
          on an encoded line MUST thus be followed on that line
          by a printable character. In particular, an "=" at the
          end of an encoded line, indicating a soft line break
          (see rule #5) may follow one or more TAB (HT) or SPACE
          characters.

So it's all good - unescaped tabs and spaces are fine as long as it's got a
printable non-whitespace character after it, and "=" is fine for that.

          ... Therefore, when decoding a Quoted-Printable
          body, any trailing white space on a line must be
          deleted, as it will necessarily have been added by
          intermediate transport agents.

There's something I think we've all forgotten to do -- strip trailing unescaped
whitespace. I've added the following test:

  def test_decode_strip_trailing_space
    assert_equal(
      "The following whitespace must be ignored: \r\n".from_quoted_printable,
      "The following whitespace must be ignored:\n")
  end

And the following line to decode_string:
      result.gsub!(/[\t ]+(?=\r\n|$)/, '')

>
> Yup, there was an issue with masks. I fixed that and removed the globals
> (my Perl just throwing in a $ when in doubt :) ). There was also a bug
> in the command line driver, which I have fixed. The patched code
> follows.
>
>> (/(?:(?:[^\n]{74}(?==[\dA-F]{2}))|(?:[^\n]{0,76}(?=\n))|(?:[^\n]{1,75}(?!\n{2})))(?:#{$/}*)/)
>> makes you look like a Perl 5 junkie,
>
> I did this to allow the use of a gsub, which is much faster than the
> looping solution. The lookaheads and general ugliness handle the
> special cases. I probably should use /x and space it out and comment,
> but when I am in the regexp zone, I know what I am typing <grin>.

Write-only? No, I'm not in a fantastic position to comment, mine is not that
much shorter.

> ...
> def QuotedPrintable.decode
> STDIN.binmode
> while (line = gets) do
> # I am a ruby newbie, and I could
> # not get gets to get the \r\n pairs
> # no matter how I set $/ - any pointers?

> C:\WINDOWS>ruby
> STDIN.binmode
> gets.each_byte do |b| puts b end
> ^Z
>
> 13
> 10
>
Seems to work for me - that output says I wouldn't need the following line

> line = line.chomp + "\r\n"

Cheers,
Dave

Dave Burt wrote:

Tabs and spaces that are followed by something printable on the same line should not be escaped; see the following:

Oh, I misread the quiz -- I thought I would have to escape multiple whitespace characters as well which seemed to also make sense. (XML can sometimes collapse multiple whitespace characters into one.)

This in fact simplifies my algorithm. Note that the new code will also escape at the file end in case there is no new line character there. I think that that is actually a good thing.

- allows too-long lines (my tests saw up to 104 characters on a line)

Any hints on when this is happening? I can't see why and when this would happen.

test_encode_12 also demonstrates this. I fixed it by changing /[\t ](?:[\v\t ]|$)../ to /[\t ]$../.
This (obviously) fixes the mid-line whitespace as well.

Still not sure why this was happening. Line breaks should be applied after the escaping has already happened. But I guess it's fixed now anyway.

Also, this, on line 10: "char[0 ... -1] + ...", seems redundant - with char as a one-character match, it's an empty string.

It wasn't when I was escaping all repeated whitespace. It now is.

I'll repost the full source when I've sorted out that other problem as well.

Which is now. See attachment.

I've also decided to do proper escaping of the "[" and "]" characters in the also_encode argument. Note that it's still possible to cause an invalid Regexp by supplying b-a which I think is okay.
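
The idea can be sketched as follows. This is a guess at the approach, not the code from the attachment; `escape_class` is a hypothetical helper that builds only the part of the character class affected by also_encode.

```ruby
# Builds a character class matching "=" plus any extra characters,
# backslash-escaping "[" and "]" so they cannot close or nest the class.
def escape_class(also_encode = "")
  extra = also_encode.gsub(/[\[\]]/) { |c| "\\" + c }
  Regexp.new("[=#{extra}]")
end

p escape_class("<>&[]")

begin
  escape_class("b-a")
rescue RegexpError => e
  puts "invalid class: #{e.message}"
end
```

A reversed range such as b-a still reaches the regexp engine unchanged, which is why it raises instead of being treated as three literal characters.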

quoted_printable.rb (2.41 KB)