RMail and RFC-2047

I was playing around with the RMail package and I was missing RFC-2047
support. I found the “module Rfc2047” in <20031204151316.GC849@jupp%gmx.de>
but noticed the following:

In the regex to discover encoded words:

WORD = %r{=\?([!#$%&'*+-/0-9A-Z\\^\`a-z{|}~]+)\?([BbQq])\?([!->@-~]+)\?=} # :nodoc:

I had to change % to \% to run. Maybe it’s just Cygwin.

The second thing is that the module doesn’t correctly interpret the
“encoded-word - linear white space - encoded word” sequence, where
all the white space should be deleted.

So I added a regex to delete this whitespace before further processing:

module Rfc2047

WORD = %r{=\?([!#$\%&'*+-/0-9A-Z\\^\`a-z{|}~]+)\?([BbQq])\?([!->@-~]+)\?=} # :nodoc:

WORDSEQ = %r{(=\?[!#$\%&'*+-/0-9A-Z\\^\`a-z{|}~]+\?[BbQq]\?[!->@-~]+\?=)\s*(=\?[!#$\%&'*+-/0-9A-Z\\^\`a-z{|}~]+\?[BbQq]\?[!->@-~]+\?=)}

[Comment skipped]

def Rfc2047.decode_to(target, from)

from.gsub!(WORDSEQ, '\1\2')
out = from.gsub(WORD) do |word|
  charset, encoding, text = $1, $2, $3

It works so far, but I wonder whether ‘\s*’ is the correct expression
and whether there is a more efficient way to do this.
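As a rough illustration of the `\s*` question, the sketch below runs a `\s*` pair pattern and a `\s+` pair pattern over a few sample headers. WORD here is a deliberately simplified stand-in for the pattern above (an assumption, to keep the example readable), and the names STAR, PLUS and `samples` are made up for this example. On these particular inputs the two patterns agree; that is not a claim about every possible input.

```ruby
# Simplified stand-in for the thread's WORD pattern (assumption: the real
# character classes are stricter than "anything but '?' and whitespace").
WORD = /=\?[^?\s]+\?[BbQq]\?[^?\s]+\?=/

# Pair patterns in the style of the post, differing only in \s* versus \s+.
STAR = /(#{WORD.source})\s*(#{WORD.source})/
PLUS = /(#{WORD.source})\s+(#{WORD.source})/

samples = [
  "=?ISO-8859-1?Q?a?= =?ISO-8859-1?Q?b?=",      # two encoded words, one space
  "=?ISO-8859-1?Q?a?=\r\n  =?ISO-8859-1?Q?b?=", # the same pair across a folded line
  "=?ISO-8859-1?Q?a?= b"                        # encoded word followed by plain text
]

# Each entry holds [result with \s*, result with \s+] for one sample.
results = samples.map do |s|
  [s.gsub(STAR, '\1\2'), s.gsub(PLUS, '\1\2')]
end

samples.zip(results).each do |s, (star, plus)|
  puts "#{s.inspect} -> #{star.inspect} (\\s+ gives the same: #{star == plus})"
end
```

The third sample is left alone by both patterns, which is what you want: only the white space between two encoded words is dropped, not the space before ordinary text.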

I also observed that decoding of non-Western character sets (Win-1251
to Big5) to UTF-8 didn’t work. Does anybody already suspect why or do I
have to track down the error further?

Oliver Cromm

Quoting c1205@er.uqam.ca, on Fri, May 28, 2004 at 10:01:43PM +0900:

> I was playing around with the RMail package and I was missing RFC-2047
> support. I found the "module Rfc2047" in
> <20031204151316.GC849@jupp%gmx.de>

Probably the one I wrote.

> but noticed the following:
>
> In the regex to discover encoded words:
>
> WORD = %r{=\?([!#$%&'*+-/0-9A-Z\\^\`a-z{|}~]+)\?([BbQq])\?([!->@-~]+)\?=} # :nodoc:
>
> I had to change % to \% to run. Maybe it's just Cygwin.

Looks like you are using ruby1.8. There are lots of warnings, too. I'll
fix it sometime, or you can send me a patch? :-)

> The second thing is that the module doesn't correctly interpret the
> "encoded-word - linear white space - encoded word" sequence, where
> all the white space should be deleted.
>
> So I added a regex to delete this whitespace before further processing:
>
> module Rfc2047
>
> WORD = %r{=\?([!#$\%&'*+-/0-9A-Z\\^\`a-z{|}~]+)\?([BbQq])\?([!->@-~]+)\?=} # :nodoc:
>
> WORDSEQ = %r{(=\?[!#$\%&'*+-/0-9A-Z\\^\`a-z{|}~]+\?[BbQq]\?[!->@-~]+\?=)\s*(=\?[!#$\%&'*+-/0-9A-Z\\^\`a-z{|}~]+\?[BbQq]\?[!->@-~]+\?=)}

Two comments:

1 - I don't think this will work. It will fix:

  encoded-word - linear white space - encoded word

but not:

   encoded-word - linear white space - encoded word - linear white space -
   encoded word

I.e., it only handles pairs, so I don't think it does what you want.

2 - it will trash your input argument (gsub! modifies the caller's
string in place), which is fairly undesirable.

I think you could do the match with a regex operator that matches a
WORD but doesn't consume it (a lookahead). See below. I don't have time
to test it thoroughly, just one test case, but maybe it will work for
you. If it does, I think I could rewrite the regex to do this in a
single sweep, though I don't see efficiency as a concern: we're talking
mail headers here, and they aren't that big!
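To make the two comments concrete, here is a small sketch comparing a pairwise pattern with one that uses a lookahead. WORD is again a simplified stand-in for the real pattern (an assumption for readability), and PAIRWISE, LOOKAHEAD and `header` are names invented for this example. It also uses gsub rather than gsub!, so the input string is left untouched.

```ruby
# Simplified stand-in for the thread's WORD pattern (assumption: the real
# character classes are stricter).
WORD = /=\?[^?\s]+\?[BbQq]\?[^?\s]+\?=/

# Pairwise: each match consumes both words, so the second word of one pair
# can never be the first word of the next pair.
PAIRWISE = /(#{WORD.source})\s+(#{WORD.source})/

# Lookahead: the following word is only peeked at with (?=...), so the next
# match can start on it.
LOOKAHEAD = /(#{WORD.source})\s+(?=#{WORD.source})/

header = "=?US-ASCII?Q?a?= =?US-ASCII?Q?b?= =?US-ASCII?Q?c?="

pairwise  = header.gsub(PAIRWISE, '\1\2') # gap before the third word survives
lookahead = header.gsub(LOOKAHEAD, '\1')  # all gaps between words are removed

puts pairwise
puts lookahead
puts header # unchanged, because gsub returns a new string
```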

> I also observed that decoding of non-Western character sets (Win-1251
> to Big5) to UTF-8 didn't work. Does anybody already suspect why or do
> I have to track down the error further?

Which version of rfc2047.rb do you have? I'm at 1.4, and it has a fix
for this, I believe, see below.

Sam

# $Id: rfc2047.rb,v 1.4 2003/04/18 20:55:56 sam Exp $

#
# An implementation of RFC 2047 decoding.
#
# This module depends on the iconv library by Nobuyoshi Nakada, which I've
# heard may be distributed as a standard part of Ruby 1.8. Many thanks to him
# for helping with building and using iconv.
#
# Thanks to "Josef 'Jupp' Schugt" <jupp@gmx.de> for pointing out an error with
# stateful character sets.
#
# Copyright (c) Sam Roberts <sroberts@uniserve.com> 2004
#
# This file is distributed under the same terms as Ruby.

require 'iconv'

module Rfc2047

  WORD = %r{=\?([!#$%&'*+-/0-9A-Z\\^\`a-z{|}~]+)\?([BbQq])\?([!->@-~]+)\?=} # :nodoc:
  WORDSEQ = %r{(#{WORD.source})\s+(?=#{WORD.source})}

  # Decodes a string, +from+, containing RFC 2047 encoded words into a target
  # character set, +target+. See iconv_open(3) for information on the
  # supported target encodings. If one of the encoded words cannot be
  # converted to the target encoding, it is left in its encoded form.
  def Rfc2047.decode_to(target, from)
    from = from.gsub(WORDSEQ, '\1')
    out = from.gsub(WORD) do |word|
      charset, encoding, text = $1, $2, $3
      
      # B64 or QP decode, as necessary:
      case encoding
        when 'b', 'B'
          #puts text
          text = text.unpack('m*')[0]
          #puts text.dump

        when 'q', 'Q'
          # RFC 2047 has a variant of quoted printable where a ' ' character
          # can be represented as an '_', rather than =20, so convert
          # any of these that we find before doing the QP decoding.
          text = text.tr("_", " ")
          text = text.unpack('M*')[0]

        # Don't need an else, because no other values can be matched in a
        # WORD.
      end

      # Convert:
      #
      # Remember - Iconv.open(to, from)!
      begin
        text = Iconv.iconv(target, charset, text).join
        #puts text.dump
      rescue Errno::EINVAL, Iconv::IllegalSequence
        # Replace with the entire matched encoded word, a NOOP.
        text = word
      end
    end
  end
end
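
For readers on a Ruby where the iconv library is no longer bundled (it was removed from the standard library in Ruby 2.0), the same idea can be sketched with String#force_encoding and String#encode. This is not the module above: the name Rfc2047Sketch is hypothetical, WORD is a simplified stand-in for the real pattern, and charset names are assumed to be ones Ruby's Encoding already knows.

```ruby
# Hypothetical module name, chosen so it does not clash with Rfc2047 above.
module Rfc2047Sketch
  # Simplified stand-in for WORD (assumption: the real character classes
  # are stricter than "anything but '?' and whitespace").
  WORD = /=\?([^?\s]+)\?([BbQq])\?([^?\s]+)\?=/
  WORDSEQ = /(#{WORD.source})\s+(?=#{WORD.source})/

  # Decodes the encoded words in +from+ into the +target+ encoding. A word
  # that cannot be converted is left in its encoded form.
  def self.decode_to(target, from)
    from.gsub(WORDSEQ, '\1').gsub(WORD) do |word|
      charset, encoding, text = $1, $2, $3

      bytes =
        if encoding.downcase == 'b'
          text.unpack('m').first              # base64
        else
          text.tr('_', ' ').unpack('M').first # Q: '_' stands for a space
        end

      begin
        bytes.force_encoding(charset).encode(target)
      rescue ArgumentError, EncodingError
        # Unknown charset name or unconvertible bytes: keep the word as is.
        word
      end
    end
  end
end

puts Rfc2047Sketch.decode_to('UTF-8', '=?ISO-8859-1?Q?Andr=E9?= Pirard')
puts Rfc2047Sketch.decode_to('UTF-8',
  '=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?= ' \
  '=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=')
```

The two sample headers are taken from the examples in RFC 2047 itself. Whether this handles the Win-1251 and Big5 cases from the thread depends on the charset labels in the actual mail, which the sketch does not try to normalize.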