Parsing/decoding mail/news headers

Hi…

Just trying to put my Ruby skills to useful work - automatic handling,
sorting, indexing of mail/news archives.

Question 1:

In RMail I found header/address parsing functions (nice stuff!), but I
still haven’t figured out how to decode the header parts which have been
encoded for 7-bit(?) compatibility,

e.g. given

Subject: =?ISO-8859-1?Q?Erg=E4nzung?=

should decode to
Subject: Ergänzung

(that’s a German word)
I guess this has something to do with quoted-printable or MIME. Are
there some lib functions which do this, or how do you do it?

Question 2:
I stumbled upon:

irb(main):001:0> require "time"
=> true
irb(main):002:0> Time.parse ("01 Apr 2003 23:38:05 +0200")
=> Tue Apr 01 23:38:05 CEST 2003
irb(main):003:0> Time.parse ("01 Apr 2033 23:38:05 +0200")
=> Fri Apr 01 23:38:05 CEST 2033
irb(main):004:0> Time.parse ("01 Apr 2043 23:38:05 +0200")
ArgumentError: gmtime error
        from /usr/local/lib/ruby/1.8/time.rb:162:in `utc'
        from /usr/local/lib/ruby/1.8/time.rb:162:in `parse'
        from (irb):4

Large dates are not supported. I guess this has to do with the Unix
32-bit seconds-since-1970 limit, which runs out in January 2038.
Unfortunately some spam has pretty much crap in its headers…
Catching the exception and ignoring that piece of mail/news is my current
option. Is there anything better?
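
One alternative to dropping the whole message is to treat only the date as missing, so the caller can fall back to something else (a Received: header, the file's mtime). A minimal sketch, assuming a hypothetical helper name `parse_mail_date` and a hypothetical "no later than one day from now" plausibility cutoff; neither comes from RMail or the time library. As far as I know, Ruby 1.9.2 and later no longer tie Time to a 32-bit time_t, so there the 2043 example parses without error and only the cutoff rejects it:

```ruby
require 'time'

# Hypothetical helper (not part of RMail or the time library).
# Returns a Time, or nil when the header value cannot be parsed or
# lies later than the `latest` cutoff (an assumed plausibility check).
def parse_mail_date(value, latest = Time.now + 24 * 60 * 60)
  time = Time.parse(value)
  time <= latest ? time : nil
rescue ArgumentError, RangeError
  # Malformed or out-of-range date: report "no usable date".
  nil
end
```

A caller could then write something like `date = parse_mail_date(header) || fallback_date`, where `fallback_date` is whatever substitute makes sense for the archive.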

ruby --version
ruby 1.8.0 (2003-03-03) [i686-linux]

thanks for some hints,
Martin

I believe this is on the RMail todo list, but if you want to have a go at doing it yourself, the format is defined in RFC 2047: it’s a MIME “encoded-word”.

I started playing with something to decode these a couple of weeks back. A partial solution on top of RMail, using some (fairly) simple parsing code and iconv (or, in my case, ruby/gnome2 :), isn't too difficult.
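
The parsing half can be sketched in a few lines. RFC 2047 gives an encoded-word the shape `=?charset?encoding?encoded-text?=`, where the encoding is "B" or "Q". The following is only an illustration of that shape, not the code mentioned above; the names `EncodedWord`, `ENCODED_WORD` and `encoded_words` are hypothetical, and no decoding or validation is attempted:

```ruby
# Hypothetical names, for illustration only (not part of RMail).
EncodedWord = Struct.new(:charset, :encoding, :text)

# RFC 2047 shape: =?charset?encoding?encoded-text?=
ENCODED_WORD = /=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=/

# Return every encoded-word found in a header value, still undecoded.
def encoded_words(header_value)
  header_value.scan(ENCODED_WORD).map do |charset, encoding, text|
    EncodedWord.new(charset, encoding.upcase, text)
  end
end

encoded_words("Subject: =?ISO-8859-1?Q?Erg=E4nzung?=").each do |word|
  puts "#{word.charset} / #{word.encoding} / #{word.text}"
end
```

Each triple then still needs the "B" or "Q" decoding step and a charset conversion.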

I'd be interested if anyone else has a complete solution to this, i.e. one that deals with all the places in the various headers where this type of encoding can appear, does the actual decoding, and checks that only the characters allowed in "Q" encodings for the given header are used, etc.

My solution is far too messy, and, well, nasty to share atm, but I'll probably be going back to it and tidying it up over the next few weeks for an MUA I'm playing with writing.
···

On Wed, 2 Apr 2003 07:48:37 +0900 Martin Pirker crf@sbox.tu-graz.ac.tgtegegerfrwfdweeeedef.at wrote:

In RMail I found header/address parsing functions (nice stuff!), but
I still haven’t figured out how to decode the header parts which have
been encoded for 7-bit(?) compatibility,

e.g. given
Subject: =?ISO-8859-1?Q?Erg=E4nzung?=

should decode to
Subject: Ergänzung

(that’s a German word)
I guess this has something to do with quoted-printable or MIME - are
there some lib functions which do this or how do you do it?


Stephen Lewis
slewis@paradise.net.nz

Saluton!

In RMail I found header/address parsing functions (nice stuff!), but I
still haven’t figured out how to decode the header parts which have been
encoded for 7-bit(?) compatibility,

ASCII that is.

e.g. given
Subject: =?ISO-8859-1?Q?Erg=E4nzung?=

should decode to
Subject: Ergänzung

(that’s a German word)

Synonym: Addendum

If you want to convert this to ordinary Latin-1 text:
If the header matches ‘=?ISO-8859-1?Q?’ (read in the text up to the
terminating ‘?=’), then replace each =XX escape by the corresponding
byte, e.g. =AF by the byte 0xAF. Quick and dirty, but it works. To a lesser
degree this also works with Codepage 1252 (Windows) and Latin-9
(ISO-8859-15). CP1252 defines characters not defined in the ISO charsets,
whereas ISO-8859-15 replaces some of the ISO-8859-1 characters. See
‘Character Encoding’ on my homepage.
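
The replacement described above might look like this in Ruby. It is a rough sketch under a few assumptions: Ruby 2.0 or later (for `String#b`), a hypothetical method name `decode_latin1_q`, and one extra step that is not in the description, namely converting the resulting Latin-1 bytes to UTF-8 so they print correctly on a UTF-8 terminal. It ignores the "_ means space" rule of the "Q" encoding:

```ruby
# Hypothetical helper: quick-and-dirty "Q" decoding, ISO-8859-1 only.
def decode_latin1_q(header)
  header.gsub(/=\?ISO-8859-1\?Q\?([^?]*)\?=/i) do
    text = $1
    # Work on a binary copy and replace each =XX by the byte 0xXX.
    latin1 = text.b.gsub(/=([0-9A-F]{2})/i) { $1.hex.chr }
    # Assumption: the caller wants UTF-8 rather than raw Latin-1 bytes.
    latin1.force_encoding('ISO-8859-1').encode('UTF-8')
  end
end

puts decode_latin1_q("Subject: =?ISO-8859-1?Q?Erg=E4nzung?=")
```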

I have tables that show which hex codes map to which Unicode number for
quite a lot of character sets. Sorry, I don’t have any CJK, but at
least I have VISCII.
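
To see why the charset label matters, the same byte can be looked up under different charsets. This sketch assumes Ruby 1.9 or later and uses Ruby's built-in transcoding tables rather than the tables mentioned above; `codepoint_for` is a hypothetical helper name:

```ruby
# Hypothetical helper: Unicode code point of one byte in a given charset.
def codepoint_for(byte, charset)
  byte.chr.force_encoding(charset).encode('UTF-8').ord
end

[[0xE4, 'ISO-8859-1'],
 [0xA4, 'ISO-8859-1'],
 [0xA4, 'ISO-8859-15'],
 [0x80, 'Windows-1252']].each do |byte, charset|
  printf("0x%02X in %-12s -> U+%04X\n", byte, charset, codepoint_for(byte, charset))
end
```

Byte 0xA4 is the currency sign in ISO-8859-1 but the euro sign in ISO-8859-15, and CP1252 puts the euro sign at 0x80, a position that ISO-8859-1 leaves to control characters.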

Gis,

Josef ‘Jupp’ Schugt http://jupp.tux.nu jupp(AT)gmx(DOT)de


[some hours later]

my quick&dirty hack:

def decode(str)
  str = str.gsub(/=\?iso-8859-(1|15)\?q\?(.*?)\?=/i){|x|
    x.gsub(/=\?iso-8859-(1|15)\?q\?/i,"")[0..-3].unpack("M*").to_s.tr("_"," ")}
  str = str.gsub(/=\?iso-8859-(1|15)\?b\?(.*?)\?=/i){|x|
    x.gsub(/=\?iso-8859-(1|15)\?b\?/i,"")[0..-3].unpack("m*").to_s.tr("_"," ")}
  str
end

not a perfect conversion, but at least -1 and -15, Q and B cases
are readable and that’s good enough for me

the Ruby experts may now suggest a way to roll both into one ;-)

uhhhh… nearly 4am, better go to bed now,
Martin

···

Josef ‘Jupp’ Schugt jupp@gmx.de wrote:

If you want to convert this to ordinary Latin-1 text:
If header matches ‘=?ISO-8859-1?Q?’ (read in text up to terminating
‘=’) then replace =AF by \0xAF. Quick and dirty but works.

I still don’t class myself as a Ruby expert :-) But I do notice the
duplicated gsubs, which are easy to remove because $2 contains whatever
string matched the second parenthesised expression:

str = str.gsub(/=\?iso-8859-(1|15)\?q\?(.*?)\?=/i){|x|
  $2.unpack("M*").to_s.tr("_"," ")
}

Then you can combine the two if you like by selecting the appropriate value
for unpack:

RFC2047_UNPACK = { 'q' => 'M*', 'b' => 'm*' }
def decode(str)
  str.gsub(/=\?iso-8859-(1|15)\?(b|q)\?(.*?)\?=/i){|x|
    $3.unpack(RFC2047_UNPACK[$2.downcase]).to_s.tr("_"," ")
  }
end
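
For anyone trying this on a newer Ruby, a variant of the combined version might look like the sketch below. It rests on a few assumptions: Ruby 1.9 or later, where `Array#to_s` returns the inspect form rather than the joined elements (hence `.first`); UTF-8 as the desired output, which the code above does not attempt; and a hypothetical method name `decode_utf8`. It also translates `_` to a space only for "Q" words, and before unpacking, so that an encoded `=5F` stays an underscore:

```ruby
RFC2047_UNPACK = { 'q' => 'M*', 'b' => 'm*' }

# Hypothetical variant of decode: same regex, UTF-8 output.
def decode_utf8(str)
  str.gsub(/=\?iso-8859-(1|15)\?(b|q)\?(.*?)\?=/i) do
    part, kind, text = $1, $2.downcase, $3
    text = text.tr('_', ' ') if kind == 'q'   # '_' means space only in "Q"
    bytes = text.unpack(RFC2047_UNPACK[kind]).first
    # Label the raw bytes with their charset, then convert to UTF-8.
    bytes.force_encoding("ISO-8859-#{part}").encode('UTF-8')
  end
end

puts decode_utf8("Subject: =?ISO-8859-1?Q?Erg=E4nzung?=")
```

Like the original, this only recognises ISO-8859-1 and ISO-8859-15 and leaves every other charset untouched.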

Regards,

Brian.

···

On Sun, Apr 06, 2003 at 10:45:02AM +0900, Martin Pirker wrote:

[some hours later]

my quick&dirty hack:

def decode(str)
  str = str.gsub(/=\?iso-8859-(1|15)\?q\?(.*?)\?=/i){|x|
    x.gsub(/=\?iso-8859-(1|15)\?q\?/i,"")[0..-3].unpack("M*").to_s.tr("_"," ")}
  str = str.gsub(/=\?iso-8859-(1|15)\?b\?(.*?)\?=/i){|x|
    x.gsub(/=\?iso-8859-(1|15)\?b\?/i,"")[0..-3].unpack("m*").to_s.tr("_"," ")}
  str
end

not a perfect conversion, but at least -1 and -15, Q and B cases
are readable and that’s good enough for me

the Ruby experts may now suggest a way to roll both into one ;-)