Extract data from email - Tmail, Hpricot

Hi all,

I have an html email that I would like to parse.

The problem I'm having is removing all html tags and getting past the
header information. Then I want to extract all the information per row
to put into a database.

The email is pasted here: http://pastie.textmate.org/265259

I have tried Tmail, but can't seem to extract just the body. Then I
tried Hpricot and wasn't sure what to use before the .inner_html. So
basically I'm very lost on where to start.

Any help is appreciated.

Thanks!

--
Posted via http://www.ruby-forum.com/.

It would help if you posted the code that didn't work, so people can have a better idea of what you're trying to do. TMail should have been able to parse that without a problem. Extracting the body by hand is easy, though: the body follows the first empty line. You could use something like split, but duplicating such huge strings could be slow. Instead, when you read the mail, read a line at a time until you reach the empty line, then read the rest into a buffer for Hpricot.
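A minimal sketch of that line-at-a-time approach, for illustration only. The helper name split_headers_and_body and the sample message are made up here; with the real file you would pass an open File for 'emailhtml.eml' instead of a StringIO. It also assumes a single-part message and does not undo any transfer encoding (quoted-printable, base64), which a real mail library would handle.

```ruby
require 'stringio'

# Split a raw message into [header_lines, body]: read one line at a time
# until the first empty line, then read everything that is left.
def split_headers_and_body(io)
  headers = []
  while (line = io.gets)
    stripped = line.chomp          # removes "\n" or "\r\n"
    break if stripped.empty?       # the blank line ends the header block
    headers << stripped
  end
  body = io.read || ''             # the rest of the stream is the body
  [headers, body]
end

# Hypothetical sample message standing in for emailhtml.eml.
raw = "From: user2@sender2.com\r\nSubject: test\r\n\r\n<html><body>Hello</body></html>\r\n"

headers, body = split_headers_and_body(StringIO.new(raw))
puts headers.length
puts body
```

With a file on disk, File.open('emailhtml.eml') { |f| split_headers_and_body(f) } would do the same thing without building an intermediate copy of the whole message.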

--
Michael Morin
Guide to Ruby

Become an About.com Guide: beaguide.about.com
About.com is part of the New York Times Company

Below is the code I am using to try and get the body out of the html
email (copy of the email: http://pastie.org/265259).

require 'rubygems'
require 'tmail'

email = TMail::Mail.load( 'emailhtml.eml' )

puts email['body'] # comes back nil
puts email['from']
puts email['Delivered-To']
puts email['to'] # comes back nil
puts email['subject']
puts email['date']
puts email['X-Originalarrivaltime']

results:
nil
user2@sender2.com
geocooper@gmail.com
nil
[Freddy] New Incidents captured on 2008-09-02
Tue, 2 Sep 2008 19:05:00 -0400
02 Sep 2008 23:10:35.0578 (UTC) FILETIME=[1B2659A0:01C90D51]


On Sep 8, 11:31 am, Geo _C <geocoo...@gmail.com> wrote:
> Below is the code I am using to try and get the body out of the html
> email (copy of the email: http://pastie.org/265259).
>
> require 'rubygems'
> require 'tmail'
>
> email = TMail::Mail.load( 'emailhtml.eml' )
>
> puts email['body'] # comes back nil

Don't see why it would be nil. I would contact Mikel.

  http://lindsaar.net/

> puts email['from']
> puts email['Delivered-To']
> puts email['to'] # comes back nil

I don't see a 'to' in the header, so is this a surprise?

> puts email['subject']
> puts email['date']
> puts email['X-Originalarrivaltime']
>
> results:
> nil
> us...@sender2.com
> geocoo...@gmail.com
> nil
> [Freddy] New Incidents captured on 2008-09-02
> Tue, 2 Sep 2008 19:05:00 -0400
> 02 Sep 2008 23:10:35.0578 (UTC) FILETIME=[1B2659A0:01C90D51]

T.

I got Tmail to extract the body of my email. The solution (very simple
and embarrassing) is below. Now I'm trying to figure out Hpricot, but
examples seem to be fairly thin. If anyone knows of a good tutorial for
beginners, please post. I have been using
http://code.whytheluckystiff.net/doc/hpricot/ , but could use something
more basic.

Thanks for the help!
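Aside for anyone with the same end goal (strip the tags, get one record per table row): here is a rough, stdlib-only sketch. Note that it uses regular expressions rather than Hpricot, so it is not an Hpricot tutorial. It assumes simple <tr>/<td> markup with no nested tables, the helper name table_rows is made up, and the sample HTML is invented rather than taken from the pastie. A real HTML parser will cope far better with messy markup.

```ruby
# Crude, stdlib-only row extraction. Assumes plain <tr>/<td>/<th> markup
# with no nested tables; this is NOT how Hpricot works internally.
def table_rows(html)
  html.scan(%r{<tr\b[^>]*>(.*?)</tr>}mi).map do |(row_html)|
    row_html.scan(%r{<t[dh]\b[^>]*>(.*?)</t[dh]>}mi).map do |(cell_html)|
      text = cell_html.gsub(/<[^>]+>/, '')   # drop any tags left in the cell
      text.gsub('&nbsp;', ' ').strip         # tidy one common entity
    end
  end
end

# Hypothetical sample table, not the real email body.
html = <<~HTML
  <table>
    <tr><th>ID</th><th>Summary</th></tr>
    <tr><td>101</td><td><b>Disk full</b></td></tr>
    <tr><td>102</td><td>Login&nbsp;failure</td></tr>
  </table>
HTML

table_rows(html).each { |cells| puts cells.join(' | ') }
```

Each row comes back as an array of cell strings, which maps fairly directly onto one database insert per row.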

Thomas Sawyer wrote:

>> require 'rubygems'
>> require 'tmail'
>>
>> email = TMail::Mail.load( 'emailhtml.eml' )
>>
>> puts email['body'] # comes back nil
>
> Don't see why it would be nil. I would contact Mikel.

I needed to use email.body instead of email['body'] to return the body.
Thanks, Peter!

>   http://lindsaar.net/
>
>> puts email['from']
>> puts email['Delivered-To']
>> puts email['to'] # comes back nil
>
> I don't see a 'to' in the header, so is this a surprise?

My mistake there. You are correct, there is no 'to' for me to use.
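For anyone who hits the same nil: my understanding is that the square-bracket accessor looks up header fields by name, and the body is not a header, so it has its own method. The toy class below only illustrates that distinction. TinyMail is a made-up stand-in, not TMail's implementation, and it ignores folded headers, multipart messages and encodings.

```ruby
# Toy stand-in for a parsed message; NOT TMail's implementation.
class TinyMail
  attr_reader :body

  def initialize(raw)
    # Everything before the first blank line is headers, the rest is body.
    head, @body = raw.split(/\r?\n\r?\n/, 2)
    @body ||= ''
    @headers = {}
    head.each_line do |line|
      name, value = line.chomp.split(/:\s*/, 2)
      @headers[name.downcase] = value if value
    end
  end

  # Header lookup only (case-insensitive); unknown fields return nil.
  def [](name)
    @headers[name.downcase]
  end
end

# Hypothetical sample message.
mail = TinyMail.new("From: user2@sender2.com\nSubject: hi\n\n<html>...</html>\n")
p mail['body']     # nil: there is no header field called "body"
p mail['to']       # nil: this message has no To: header
puts mail['from']
puts mail.body
```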
