Extract data from email - Tmail, Hpricot

Geo_Cooper · 4 September 2008 14:57

Hi all,

I have an HTML email that I would like to parse.

The problem I'm having is removing all the HTML tags and getting past the
header information. Then I want to extract the information from each row
and put it into a database.

The email is pasted here: http://pastie.textmate.org/265259

I have tried Tmail, but can't seem to extract just the body. Then I
tried Hpricot, but wasn't sure what to call before .inner_html. So
basically I'm very lost on where to start.

Any help is appreciated.

Thanks!

--
Posted via http://www.ruby-forum.com/.

Michael_Morin · 4 September 2008 18:36

It would help if you posted some code that didn't work, so people can have a better idea of what you're trying to do. Tmail should have been able to parse that without a problem. However, extracting the body yourself is easy: the body follows the first empty line. You could use something like split, but duplicating such huge strings could be slow. When you read the mail, try reading one line at a time until you reach the empty line, then read the rest into a buffer for Hpricot.
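
The line-by-line approach described above could look roughly like this. It is only a minimal sketch, assuming a plain single-part message: the helper name split_headers_and_body and the sample message are made up for illustration, and multipart messages or quoted-printable/base64 bodies would need proper decoding, which is what a mail library is for.

```ruby
require 'stringio'

# Split a raw email into header lines and body by reading line by line.
# Accepts any IO-like object (File, StringIO), so the body is read once
# and no large string has to be duplicated.
def split_headers_and_body(io)
  header_lines = []
  while (line = io.gets)
    break if line.strip.empty?   # the first blank line ends the headers
    header_lines << line.chomp
  end
  body = io.read || ''           # everything after the blank line
  [header_lines, body]
end

# Hypothetical sample message standing in for 'emailhtml.eml'.
raw = "From: user2@sender2.com\r\n" \
      "Subject: [Freddy] New Incidents captured on 2008-09-02\r\n" \
      "\r\n" \
      "<html><body><p>Hello</p></body></html>\r\n"

headers, body = split_headers_and_body(StringIO.new(raw))
puts headers.length
puts body
```

With a file on disk the same helper can be called as File.open('emailhtml.eml') { |f| split_headers_and_body(f) }, and the returned body is the buffer to hand to the HTML parser.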

--
Michael Morin
Guide to Ruby


Geo_Cooper · 8 September 2008 15:31

Michael Morin wrote:
> It would help if you posted some code that didn't work, so people can
> have a better idea of what you're trying to do. Tmail should have been
> able to parse that without problem, however, extracting the body is
> easy. The body follows the empty line. You could use something like
> split, but duping such huge strings could be slow. When you read the
> mail, try to read a line at a time until you get the empty line, then
> read the rest into a buffer for hpricot.

Below is the code I am using to try to get the body out of the HTML
email (copy of the email: http://pastie.org/265259).

require 'rubygems'
require 'tmail'

email = TMail::Mail.load( 'emailhtml.eml' )

puts email['body'] # comes back nil
puts email['from']
puts email['Delivered-To']
puts email['to'] # comes back nil
puts email['subject']
puts email['date']
puts email['X-Originalarrivaltime']

results:
nil
user2@sender2.com
geocooper@gmail.com
nil
[Freddy] New Incidents captured on 2008-09-02
Tue, 2 Sep 2008 19:05:00 -0400
02 Sep 2008 23:10:35.0578 (UTC) FILETIME=[1B2659A0:01C90D51]


7rans · 8 September 2008 22:13

On Sep 8, 11:31 am, Geo _C <geocoo...@gmail.com> wrote:
> Below is the code I am using to try and get the body out of the html
> email (copy of email http://pastie.org/265259).
>
> require 'rubygems'
> require 'tmail'
>
> email = TMail::Mail.load( 'emailhtml.eml' )
>
> puts email['body'] # comes back nil

Don't see why it would be nil. I would contact Mikel.

http://lindsaar.net/

> puts email['from']
> puts email['Delivered-To']
> puts email['to'] # comes back nil

I don't see a 'to' in the header, so is this a surprise?

> puts email['subject']
> puts email['date']
> puts email['X-Originalarrivaltime']
>
> results:
> nil
> us...@sender2.com
> geocoo...@gmail.com
> nil
> [Freddy] New Incidents captured on 2008-09-02
> Tue, 2 Sep 2008 19:05:00 -0400
> 02 Sep 2008 23:10:35.0578 (UTC) FILETIME=[1B2659A0:01C90D51]

T.

Geo_Cooper · 11 September 2008 16:29

I got Tmail to extract the body of my email. The solution (very simple
and embarrassing) is below. Now I'm trying to figure out Hpricot, but
examples seem to be fairly thin. If anyone knows of a good tutorial for
beginners, please post it. I have been using the documentation at
http://code.whytheluckystiff.net/doc/hpricot/ but could use something
more basic.

Thanks for the help!
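
For the per-row extraction asked about in the original post, here is a rough sketch of the idea. It deliberately does not use Hpricot: it is a naive, regex-only stand-in that runs with nothing but the standard library, and it assumes the data sits in a simple, non-nested HTML table. The method name table_rows, the ENTITIES map, and the sample table (column names and values) are all invented for illustration, since the real email is only available through the pastie link. With Hpricot the same loop would presumably be written with Hpricot(html), a search for "tr" and "td" elements, and inner_text on each cell, but check that against the Hpricot documentation rather than taking it from here.

```ruby
# A handful of common entities; a real parser handles many more.
ENTITIES = {
  '&amp;' => '&', '&lt;' => '<', '&gt;' => '>', '&quot;' => '"', '&nbsp;' => ' '
}.freeze

# Return the text of every table cell, grouped by row.
# Naive on purpose: fine for simple machine-generated tables,
# not for nested tables or badly broken markup.
def table_rows(html)
  html.scan(%r{<tr\b[^>]*>(.*?)</tr>}mi).map do |(row_html)|
    row_html.scan(%r{<t[dh]\b[^>]*>(.*?)</t[dh]>}mi).map do |(cell_html)|
      text = cell_html.gsub(/<[^>]+>/, '')                  # drop inner tags
      text = text.gsub(/&(?:amp|lt|gt|quot|nbsp);/) { |e| ENTITIES[e] }
      text.gsub(/\s+/, ' ').strip                           # tidy whitespace
    end
  end
end

# Hypothetical email body; the columns and values are made up.
html = <<~HTML
  <html><body>
  <table>
    <tr><th>Id</th><th>Summary</th></tr>
    <tr><td>101</td><td><b>Disk</b> full &amp; slow</td></tr>
    <tr><td>102</td><td>Login failure</td></tr>
  </table>
  </body></html>
HTML

header, *rows = table_rows(html)
rows.each { |row| puts header.zip(row).inspect }
```

Each element of rows is an array of cell strings, so it can be zipped with the header row (as in the last line) or passed straight to whatever inserts records into the database.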

Thomas Sawyer wrote:

>> require 'rubygems'
>> require 'tmail'
>>
>> email = TMail::Mail.load( 'emailhtml.eml' )
>>
>> puts email['body'] # comes back nil
>
> Don't see why it would be nil. I would contact Mikel.
>
> http://lindsaar.net/

I needed to use email.body instead of email['body'] to return the body.
Thanks, Peter!

>> puts email['from']
>> puts email['Delivered-To']
>> puts email['to'] # comes back nil
>
> I don't see a 'to' in the header, so is this a surprise?

My mistake there. You are correct, there is no 'to' for me to use.
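
One plausible reading of the two nil results, assuming that Tmail's [] only looks up header fields (that matches the output posted earlier, but nobody in this thread confirmed it): "body" is not a header, and this message has no To: header, so both lookups find nothing. The toy class below is a hypothetical stand-in, not Tmail, and only models that difference; it ignores folded headers and MIME entirely.

```ruby
# ToyMail is a made-up illustration, NOT the TMail API. It models only the
# difference between a header lookup (mail['x']) and the body accessor.
class ToyMail
  attr_reader :body

  def initialize(raw)
    # Headers and body are separated by the first blank line.
    head, @body = raw.split(/\r?\n\r?\n/, 2)
    @headers = {}
    head.each_line do |line|
      name, value = line.chomp.split(/:\s*/, 2)
      @headers[name.downcase] = value
    end
  end

  # Case-insensitive header lookup; nil when the field is absent.
  def [](name)
    @headers[name.downcase]
  end
end

mail = ToyMail.new("From: user2@sender2.com\nSubject: hi\n\n<p>row data</p>\n")
p mail['from']   # a header that exists
p mail['to']     # no To: header in this message
p mail['body']   # "body" is not a header field
p mail.body      # the message body itself
```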

