Need Advice On Parsing A File

Hello,

I'm creating an application that will parse mbox files, extract the
data, and put it into a db. I have a couple of problems. For those of
you who are not familiar with mbox files, just think of one text file
that stores all of the emails in text format.

1) mbox files keep updating so how do I notify my script that new data has
come in? Do I rerun the script with a placeholder where it last
finished? That would require me to rescan the whole mbox file to find
the placeholder which is pretty bad design.

2) What is the most efficient way to read the emails into memory before
putting it into a db? Since there are multiple emails in each mbox file
will I just read one of the emails, store it into memory, dump it into
db, then replace the current email in memory with the new one?

Thank you for all of your help.

ps. I realize there are scripts that already do this. I'm creating this
for my own learning experience.

···

--
Posted via http://www.ruby-forum.com/.

I'm creating an application that will parse mbox files, extract the
data, and put it into a db. I have a couple of problems. For those of
you who are not familiar with mbox files, just think of one text file
that stores all of the emails in text format.

1) mbox files keep updating so how do I notify my script that new data has
come in? Do I rerun the script with a placeholder where it last
finished? That would require me to rescan the whole mbox file to find
the placeholder which is pretty bad design.

It would probably be more efficient to just store the file position (see IO#tell and IO#seek). However, that does not work if the mbox file is not just appended to but also removed from. And the file position approach also does not solve the problem of concurrent modifications.
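
A minimal sketch of the file position idea, assuming the mbox is only ever appended to. The helper name `read_new_data` is made up for illustration; only `IO#seek`, `IO#read`, `IO#tell` and `File#size` are real API. The caller has to store the returned offset somewhere (a small state file or the db) between runs.

```ruby
# Hypothetical helper: read whatever was appended to `path` since `offset`.
# Returns the new data and the offset to remember for the next run.
def read_new_data(path, offset)
  File.open(path, "rb") do |io|
    # If the file is now shorter than the saved offset, something was
    # removed, so the offset is meaningless: start again from the beginning.
    offset = 0 if io.size < offset
    io.seek(offset)
    data = io.read
    [data, io.tell]
  end
end
```

This does nothing about concurrent writers; a message that is only half written when the script runs would be read half.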

I assume for mbox files there is an established protocol how the file locking works. You'll probably find it somewhere on the web or someone more knowledgeable than me posts information here.

2) What is the most efficient way to read the emails into memory before
putting it into a db? Since there are multiple emails in each mbox file
will I just read one of the emails, store it into memory, dump it into
db, then replace the current email in memory with the new one?

Yes, processing one email at a time seems the most efficient approach, especially since there is no information about other mails that you need to store a single email.

ps. I realize there are scripts that already do this. I'm creating this
for my own learning experience.

Have fun!

Kind regards

  robert

···

On 28.06.2009 02:26, Mrmaster Mrmaster wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Hello,

I'm creating an application that will parse mbox files, extract the
data, and put it into a db. I have a couple of problems. For those of
you who are not familiar with mbox files, just think of one text file
that stores all of the emails in text format.

1) mbox files keep updating so how do I notify my script that new data has
come in? Do I rerun the script with a placeholder where it last
finished? That would require me to rescan the whole mbox file to find
the placeholder which is pretty bad design.

Is it bad design? What happens when the user deletes the first email in the mbox?

2) What is the most efficient way to read the emails into memory before
putting it into a db? Since there are multiple emails in each mbox file
will I just read one of the emails, store it into memory, dump it into
db, then replace the current email in memory with the new one?

efficient? why do you care about efficiency already? Get something working first, worry about efficiency AFTER you've measured stuff, not blindly guessed. Oh... and measure only once you have efficiency issues, until then it is Fast Enough(tm) (cousin of Just Works(tm)).

Luckily in this case, one of the most efficient (time wise) is also the cleanest (code wise):

   File.read(path) #=> contents

···

On Jun 27, 2009, at 17:26 , Mrmaster Mrmaster wrote:

I'm creating an application that will parse mbox files, extract the
data, and put it into a db.

I'd address this on several levels.

First, mbox is not a good idea. Go with maildir or IMAP. Even better, write it
as a "forward" script -- most modern mailservers allow you to configure a
"forward" address to be a script, rather than an email address. Every time a
new message comes in, it runs the script, piping the message to it over
standard input.
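
A rough sketch of what such a forward script could start from, assuming the mailserver pipes exactly one raw message to the script. The method name `split_message` is hypothetical, and real header parsing (folded lines, encodings) is better left to a mail library.

```ruby
# Hypothetical handler: split one raw message into its header block and body.
# The first blank line separates headers from body.
def split_message(io)
  raw = io.read
  headers, body = raw.split(/\r?\n\r?\n/, 2)
  [headers.to_s, body.to_s]
end

# In the forward script itself this would be called as:
#   headers, body = split_message($stdin)
#   ... insert headers and body into the db ...
```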

If you end up doing mbox, maildir, or IMAP, then:

1) mbox files keep updating so how do I notify my script that new data has
come in?

With IMAP or POP3, I'm fairly sure you just have to poll.

With mbox or maildir, there's probably some sort of library to watch a file or
a directory for changes. This is more efficient and responsive, but harder to
do.

Do I rerun the script with a placeholder where it last
finished? That would require me to rescan the whole mbox file to find
the placeholder which is pretty bad design.

Well, you could seek, as others have said...

That works only if no one ever deletes anything. If someone does, I'm really
not sure how you would even know whether a given message was in the database
already. There might be a header you could look for, but that would require
you to, as you said, rescan the whole file.

Probably the best solution, if your script is the only thing processing that
file, is to lock it (however your mailserver supports that) and rename it out
of the way as you process it. That way, any new messages will come into a
brand new mbox file.
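
A sketch of the lock-and-rename approach, under two loud assumptions: the delivery agent uses flock-style locking (it may well use dot-file locking instead), and the platform allows renaming an open file (POSIX does). The name `claim_mbox` and the ".processing" suffix are made up for illustration.

```ruby
# Hypothetical helper: move the mbox aside under an exclusive lock, then
# yield the path of the renamed file so it can be processed in peace.
def claim_mbox(path)
  return nil unless File.exist?(path)  # no mail has arrived yet

  claimed = "#{path}.processing"       # suffix chosen for illustration
  File.open(path, File::RDWR) do |io|
    io.flock(File::LOCK_EX)            # assumes the mailserver honors flock
    File.rename(path, claimed)
  end                                  # closing the file releases the lock
  yield claimed
  claimed
end

# Usage:
#   claim_mbox("/var/mail/someuser") do |file|
#     # parse `file` and store the messages, then delete it
#   end
```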

2) What is the most efficient way to read the emails into memory before
putting it into a db? Since there are multiple emails in each mbox file
will I just read one of the emails, store it into memory, dump it into
db, then replace the current email in memory with the new one?

Probably something like that. If you use a library like TMail, it will
probably take care of using a temporary file to store the mail if it gets too
big -- but it also understands mbox already, so it may be doing more than you
want.

···

On Saturday 27 June 2009 07:26:40 pm Mrmaster Mrmaster wrote:

Unfortunately, no. It is server (both smtp and imap) and OS dependent. sucks.

···

On Jun 28, 2009, at 01:05 , Robert Klemme wrote:

I assume for mbox files there is an established protocol how the file locking works. You'll probably find it somewhere on the web or someone more knowledgeable than me posts information here.

efficient? why do you care about efficiency already? Get something working first, worry about efficiency AFTER you've measured stuff, not blindly guessed. Oh... and measure only once you have efficiency issues, until then it is Fast Enough(tm) (cousin of Just Works(tm)).

I don't fully agree. It does not hurt to spend a quick thought on efficiency here, since you already know that mbox files can grow large. Someone archiving their complete email history in a single mbox file and never deleting anything from it will reach memory limits sooner or later when reading the whole file into memory at once. It may prove more efficient (developer time wise) to start with a solution that is assumed to be more efficient right from the start (in this case, processing a single message at a time). Otherwise you might be stuck with a working solution that needs heavy refactoring, leading to much higher effort (with high likelihood!) than doing the right thing initially.

Luckily, it is not too difficult. You could do something like

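MBox = Struct.new :io do
   include Enumerable

   def each
     msg = nil

     io.each_line do |line|
       # a new message starts with a "From " separator line (note the space,
       # which keeps the "From:" header from matching)
       if /^From / =~ line
         yield msg if msg
         msg = line.dup
       elsif msg
         msg << line
       end
     end

     # the last message has no separator line after it
     yield msg if msg

     self
   end
end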

And then

File.open "mbox" do |io|
   MBox.new(io).each do |message|
     # deal with one message at a time
   end
end

Luckily in this case, one of the most efficient (time wise) is also the cleanest (code wise):

   File.read(path) #=> contents

How do you know without profiling the concrete application? :-)

Kind regards

  robert

···

On 28.06.2009 12:13, Ryan Davis wrote:

On Jun 27, 2009, at 17:26 , Mrmaster Mrmaster wrote:


Hi Everyone,

Thank you for the great advice and especially on the hint of (IO#tell
and IO#seek).

Robert, would you happen to know how I may tell if the file has been
updated/modified? No user would touch the script/email server. I just
need to know how to tell if new emails have come in.

So far all I can think of is setting the script to run every 10 seconds
or at some other interval, starting from the last known position.

···


Ryan Davis wrote:

I assume for mbox files there is an established protocol how the
file locking works. You'll probably find it somewhere on the web or
someone more knowledgeable than me posts information here.

Unfortunately, no. It is server (both smtp and imap) and OS dependent.
sucks.

There are a handful of variants (from top of head: flock locking, and
dot-file locking). The best way of checking is to look at the source
code of something which accesses mbox files, and see what they do. e.g.
an MTA like exim (which has very clear source code), or an MUA like mutt
or pine. All these are written in C.

Alternatively, if you have control over the problem domain, consider
changing to Maildir.

···

On Jun 28, 2009, at 01:05 , Robert Klemme wrote:


Depending on the way the file system's configured, you could probably get by on checking the file modification timestamp or, failing that, the file size. File::Stat is probably a good place to start looking.

On some platforms it's also possible to get the OS kernel to inform you when a filesystem change occurs (dnotify or inotify on linux, kqueue on BSDs, etc.) so you could also do a search for Ruby plug-ins that patch into those subsystems, but it's probably overkill for what you're trying to do.
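
A small sketch of the File::Stat check, assuming plain polling is good enough. The method name `changed?` and the 10 second interval in the commented loop are just placeholders; `File.stat`, `File::Stat#size` and `File::Stat#mtime` are the real API.

```ruby
# Hypothetical helper: has the file changed since the last recorded stat?
def changed?(path, last_size, last_mtime)
  stat = File.stat(path)
  stat.size != last_size || stat.mtime != last_mtime
end

# A polling loop built on it might look like this:
#   stat = File.stat(path)
#   loop do
#     sleep 10
#     next unless changed?(path, stat.size, stat.mtime)
#     stat = File.stat(path)
#     # ... read the new data from the last known position ...
#   end
```

Checking both values is deliberate: timestamp resolution is coarse on some file systems, so two quick writes can leave the mtime unchanged while the size still grows.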

Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net
http://www.linkedin.com/in/eleanormchugh

···

On 28 Jun 2009, at 18:44, Mrmaster Mrmaster wrote:


PS: Just a quick heads up: I just noticed there is a subtle bug in the
code I presented (which was untested anyway). Sorry for any
inconvenience.

Kind regards

robert

···
