Rake task dependencies: via timestamps table in database

Hi all,

There is some interest in the bioinformatics community in using rake
as a workflow tool (see e.g. http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/).
Rake could be ideal for this type of work: a typical workflow will
take data and perform a first set of conversions on it (i.e. a task),
followed by a second set of conversions (that is dependent on the
first task), and so on.

However, bioinformaticians try to keep their data in databases rather
than files. And we found we need some workarounds to get dependencies
working. Does anyone know if it would be very difficult to add
functionality to rake to check a meta table in a database for
timestamps of tasks rather than looking at timestamps of files? I was
thinking of a table looking like the one below:

table: meta

task                             modified_on
==============================================
001_load_data                    20080602_0831
002_calculate_averages           20080602_0845
003_make_histogram_of_averages   20080602_0851

The rakefile would then contain:

# Task names that start with a digit are not valid bare symbols,
# so they are written as strings here.
task "001_load_data" do
   # do stuff
   # automatically update record in meta table
end

task "002_calculate_averages" => ["001_load_data"] do
   # do stuff
   # automatically update record in meta table
end

task "003_make_histogram_of_averages" => ["002_calculate_averages"] do
   # do stuff
   # automatically update record in meta table
end

So if we had reloaded the data (001), then the timestamp for that task
in the meta table would be later than the one for task 002. As a
result, task 002 would automatically have to be rerun if we were to
run task 003.
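
The rerun rule described above can be sketched in plain Ruby, independent of rake. This is only an illustration under stated assumptions: the META and DEPENDS_ON hashes and the stale? helper are hypothetical names standing in for the meta table and the Rakefile's dependency list, and none of them is part of rake.

```ruby
# Hypothetical stand-in for the meta table: task name => modified_on.
# A real version would read these values from the database.
META = {
  "001_load_data"                  => "20080602_0831",
  "002_calculate_averages"         => "20080602_0845",
  "003_make_histogram_of_averages" => "20080602_0851"
}

# Hypothetical dependency list mirroring the Rakefile.
DEPENDS_ON = {
  "001_load_data"                  => [],
  "002_calculate_averages"         => ["001_load_data"],
  "003_make_histogram_of_averages" => ["002_calculate_averages"]
}

# A task is stale if it has no record yet, or if any prerequisite
# is itself stale or carries a later stamp than the task does.
# Stamps in YYYYMMDD_HHMM form sort chronologically as plain strings.
def stale?(task, meta)
  stamp = meta[task]
  return true if stamp.nil?
  DEPENDS_ON.fetch(task).any? { |dep| stale?(dep, meta) || meta[dep] > stamp }
end

# Simulate reloading the data (001) at a later time.
after_reload = META.merge("001_load_data" => "20080602_0900")
puts DEPENDS_ON.keys.select { |task| stale?(task, after_reload) }
```

With the stamps from the table nothing is stale. Once 001 gets a later stamp, 002 becomes stale directly and 003 becomes stale through 002.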

I'd very much like to know if anyone has an idea how rake can be
extended this way. Basically, the dependency checker has to be
extended to look into a fixed table in a database...

Many thanks,
Jan Aerts

-

Dr Jan Aerts
Senior Bioinformatician
Genome Dynamics and Evolution Group
Wellcome Trust Sanger Institute
Hinxton
Cambridge CB10 1SA
UK

phone: +44 (0)1223 - 494732
web: http://www.sanger.ac.uk/Teams/Team29/

Hi Jan, if you look at the source code of rake's FileTask, you'll see
that this shouldn't be very difficult. The code consists of only four
methods and is easy to read. Feel free to ask again if you have more
questions.

Regards,
Pit
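
For readers following along, here is a rough sketch of what a task class loosely modelled on the idea behind FileTask might look like. It is not the biorake code and not rake's own implementation. MetaTable, MetaTask and RUN_LOG are hypothetical names, the "table" is an in-memory hash with a counter as a logical clock instead of a real database column, and the sketch assumes every prerequisite is also a MetaTask.

```ruby
require "rake"

# Hypothetical in-memory stand-in for the meta table.
# Stamps are integers from a logical clock, standing in for modified_on.
module MetaTable
  @stamps = {}
  @clock = 0

  def self.[](task_name)
    @stamps[task_name]
  end

  # Record that task_name was (re)run just now.
  def self.touch(task_name)
    @clock += 1
    @stamps[task_name] = @clock
  end
end

# Task whose staleness is decided by MetaTable instead of file mtimes.
class MetaTask < Rake::Task
  # Needed if never recorded, or if a prerequisite has a later stamp.
  def needed?
    stamp = MetaTable[name]
    stamp.nil? || prerequisite_tasks.any? { |t| t.timestamp > stamp }
  end

  # A task without a record counts as newer than everything else.
  def timestamp
    MetaTable[name] || Float::INFINITY
  end

  # Run the task's blocks, then update its record automatically.
  def execute(args = nil)
    super
    MetaTable.touch(name)
  end
end

RUN_LOG = []
MetaTask.define_task("001_load_data") { RUN_LOG << "001" }
MetaTask.define_task("002_calculate_averages" => ["001_load_data"]) { RUN_LOG << "002" }
MetaTask.define_task("003_make_histogram_of_averages" => ["002_calculate_averages"]) { RUN_LOG << "003" }

# First run: nothing is recorded yet, so all three tasks execute in order.
Rake::Task["003_make_histogram_of_averages"].invoke
```

Rake invokes prerequisites before it asks the task itself whether it is needed, so a rerun of 001 updates its stamp before 002 is checked. Within one process a task is only invoked once unless reenable is called on it.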

2008/6/6 jandot <jan.aerts@gmail.com>:

> There is some interest in the bioinformatics community in using rake
> as a workflow tool (...)
> However, bioinformaticians try to keep their data in databases rather
> than files. And we found we need some workarounds to get dependencies
> working. Does anyone know if it would be very difficult to add
> functionality to rake to check a meta table in a database for
> timestamps of tasks rather than looking at timestamps of files?

Thanks for that pointer, Pit. I think I got quite far now based on
FileTask. But something is still wrong. The trouble is that I have no
idea where, so I can't really ask specific questions...
It looks like the block passed to a task is not executed.

I've put what I already have on GitHub as jandot/biorake (an extension
to rake that can be used to build database-backed workflows).

There's a sample directory with an example Rakefile that should work
once the extension is fixed. In addition, there are two test suites
copied from the file tests. Unfortunately, many of the tests still
fail.

If anybody could have a look at the tests and help to get them
running, I would be very thankful.

Cheers,
jan.

On Jun 9, 9:17 am, Pit Capitain <pit.capit...@gmail.com> wrote:

> 2008/6/6 jandot <jan.ae...@gmail.com>:
>
> > There is some interest in the bioinformatics community in using rake
> > as a workflow tool (...)
> > However, bioinformaticians try to keep their data in databases rather
> > than files. And we found we need some workarounds to get dependencies
> > working. Does anyone know if it would be very difficult to add
> > functionality to rake to check a meta table in a database for
> > timestamps of tasks rather than looking at timestamps of files?
>
> Hi Jan, if you look at the source code of rake's FileTask, you'll see
> that this shouldn't be very difficult. The code consists of only four
> methods and is easy to read. Feel free to ask again if you have more
> questions.
>
> Regards,
> Pit