[ANN] bj-0.0.2

(note: i just pushed 0.0.2 up and the gem mirrors typically take ~ 30 minutes to sync - be sure you get 0.0.2!)

NAME
   bj

SYNOPSIS
   bj (migration_code|generate_migration|migrate|setup|run|submit|list|set|config|pid) [options]+

DESCRIPTION

   ________________________________
   Overview
   --------------------------------

     Backgroundjob (Bj) is a simple to use background priority queue for rails.
     Although not yet tested on windows, the design of bj is such that operation
     should be possible on any operating system, including M$.

     Jobs can be submitted to the queue directly using the api or from the
     commandline using the 'bj' script. For example

       code:
         Bj.submit 'cat /etc/passwd'

       cli:
         bj submit cat /etc/passwd

     When used from inside a rails application bj arranges that another process
     will always be running in the background to process the jobs that you submit.
     By using a separate process to run jobs bj does not impact the resource
     utilization of your rails application at all and enables several very cool
     features:

       1) Bj allows you to submit jobs to any of your configured databases and,
       in each case, spawns a separate background process to run jobs from that
       queue

         Bj.in :production do
           Bj.submit 'production_job.exe'
         end

         Bj.in :development do
           Bj.submit 'development_job.exe'
         end

       2) Although bj ensures that a process is always running to process
       your jobs, you can start a process manually. This means that any machine
       capable of seeing your RAILS_ROOT can run jobs for your application, allowing
       one to set up a cluster of machines doing the work of a single front end rails
       application.

   ________________________________
   Install
   --------------------------------

     Bj can be installed two ways: as a gem or as a plugin.

       gem:
         1) $sudo gem install bj
         2) add "require 'bj'" to config/environment.rb
         3) bj setup

       plugin:
         1) ./script/plugin install http://codeforpeople.rubyforge.org/svn/rails/plugins/bj
         2) ./script/bj setup

   ________________________________
   Api
   --------------------------------

     submit jobs for background processing. 'jobs' can be a string or array of
     strings. options are applied to each job in the 'jobs', and the list of
     submitted jobs is always returned. options (string or symbol) can be

       :rails_env => production|development|key_in_database_yml
                     when given this keyword causes bj to submit jobs to the
                     specified database. default is RAILS_ENV.

       :priority => any number, including negative ones. default is zero.

       :tag => a tag added to the job. simply makes searching easier.

       :env => a hash specifying any additional environment vars the background
               process should have.

       :stdin => any stdin the background process should have.

     eg:

       jobs = Bj.submit 'echo foobar', :tag => 'simple job'

       jobs = Bj.submit '/bin/cat', :stdin => 'in the hat', :priority => 42

       jobs = Bj.submit './script/runner ./scripts/a.rb', :rails_env => 'production'

       jobs = Bj.submit './script/runner /dev/stdin',
                        :stdin => 'p RAILS_ENV',
                        :tag => 'dynamic ruby code'

       jobs = Bj.submit array_of_commands, :priority => 451
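
   The calling convention described above (string or array in, the same
   options applied to every job, a list always returned) can be sketched
   without bj at all. fake_submit and DEFAULTS below are hypothetical
   stand-ins for illustration, not bj's implementation; only the zero
   default priority comes from the docs above.

```ruby
# Hypothetical stand-in for the submit call, NOT bj's implementation:
# it only mirrors the documented shape (string or array in, list out,
# the same options applied to every job).
DEFAULTS = { :priority => 0, :tag => nil, :env => {}, :stdin => nil }

def fake_submit(jobs, options = {})
  # accept string or symbol keys, since options may be given as either
  options = options.each_with_object({}) { |(k, v), h| h[k.to_sym] = v }
  Array(jobs).flatten.map do |command|
    DEFAULTS.merge(options).merge(:command => command)
  end
end

one  = fake_submit('echo foobar', :tag => 'simple job')
many = fake_submit(['ls', 'date'], 'priority' => 451)

p one.size
p many.map { |job| job[:priority] }
```

   A single string and an array go through the same code path, which is
   why the return value can always be treated as a list.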

   when jobs are run, they are run in RAILS_ROOT. various attributes are
   available *only* once the job has finished. you can check whether or not a
   job is finished by using the #finished? method, which simply does a reload and
   checks to see if the exit_status is non-nil.

     eg:

       jobs = Bj.submit list_of_jobs, :tag => 'important'
       ...

       jobs.each do |job|
         if job.finished?
           p job.exit_status
           p job.stdout
           p job.stderr
         end
       end
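
   Since the attributes only show up once a job has finished, callers
   usually end up polling. A minimal sketch of such a loop follows;
   FakeJob and wait_for are made-up names for illustration, and FakeJob
   only imitates the finished?/exit_status behaviour described above.

```ruby
# Stand-in for a bj job record, for illustration only: it "finishes"
# after a fixed number of polls. Real jobs come back from Bj.submit.
FakeJob = Struct.new(:polls_left, :exit_status) do
  def finished?
    self.polls_left -= 1 if polls_left > 0
    self.exit_status = 0 if polls_left.zero?
    !exit_status.nil?
  end
end

# Poll until every job reports finished?, or give up after `limit` seconds.
# Returns true when all jobs finished, false on timeout.
def wait_for(jobs, limit: 5, interval: 0.01)
  deadline = Time.now + limit
  until jobs.all?(&:finished?)
    return false if Time.now > deadline
    sleep interval
  end
  true
end

jobs = [FakeJob.new(1), FakeJob.new(3)]
puts wait_for(jobs)
puts jobs.map(&:exit_status).inspect
```

   With real jobs each finished? call reloads the record, so a longer
   interval than the one used here is probably kinder to the database.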

   See lib/bj/api.rb for more details.

   ________________________________
   Sponsors
   --------------------------------
     http://www.engineyard.com/
     http://quintess.com/
     http://eparklabs.com/

PARAMETERS
   --rails_root=rails_root, -R (0 ~> rails_root=/Users/ahoward/rails_root)
       the rails_root will be guessed unless you set this
   --rails_env=rails_env, -E (0 ~> rails_env=development)
       set the rails_env
   --log=log, -l (0 ~> log=STDERR)
       set the logfile
   --help, -h

AUTHOR
   ara.t.howard@gmail.com

URIS
   http://codeforpeople.com/lib/ruby/
   http://rubyforge.org/projects/codeforpeople/
   http://codeforpeople.rubyforge.org/svn/rails/plugins/

a @ http://codeforpeople.com/
--
share your knowledge. it's a way to achieve immortality.
h.h. the 14th dalai lama

But why this instead of BackgrounDRb?

···

On 12/12/07, ara.t.howard <ara.t.howard@gmail.com> wrote:

--
Giles Bowkett

Podcast: http://hollywoodgrit.blogspot.com
Blog: http://gilesbowkett.blogspot.com
Portfolio: http://www.gilesgoatboy.org
Tumblelog: http://giles.tumblr.com

well, backgrounddrb was originally written by ezra on top of my slave lib, and ezra is one of the sponsors of bj so hopefully he'll chime in with his reasons, but here are mine

1) much better name. gem install bj? require 'bj'? seriously giles...

2) backgrounddrb, afaik, has proven to be a bit tricky for *non-experts* to manage and use in a production environment.

3) backgrounddrb aims to provide a 'rubyish' environment for code to execute in. in other words you call methods on objects, serialize ruby objects over the wire, etc. this makes entire classes of problems easier to reason about, but it also comes with a price and that price is complexity. for example, most (all?) people have to think about methods like this when using drb

   remote_object.each do |thang|
     thang.intense_computation
   end

now, on which cpu does 'intense' run? in which process? the answer is that it entirely depends on how the objects were set up and how DRbUndumped may or may not have been used. as the maintainer of slave.rb i can tell you that the list of people who understand this is eric hodel and, um, eric hodel. the point is that drb is not an rpc mechanism but a toolset for building servants. using drb every process is potentially either a client or a server and generally both. it's the block passing mechanism that gets people into trouble - blocks cannot go across the wire so drb does some magic to make them work. the other issue with having a 'rubyish' environment to execute code in, in the case of using backgrounddrb with a rails app, is that rails' ruby code tends to do all sorts of nasty things like leak memory like a row boat full of hair trigger shotguns.

4) bj, on the other hand, simply provides a way to fire and forget system calls. these system calls just may happen to use ./script/runner to run some code from within your rails environment, but that's up to you. it may even contact a long running daemon like backgrounddrb to avoid loading your rails app over and over, but again that's up to you. bj does *not* load your rails app or make that code available in any way. all it does is connect to the db and run jobs from a queue - which is another big difference: bj is a priority queue, you can submit 100,000 jobs and forget about it, they will run serially in the background until they are complete. another result of the design is that you can easily fire up runners on other hosts using bj - thereby creating a *cluster* of machines that run jobs on behalf of your front end rails application(s). and, of course, it's easy for development to submit jobs into a production queue and vice versa. the last major difference is that bj is queuing jobs in the database whereas backgrounddrb is dealing with memory/context/closures - if you have backgrounded 100k credit card sales and your application crashes you can probably guess where having the jobs live would be best ;-) with bj the act of submitting a job is a db transaction, and the submitted job can run on its own two feet, so you *know* once submission is complete that, no matter what happens next, that job is recoverable - at least to the extent your database/fs are.

backgrounddrb, bj, and spawn (http://wiki.rubyonrails.org/rails/pages/Tom+Anderson) all serve totally different purposes. i think bj provides the lowest barrier of entry into doing background rails processing and, in cases where the user requires a rails_env and needs to wrap the methodology in a ./script/runner capable script, makes up for making the user do a little work by promising that the application will not start leaking memory or having network issues in production once the script is working from the commandline.

i have not looked at the backgrounddrb code for some time - since the dependency on slave.rb was removed - so i'm positive i've made a few errors in the above explanation - but i'm sure ezra can correct any serious mistakes i've made.

kind regards.

a @ http://codeforpeople.com/

···

On Dec 13, 2007, at 8:34 AM, Giles Bowkett wrote:

But why this instead of BackgrounDRb?

--
we can deny everything, except that we have the possibility of being better. simply reflect on that.
h.h. the 14th dalai lama

I'm trying out bj in a rails environment running 6 mongrel instances.
I'm seeing 6 bj processes running, one for each mongrel.
The pid of the mongrel is included in the running bj command as
--ppid=99999.

I've tried to start a bj process by hand. Following the docs I did:

  ruby script/bj run --forever --rails_env=qa --rails_root=/mnt/app/current

I restarted my mongrel servers. They just ignore the running bj process
and start their own.

I know this isn't by design as you state that there will be only one bj
process for each machine.

Is there some configuration I'm missing? Thanks.

···

--
Posted via http://www.ruby-forum.com/.

Hi,

I am maintaining backgroundrb these days, and things have
changed. It's written on top of an event driven networking lib (packet)
that i wrote.

There are no threads anywhere. Everything is event driven. It still has
real processes, but those have reactor loops of their own. I wrote a
custom protocol for internal communication between workers and it works
reasonably well.

I agree that bj and spawn all have different purposes. If you find
time, please look into the code base and point out any problems that you find.

···

On Fri, 2007-12-14 at 01:53 +0900, ara.t.howard wrote:

On Dec 13, 2007, at 8:34 AM, Giles Bowkett wrote:

--
Let them talk of their oriental summer climes of everlasting
conservatories; give me the privilege of making my own summer with my
own coals.

http://gnufied.org

ara.t.howard wrote:
<snip>

My word. I think you've just saved me a ton of work. Yet again.

A quick question, though. How difficult is it to set up parallel job queues, so that a cluster node can pick up jobs from one queue, process them, and submit them to the next in a chain? Take a search engine's spider as an example - from 20,000 feet you've got a job that fetches a page, a job to parse the contents, followed by a third to index the parsed structure. Chances are that you want different types of cluster node to work on each type of job, and there's different data that you might want to attach at each stage. Is that easy to set up?

···

--
Alex

it's a documentation flaw - sorry. if you want to run only one instance, run it by hand via cron as the docs show - this is vastly easier to monitor and allows the background process to run on a different host to boot.

cheers.

a @ http://codeforpeople.com/

···

On Jul 7, 2008, at 1:02 PM, Colin Shield wrote:

I know this isn't by design as you state that there will be only one bj
process for each machine.

--
we can deny everything, except that we have the possibility of being better. simply reflect on that.
h.h. the 14th dalai lama

A comparison between them would be enlightening. I knew about
backgrounDRb only. I'm prototyping a payroll system and I need to
trigger long-running processes, both reports and processes that modify
the state of the database.

···

On Dec 13, 2007 11:33 AM, hemant kumar <gethemant@gmail.com> wrote:

I agree that, bj , spawn they all have different purpose. If you find
time, please look into code base and suggest any problems that you find.

--
Gerardo Santana

hey hemant - didn't know you'd taken that over. so it seems everyone is in agreement here and to be entirely clear, i'm not suggesting there is anything wrong with either spawn or backgrounddrb. i thought i'd take a quick stab at use cases for each and hopefully you can clarify so people can understand

1) spawn

this just forks your rails app and, as such, is the simplest way to get a background process that has context, etc. it's not going to easily allow say, 1000 incoming requests because you are duping your entire rails process on the fork so you have to be careful with this and about collecting the children so as not to create zombies.

2) backgrounddrb

this is going to be good where you have a medium number of background processes, esp if you want to interact with them. for example, a task spawned with an ajax request, like a video conversion, could then be polled using periodically_call_remote to display progress to the user. these processes are going to be living in memory and, unless you code it, there is no concept of queueing.

3) bj

this is for fire and forget standalone processes that may or may not load the rails_env. examples would include adding users based on an uploaded csv file and emailing each one or updating 100 rss feeds in the background. the queueing effect is going to save your butt if many requests come in at once and tasks are going to be durable across application restart and even reboot (since they live in the db). bj is good where you want to be able to track which jobs succeeded or failed, possibly by an external sweeper process, taking appropriate actions as needed. bj tasks are going to be slower to start if you require the rails_env, since you'll need to load the rails app for each process, but the memory is going to be freed on each transient task so leaks are of no concern.

maybe others can add to this stab at a summary...

cheers.

a @ http://codeforpeople.com/

···

On Dec 13, 2007, at 10:33 AM, hemant kumar wrote:

--
share your knowledge. it's a way to achieve immortality.
h.h. the 14th dalai lama

not exactly but this would be quite close:

1) i'd forget about having specialized nodes unless you have a very good reason - the death of one node will halt the entire processing chain otherwise. it's nice if nodes are dumb from the perspective of robustness. that said i'll add a feature where you can say

   Bj.submit 'job.exe', :runner => 'some.hostname'

to specify which host to run on. this'll be two lines of code so i don't mind adding it.

2) bj supports priorities so here is what i would do. say you've got a three stage job: a, b, c and 1000 initial 'a' tasks. furthermore let's say you make a ./scripts/ directory in your rails_root (bj runs all jobs from the rails_root). so you'll have something like

   ./scripts/task_a
   ./scripts/task_b
   ./scripts/task_c

then you'd do something like this in your rails app

   jobs = inputs.map{|input| "./scripts/task_a #{ input }"}
   Bj.submit jobs, :priority => 10
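
A side note on building those command strings: if a job string is handed to a shell (an assumption here, based on jobs being described as system calls), an input containing spaces or shell metacharacters will be split or interpreted. A small sketch using Ruby's standard Shellwords module; task_a_commands is a made-up helper name and the inputs are invented.

```ruby
require 'shellwords'

# Build one command line per input, escaping each input so spaces or
# shell metacharacters arrive as a single argument. The script path is
# the one used in this thread; the inputs are made up.
def task_a_commands(inputs)
  inputs.map { |input| "./scripts/task_a #{Shellwords.escape(input)}" }
end

commands = task_a_commands(['page one.html', 'plain.html'])
puts commands
```

The resulting array can be passed to Bj.submit exactly like the jobs array above.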

now task_a is going to do this

   #! /usr/bin/env ruby
   input = ARGV.shift
   output = process_for_task_a input
   system "./script/bj submit ./scripts/task_b #{ output } --priority=20"

(of course, if your processing needs to be run through ./script/runner you'll just be able to use the api directly instead of the cli... i'll be adding a feature shortly to allow for running ruby code through script runner directly)

task_b, for its part, runs and submits task_c at priority=30.

so think about that for a minute and imagine you have three processing nodes - each will consume a task_a, run it, and then submit a priority=20 job. therefore each node will probably then get one of those higher priority jobs, run that, and then find the priority=30 task_c job in the queue. when those are done there will be nothing left except priority=10 task_a jobs and another batch will start.

so this will give you parallel processing of a host of tasks.
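
The scheduling described above can be checked with a toy model. This is not bj code: Job, NEXT_STAGE and run_chain are invented for the sketch, and it assumes a runner always takes the pending job with the highest priority number, oldest first among equals, with every node taking one job per round.

```ruby
# Toy model of the three-stage chain, NOT bj itself. Assumption: a runner
# always takes the pending job with the highest priority number, oldest
# first among equals.
Job = Struct.new(:stage, :input, :priority, :seq)

# stage => [follow-up stage, priority of the follow-up job]
NEXT_STAGE = { 'a' => ['b', 20], 'b' => ['c', 30] }

def run_chain(inputs, nodes: 3)
  seq = 0
  queue = inputs.map { |i| Job.new('a', i, 10, seq += 1) }
  order = []
  until queue.empty?
    # each "node" takes one job per round
    batch = queue.sort_by { |j| [-j.priority, j.seq] }.first(nodes)
    batch.each do |job|
      queue.delete(job)
      order << "#{job.stage}#{job.input}"
      follow_up, priority = NEXT_STAGE[job.stage]
      queue << Job.new(follow_up, job.input, priority, seq += 1) if follow_up
    end
  end
  order
end

puts run_chain([1, 2, 3, 4]).join(' ')
```

With four inputs and three nodes the first three inputs go all the way through stages a, b and c before the fourth task_a is picked up, which is the batching behaviour described above.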

make sense?

a @ http://codeforpeople.com/

···

On Dec 13, 2007, at 10:36 AM, Alex Young wrote:

--
we can deny everything, except that we have the possibility of being better. simply reflect on that.
h.h. the 14th dalai lama

ara.t.howard wrote:
if you want to run only one instance run it by hand via cron as the docs show

Thanks for the reply. I've got it all working nicely in production now.
We use monit in production.
The config is under source control so we can deploy changes with cap.
I wrote some monit config to run a bj process. This may help somebody.

##### BJ ####
# BJ is a ruby script that manages background processes.
check process bj with pidfile /mnt/app/shared/pids/bj.pid
    start program = "/bin/bash /mnt/app/current/script/bj.sh start production &"
    stop program = "/bin/bash /mnt/app/current/script/bj.sh stop"

I wrote a bash wrapper (bj.sh) as monit uses pidfiles.

#!/bin/bash

case $1 in
start)
  # record this shell's pid; exec below replaces the shell with the bj
  # runner, so the pid in the file stays valid
  echo $$ > /mnt/app/shared/pids/bj.pid
  exec 2>&1 /usr/bin/ruby1.8 /mnt/app/current/script/bj run --forever \
    --redirect=/mnt/app/shared/log/bj.log --rails_env=$2 \
    --rails_root=/mnt/app/current 1>/mnt/app/shared/log/bj.log
  ;;
stop)
  kill `cat /mnt/app/shared/pids/bj.pid`; rm /mnt/app/shared/pids/bj.pid
  ;;
*)
  echo "usage: bj.sh {start <stage>|stop}" ;;
esac
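
For anyone who would rather check the pidfile from Ruby (from a rake task, say), the liveness test that a pidfile makes possible can be sketched as below. runner_alive? is a made-up helper, not part of bj or monit, and the temporary path stands in for /mnt/app/shared/pids/bj.pid.

```ruby
require 'tmpdir'

# Returns true when the pidfile names a process that is still alive.
# Signal 0 performs the existence check without delivering a signal.
def runner_alive?(pidfile)
  pid = Integer(File.read(pidfile).strip)
  Process.kill(0, pid)
  true
rescue Errno::ENOENT, Errno::ESRCH, ArgumentError
  false   # no pidfile, stale pid, or garbage in the file
rescue Errno::EPERM
  true    # process exists but belongs to another user
end

Dir.mktmpdir do |dir|
  pidfile = File.join(dir, 'bj.pid')   # stand-in for the real pidfile path
  puts runner_alive?(pidfile)          # no pidfile written yet
  File.write(pidfile, Process.pid.to_s)
  puts runner_alive?(pidfile)          # pidfile now names this process
end
```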

···

On Jul 7, 2008, at 1:02 PM, Colin Shield wrote:

--
Posted via http://www.ruby-forum.com/.

ara.t.howard wrote:

>> I know this isn't by design as you state that there will be only one
>> bj process for each machine.

> it's a documentation flaw - sorry. if you want to run only one
> instance, run it by hand via cron as the docs show - this is vastly
> easier to monitor and allows the background process to run on a
> different host to boot.

Ara, I'd like to suggest mentioning this in the README. Everything I've
read aside from this post (which I was lucky to find) makes it seem as
though only one instance of Bj should ever be running at a time. This
caused me quite a lot of frustration trying to track down the problem -
only to find that there wasn't one!

From /bin/bj:

Bj ensures that only one background process is running for your application -
firing up three mongrels or fcgi processes will result in only one background
runner being started. Note that the number of background runners does not
determine throughput - that is determined primarily by the nature of the jobs
themselves and how much work they perform per process.

Thanks,
- Trevor

···

On Jul 7, 2008, at 1:02 PM, Colin Shield wrote:

--
Posted via http://www.ruby-forum.com/.

ara.t.howard wrote:

>> My word. I think you've just saved me a ton of work. Yet again.
>>
>> A quick question, though. How difficult is it to set up parallel job queues, so that a cluster node can pick up jobs from one queue, process them, and submit them to the next in a chain? Take a search engine's spider as an example - from 20,000 feet you've got a job that fetches a page, a job to parse the contents, followed by a third to index the parsed structure. Chances are that you want different types of cluster node to work on each type of job, and there's different data that you might want to attach at each stage. Is that easy to set up?

> not exactly but this would be quite close:
>
> 1) i'd forget about having specialized nodes unless you have a very good reason - the death of one node will halt the entire processing chain otherwise.

I should have been a little clearer - I'm not thinking of one node per task, it's one *class* of nodes per task. I might want 21 processes across 3 machines (for example) all working on the first stage in the chain.

> it's nice if nodes are dumb from the perspective of robustness. that said i'll add a feature where you can say
>
>   Bj.submit 'job.exe', :runner => 'some.hostname'
>
> to specify which host to run on. this'll be two lines of code so i don't mind adding it.

I can see how that'd be a handy thing to have anyway :-)

> 2) bj supports priorities so here is what i would do. say you've got a three stage job: a, b, c and 1000 initial 'a' tasks. furthermore let's say you make a ./scripts/ directory in your rails_root (bj runs all jobs from the rails_root). so you'll have something like
>
>   ./scripts/task_a
>   ./scripts/task_b
>   ./scripts/task_c
>
> then you'd do something like this in your rails app

(I'm not using Rails for the app I'm thinking of using this in, but that's not important)

>   jobs = inputs.map{|input| "./scripts/task_a #{ input }"}
>   Bj.submit jobs, :priority => 10
>
> now task_a is going to do this
>
>   #! /usr/bin/env ruby
>   input = ARGV.shift
>   output = process_for_task_a input
>   system "./script/bj submit ./scripts/task_b #{ output } --priority=20"
>
> (of course, if your processing needs to be run through ./script/runner you'll just be able to use the api directly instead of the cli... i'll be adding a feature shortly to allow for running ruby code through script runner directly)
>
> task_b, for its part, runs and submits task_c at priority=30.
>
> so think about that for a minute and imagine you have three processing nodes - each will consume a task_a, run it, and then submit a priority=20 job. therefore each node will probably then get one of those higher priority jobs, run that, and then find the priority=30 task_c job in the queue. when those are done there will be nothing left except priority=10 task_a jobs and another batch will start.
>
> so this will give you parallel processing of a host of tasks.
>
> make sense?

It does, but it's not *quite* what I'm after. I've got a few other requirements that this strains against - the most pertinent being that I'd like to be able to use priority independently within each task queue. I've got a bit of spare time coming up in the next couple of weeks (Holiday! What a concept! :-), so I'll try hacking something together based on your code.

···

On Dec 13, 2007, at 10:36 AM, Alex Young wrote:

--
Alex

I'm trying to get bj set up with Monit as well, but I can't seem to
find the pid file in the usual places (including app/shared/pids/bj.pid).
Is there any way I can specify a location for the pid file?
Ideally I'd want to stick it outside the virtual file system which is
shared across slices.

Thanks,
Scott

···

On Jul 9, 1:01 pm, Colin Shield <colin_shi...@hotmail.com> wrote:

kill `cat /mnt/app/shared/pids/bj.pid`; rm /mnt/app/shared/pids/bj.pid

maybe just use rq then?

a @ http://codeforpeople.com/

···

On Dec 13, 2007, at 2:32 PM, Alex Young wrote:

(I'm not using Rails for the app I'm thinking of using this in, but that's not important)

--
we can deny everything, except that we have the possibility of being better. simply reflect on that.
h.h. the 14th dalai lama

ara.t.howard wrote:

>> (I'm not using Rails for the app I'm thinking of using this in, but that's not important)

> maybe just use rq then?

Maybe. It's another option to check out :-)

···

On Dec 13, 2007, at 2:32 PM, Alex Young wrote:

--
Alex