Horribly impossible debugging task

i've got 30 processes running on 30 machines running jobs taken from an nfs mounted
queue. recently i started seeing random core dumps from them. i've isolated
the bit of code that causes the core dumps to occur - it's this

   class JobRunner
#{{{
     attr :job
     attr :jid
     attr :cid
     attr :shell
     attr :command
     def initialize job
#{{{
       @job = job
       @jid = job['jid']
       @command = job['command']
       @shell = job['shell'] || 'bash'
       @r,@w = IO.pipe
       @cid =
         Util::fork do
           @w.close
           STDIN.reopen @r

         if $want_to_core_dump

           keep = [STDIN, STDOUT, STDERR, @r].map{|io| io.fileno}
           256.times do |fd|
             next if keep.include? fd
             begin
               IO::new(fd).close
             rescue Errno::EINVAL, Errno::EBADF
             end
           end

         end

           if File::basename(@shell) == 'bash' || File::basename(@shell) == 'sh'
             exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '--login'
           else
             exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '-l'
           end
         end
       @r.close
#}}}
     end
     def run
#{{{
       @w.puts @command
       @w.close
#}}}
     end
#}}}
   end

now here's the tricky bit. the core dump doesn't happen here - it happens at
some random time later, and then again sometimes it doesn't. the context this
code executes in is complex, but here's the gist of it

   sqlite database transaction started - this opens some files like db-journal,
   etc.

   a job is selected from database

     fork job runner - this closes open files except stdin, stdout, stderr, and
     com pipe

   the job pid and other accounting is committed to database

the reason i'm trying to close all the files in the first place is because the
parent eventually unlinks some of them while the child still has them open -
this causes nfs sillynames to appear when running on nfs (.nfsxxxxxxxxx).
this causes no harm as the child never uses these fds - but with 30 machines
i end up with 90 or more .nfsxxxxxxx files lying around looking ugly. these
eventually go away when the child exits but some of these children run for 4
or 5 or 10 days so the ugliness is constantly in my face - sometimes growing
to be quite large.
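one alternative to closing fds in the child at all: mark each descriptor
close-on-exec so the kernel drops it at exec time and no ruby IO object is
ever touched. a minimal sketch, assuming an IO can be obtained for the fd in
question:

   require 'fcntl'

   # mark an io close-on-exec: the fd survives the fork but is closed
   # automatically when the child execs the shell, so the unlinked nfs
   # files are never held open by the job
   def cloexec io
     io.fcntl Fcntl::F_SETFD, io.fcntl(Fcntl::F_GETFD) | Fcntl::FD_CLOEXEC
   end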

back to the core dump...

basically if i DO close all the filehandles i'll, maybe, core dump sometime
later IN THE PARENT. if i do NOT close them the parent never core dumps. the
core dumps are totally random and show nothing in common except one thing -
they all show a signal received in the stack trace - i'm guessing this is
SIGCHLD. i have some signal handlers setup for stopping/restarting that look
exactly like this:

       trap('SIGHUP') do
         $signaled = $sighup = true
         warn{ "signal <SIGHUP>" }
       end
       trap('SIGTERM') do
         $signaled = $sigterm = true
         warn{ "signal <SIGTERM>" }
       end
       trap('SIGINT') do
         $signaled = $sigint = true
         warn{ "signal <SIGINT>" }
       end

in my event loop i obviously take appropriate steps for the $sigXXX.

as i said, however, i don't think these are responsible since they don't
actually get run as these signals are not being sent. i DO fork for every job
though so that's why i'm guessing the signal is SIGCHLD.
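a cheap way to test that guess is to trap SIGCHLD explicitly and log each
delivery, then line the log up against the core dump times. a minimal sketch
(the log path is made up):

   # log every SIGCHLD so its timing can be correlated with the crashes
   trap('SIGCHLD') do
     File.open('/tmp/sigchld.log', 'a'){|f| f.puts "SIGCHLD #{ Time.now }"}
   end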

so - here's the question: what kind of badness could closing fd's be causing
in the PARENT? i'm utterly confused at this point and don't really know
where to look next... could this be a ruby bug or am i just breaking some
unix law and getting bitten.

thanks for any advice.

kind regards.

-a

···

--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
A flower falls, even though we love it;
and a weed grows, even though we do not love it. --Dogen

===============================================================================

Ara --

    Random thoughts:

      * It could be a race condition of some sort
      * It could be that closing the file in the child closes it for the
        parent even though closing it for the parent does not close it
        for the child
      * It could be that you omitted a file from your keep list that the
        child actually needs. It tries to access it, goes boom,...
      * can you make it happen in a simplified situation (e.g. one
        child, etc.)
      * is it possible to make nfs put the ugly files somewhere you
        can't see them? I know much of the software I run has lots of
        ugly files (e.g. the web browser cache), but they don't bother
        me because I don't look at them.
      * Instead of specifying the files you want to keep (STDIN, etc)
        could you list the ones you want to close, and narrow the
        problem down that way?

    I don't know if any of these will help, but I can't see that they
could hurt (I used to say that "ideas can't hurt you" but I'm older
now).

      -- MarkusQ


Ara.T.Howard wrote:

> i've isolated the bit of code that causes the core dumps to occur
> ...
> now here's the tricky bit. the core dump doesn't happen here - it happens
> at some random time later, and then again sometimes it doesn't.

This sort of scenario is almost always caused by a memory corruption,
either an array-out-of-bounds write causing corruption of the memory
allocation arena, or a reference to an object that's been deleted.
I've chased dozens of these up until five or so years ago, when I got
my hands on a copy of Purify. It catches the corruption *at source*,
and has become quite simply indispensable for this (and many other)
tasks.

The world would be a better place if every developer used Purify on
every release. Note that it's *not* the same as most "bounds checker"
type tools; it actually maintains a parallel table of markers for
every memory location, and *rewrites the machine instructions* for every
memory reference so it can also check validity against the marker table.

It's quite simply the bee's knees if you must write in C or a similarly
primitive language :-).

Clifford Heath.

Hi,

At Fri, 17 Sep 2004 03:54:52 +0900,
Ara.T.Howard wrote in [ruby-talk:112814]:

>       @cid =
>         Util::fork do
>
>       trap('SIGHUP') do
>         $signaled = $sighup = true
>         warn{ "signal <SIGHUP>" }

What are these, "Util::fork" and "warn" with block?

···

--
Nobu Nakada

Ara.T.Howard wrote:

> i've got 30 processes running on 30 machines running jobs taken from an
> nfs mounted queue. ...
>   sqlite database transaction started - this opens some files like
>   db-journal, etc.

Ara,

I could be way off here, but are you opening your SQLite database over NFS? I think this can often lead to problems due to the locking not working, so maybe something is going wrong inside the sqlite library code?

You might want to look at section 7 on http://www.sqlite.org/faq.html.

Cheers,
   Kevin

Ara --

>    Random thoughts:
>
>      * It could be a race condition of some sort

yes - perhaps even in some library code i'm exercising - this is my current
best guess.

>      * It could be that closing the file in the child closes it for the
>        parent even though closing it for the parent does not close it
>        for the child

hmmm - not that one:

harp:~ > ruby -e'f = open "f","w";fork{ f.close };Process.wait;f.puts 42'
harp:~ > cat f
42

>      * It could be that you omitted a file from your keep list that the
>        child actually needs. It tries to access it, goes boom,...

i do an exec of bash immediately after so i think that's out since bash cannot
possibly require anything ruby or sqlite has open other than stdin, stdout,
and stderr.

>      * can you make it happen in a simplified situation (e.g. one
>        child, etc.)

yes. but not predictably either. it can run for days, or minutes.
unfortunately (for debugging) it usually takes about 3 days before a core
dump - difficult to work with...

>      * is it possible to make nfs put the ugly files somewhere you
>        can't see them? I know much of the software I run has lots of
>        ugly files (e.g. the web browser cache), but they don't bother
>        me because I don't look at them.

i handle that this way now:

     def sillyclean dir = @dirname
#{{{
       glob = File.join dir,'.nfs*'
       orgsilly = Dir[glob]           # silly files present before the block
       yield                          # run the transaction/fork code
       newsilly = Dir[glob]           # silly files present after
       silly = newsilly - orgsilly    # only the ones the block created
       silly.each{|path| FileUtils::rm_rf path}
#}}}
     end

this code wraps ONLY the transaction/fork code. it is safe because i know any
silly file left over from a transaction was created due to sqlite not setting
close-on-exec on its tmp files. plus removing a silly file cannot hurt because
they spring back into existence (by definition) if someone actually still
needs them. so, if the remove succeeds then no-one was actually using them.
this is indeed what happens - they are removed never to return. i just hate
this sort of thing.

>      * Instead of specifying the files you want to keep (STDIN, etc)
>        could you list the ones you want to close, and narrow the
>        problem down that way?

yes - i'm working on that. the problem is that i actually KNOW the filename
that gets unlinked and causes the sillyname - it's the 'db-journal' file (i
can see a .nfsXXXX file come into existence with its exact contents). the
problem is that the sqlite api opens this file and i have no file handle on
it. problem two is that ruby does not provide a way to get at this info that
i know of. you could

   256.times do |fd|
     begin
       file = IO::new fd
       File::unlink file.path if file.path =~ %r/db-journal/o
     rescue Errno::EBADF, Errno::EINVAL
     end
   end

__except__ that File objects created this way do not have a path! (nor
respond_to?('path') for that matter) - at least on my ruby. i'm not sure if
this is a bug or not...
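on linux, though, /proc offers a way around this: /proc/self/fd holds one
symlink per open descriptor, so the path behind every fd can be recovered
with no help from ruby at all. a minimal sketch, linux-only:

   # map each open fd to its path via /proc - this works even for files
   # opened inside c extensions (like sqlite's db-journal) for which no
   # ruby IO object exists
   Dir['/proc/self/fd/*'].each do |link|
     begin
       puts "fd #{ File.basename link } -> #{ File.readlink link }"
     rescue Errno::ENOENT, Errno::EACCES
     end
   end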

>    I don't know if any of these will help, but I can't see that they
> could hurt (I used to say that "ideas can't hurt you" but I'm older
> now).

funny. yeah - anything helps - i'm grasping at straws!

cheers.

-a

···

On Fri, 17 Sep 2004, Markus wrote:
--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
A flower falls, even though we love it;
and a weed grows, even though we do not love it. --Dogen

===============================================================================

Sounds like the same thing valgrind does (for free). It might be
interesting to try valgrind on this, if it's a memory related bug. The
downside is that running the code through valgrind will slow it down by a
factor of 30 to 60 (from personal experience). So, it's not really an option
if the bug only shows up after a couple of days...

Ruben

···

At Fri, 17 Sep 2004 14:46:07 +0900, Clifford Heath wrote:

> This sort of scenario is almost always caused by a memory corruption ...
> I've chased dozens of these up until five or so years ago, when I got
> my hands on a copy of Purify.

Clifford Heath <cjh-nospam@nospaManagesoft.com> wrote in message news:<1095399555.977916@excalibur.osa.com.au>...

> The world would be a better place if every developer used Purify on
> every release.

Also worth checking out (for linux/x86 only, I believe) is valgrind.
It also does very good memory/bounds checking, and is free.

Nathan

Util::fork is simply a 'quiet' fork:

     module Util
   #{{{
       class << self
         def export sym
   #{{{
           sym = "#{ sym }".intern
           module_function sym
           public sym
   #}}}
         end
         def append_features c
   #{{{
           super
           c.extend Util
   #}}}
         end
       end

       ...

       def fork(*a, &b)
   #{{{
         begin
           verbose = $VERBOSE
           $VERBOSE = nil
           Process::fork(*a, &b)
         ensure
           $VERBOSE = verbose
         end
   #}}}
       end
       export 'fork'

       ...

   #}}}
     end

warn with a block delegates to a Logger object:

   class Main
   #{{{

     ...

     %w( debug info warn error fatal ).each do |m|
       eval "def #{ m }(*a,&b);@logger.#{ m }(*a,&b);end"
     end

     ...

   #}}}
   end

regards.

-a

···

On Wed, 22 Sep 2004 nobu.nokada@softhome.net wrote:

> What are these, "Util::fork" and "warn" with block?

--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
A flower falls, even though we love it;
and a weed grows, even though we do not love it. --Dogen

===============================================================================

On Wed, 22 Sep 2004, Kevin McConnell wrote:

> I could be way off here, but are you opening your SQLite database over NFS?

oh yeah - definitely, from many machines at once! ;-)

> I think this can often lead to problems due to the locking not working, so
> maybe something is going wrong inside the sqlite library code?

the locking is fcntl based - so it's nfs safe on any decent (not sun) nfs
implementation. ours is pure linux on both server and client nodes.

> You might want to look at section 7 on http://www.sqlite.org/faq.html.

i have. ;-)

essentially i am not relying on sqlite's locking exclusively : my code has an
additional 'lock file' (an empty file to which nfs safe locks are applied -
see my posixlock module on the raa) which i use to ensure single writer
multiple reader semantics on a __file__ level (sqlite guarantees this on a
__byte_range__ level). in addition i am using an nfs safe lockfile class (my
lockfile package in the raa) to assist for certain touchy operations. in
summary i am manually coordinating access to the database in a way that is
safe and transactionally protected. the access is logically this:

   acquire separate lock of read or write type

     open database

       begin a transaction

         execute sql

       end transaction

     close database

   release separate lock of read or write type

this is wrapped with code that autodetects and recovers from several
potential errors such as a failed lockd server or failed io operations.
although i can force these to happen and my code handles them, i have never
actually seen it happen in practice.
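the shape of that wrapper is roughly the following - a minimal sketch only,
using ruby's builtin File#flock (which is NOT nfs safe) to stand in for the
posixlock calls, and with the sqlite-ruby api names assumed:

   require 'sqlite'   # sqlite-ruby binding - api names assumed

   def locked_transaction lockpath, mode = File::LOCK_SH
     File.open(lockpath, 'r+') do |lockfile|
       lockfile.flock mode                    # acquire read or write lock
       db = SQLite::Database.new 'queue.db'   # open database
       begin
         db.transaction{ yield db }           # begin/execute/end transaction
       ensure
         db.close                             # close database
       end
     end                                      # lock released with the file
   end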

the code in question is a system that allows scientists to configure a linux
cluster to work on a huge stack of work in under a minute with zero sysad
intervention. at this point we've run about 3 million jobs through the system
without incident in the face of two power outages, dozens of reboots, and
steady extreme (load > 30) nfs load.

here's a shot of one of our clusters now:

   yacht:~/shared > rq queue status

   ---
   pending : 5875
   running : 36
   finished : 1108
   dead : 0

   yacht:~/shared > rq queue list running | head -20
   ---
   -
    jid: 1324
    priority: 0
    state: running
    submitted: 2004-09-20 09:16:39.449169
    started: 2004-09-22 03:55:24.914682
    finished:
    elapsed:
    submitter: jib.ngdc.noaa.gov
    runner: redfish.ngdc.noaa.gov
    pid: 11519
    exit_status:
    command: /dmsp/moby-1-1/cfadmin/shared/jobs/wavgjob /dmsp/moby-1-1/conf/avg_dn/filelists/F142000.included F142000.cloud2.light1.tile8 /dmsp/moby-1-1/conf/avg_dn/cloud2.light1.tile8.conf cfd2://cfd2-3/F142000/
   -
    jid: 1325
    priority: 0
    state: running
    submitted: 2004-09-20 09:16:39.449169
    started: 2004-09-22 04:12:32.758249

this stack of work will take about a week to complete using 18 nodes.

from the man page of the main commandline program 'rq':

   NAME
     rq v0.1.2

   SYNOPSIS
     rq [queue] mode [mode_args]* [options]*

   DESCRIPTION
     rq is an __experimental__ tool used to manage nfs mounted work
      queues. multiple instances of rq on multiple hosts can work from
     these queues to distribute processing load to 'n' nodes - bringing many dozens
     of otherwise powerful cpus to their knees with a single blow. clearly this
     software should be kept out of the hands of radicals, SETI enthusiasts, and
     one mr. jeff safran.

     rq operates in one of the modes create, submit, feed, list, delete,
     query, or help. depending on the mode of operation and the options used the
     meaning of mode_args may change, sometimes wildly and unpredictably (i jest, of
     course).

   MODES

     modes may be abbreviated to uniqueness, therefore the following shortcuts
     apply :

       c => create
       s => submit
       f => feed
       l => list
       d => delete
       q => query
       h => help

     create, c :

       creates a queue. the queue MUST be located on an nfs mounted file system
       visible from all nodes intended to run jobs from it.

       examples :

         0) to create a queue
             ~ > rq q create
           or simply
             ~ > rq q c

     list, l :

       show combinations of pending, running, dead, or finished jobs. for this
       command mode_args must be one of pending, running, dead, finished, or all.
       the default is all.

       mode_args may be abbreviated to uniqueness, therefore the following
       shortcuts apply :

         p => pending
         r => running
         f => finished
         d => dead
         a => all

       examples :

         0) show everything in q
             ~ > rq q list all
           or
             ~ > rq q l all
           or
             ~ > export RQ_Q=q
             ~ > rq l

         0) show q's pending jobs
             ~ > rq q list pending

         1) show q's running jobs
             ~ > rq q list running

         2) show q's finished jobs
             ~ > rq q list finished

     submit, s :

       submit jobs to a queue to be processed by any feeding node. any mode_args
       are taken as the command to run. note that mode_args are subject to shell
       expansion - if you don't understand what this means do not use this feature.

       when running in submit mode a file may be specified as a list of commands to
       run using the '--infile, -i' option. this file is taken to be a newline
       separated list of commands to submit, blank lines and comments (#) are
       allowed. if submitting a large number of jobs the input file method is MUCH
       more efficient. if no commands are specified on the command line rq
       automatically reads them from STDIN. yaml formatted files are also allowed
       as input (http://www.yaml.org/) - note that output of nearly all rq
       commands is valid yaml and may, therefore, be piped as input into the submit
       command.

       the '--priority, -p' option can be used here to determine the priority of
       jobs. priorities may be any whole number in [0, 9]; therefore 9 is the maximum
       priority. submitting a high priority job will NOT supplant currently
       running low priority jobs, but higher priority jobs will always migrate
       above lower priority jobs in the queue in order that they be run sooner.
       note that constant submission of high priority jobs may create a starvation
       situation whereby low priority jobs are never allowed to run. avoiding this
       situation is the responsibility of the user.

       examples :

         0) submit the job ls to run on some feeding host

           ~ > rq q s ls

         1) submit the job ls to run on some feeding host, at priority 9

           ~ > rq -p9 q s ls

         2) submit 42000 jobs (quietly) to run from a command file.

           ~ > wc -l cmdfile
           42000
           ~ > rq q s -q < cmdfile

         3) submit 42 jobs to run at priority 9 from a command file.

           ~ > wc -l cmdfile
           42
           ~ > rq -p9 q s < cmdfile

         4) re-submit all finished jobs

           ~ > rq q l f | rq q s

     feed, f :

       take jobs from the queue and run them on behalf of the submitter. jobs are
       taken from the queue in an 'oldest highest priority' order.

       feeders can be run from any number of nodes allowing you to harness the CPU
       power of many nodes simultaneously in order to more effectively clobber
       your network.

       the most useful method of feeding from a queue is to do so in daemon mode so
       that the process loses its controlling terminal and will not exit when you
       exit your terminal session. use the '--daemon, -d' option to accomplish
       this. by default only one feeding process per host per queue is allowed to
       run at any given moment. because of this it is acceptable to start a feeder
       at some regular interval from a cron entry since, if a feeder is already
       running, the process will simply exit and otherwise a new feeder will be
       started. in this way you may keep a feeder running even across machine
       reboots.

       examples :

         0) feed from a queue verbosely for debugging purposes, using a minimum and
            maximum polling time of 2 and 4 respectively

           ~ > rq q feed -v4 -m2 -M4

         1) feed from a queue in daemon mode logging into /home/ahoward/rq.log

           ~ > rq q feed -d -l/home/ahoward/rq.log

         2) use something like this sample crontab entry to keep a feeder running
            forever (it attempts to (re)start every fifteen minutes)

           #
           # your crontab file
           #

           */15 * * * * /full/path/to/bin/rq /full/path/to/nfs/mounted/q f -d -l/home/user/rq.log

           log rolling while running in daemon mode is automatic.

     delete, d :

       delete combinations of pending, running, finished, dead, or specific jobs.
       the delete mode is capable of parsing the output of list mode, making it
       possible to create filters to delete jobs meeting very specific conditions.

       mode_args are the same as for 'list', including 'running'. note that it is
       possible to 'delete' a running job, but there is no way to actually STOP it
        mid execution since the node doing the deleting has no way to communicate
       this information to the (possibly) remote execution host. therefore you
       should use the 'delete running' feature with care and only for housekeeping
       purposes or to prevent future jobs from being scheduled.

       examples :

         0) delete all pending, running, and finished jobs from a queue

           ~ > rq q d all

         1) delete all pending jobs from a queue

           ~ > rq q d p

         2) delete all finished jobs from a queue

           ~ > rq q d f

         3) delete jobs via hand crafted filter program

           ~ > rq q list | filter_prog | rq q d

     query, q :

        query exposes the database more directly to the user, evaluating the where
        clause specified on the command line (or from STDIN). this feature can be
        used to make a fine grained selection of jobs for reporting or as input into
       the delete command. you must have a basic understanding of SQL syntax to
       use this feature, but it is fairly intuitive in this capacity.

       examples:

         0) show all jobs submitted within a specific 10 minute range

           ~ > rq q query "started >= '2004-06-29 22:51:00' and started < '2004-06-29 22:51:10'"

         1) shell quoting can be tricky here so input on STDIN is also allowed

            ~ > cat constraints
           started >= '2004-06-29 22:51:00' and
           started < '2004-06-29 22:51:10'

            ~ > rq q query < constraints
             or (same thing)

            ~ > cat constraints | rq q query

         2) this query output may then be used to delete specific jobs

            ~ > cat constraints | rq q query | rq q d

         3) show all jobs which are either finished or dead

           ~ > rq q q state=finished or state=dead

   NOTES
     - realize that your job is going to be running on a remote host and this has
       implications. paths, for example, should be absolute, not relative.
       specifically the submitted job must be visible from all hosts currently
       feeding from a q.

     - you need to consider __CAREFULLY__ what the ramifications of having multiple
       instances of your program all running at the same time will be. it is
       beyond the scope of rq to ensure multiple instances of a program
       will not overwrite each others output files, for instance. coordination of
       programs is left entirely to the user.

     - the list of finished jobs will grow without bound unless you sometimes
       delete some (all) of them. the reason for this is that rq cannot
       know when the user has collected the exit_status, etc. from a job and so
       keeps this information in the queue until instructed to delete it.

     - if you are using the crontab feature to maintain an immortal feeder on a
       host then that feeder will be running in the environment provided by cron.
       this is NOT the same environment found in a login shell and you may be
       surprised at the range of commands which do not function. if you want
       submitted jobs to behave as closely as possible to their behaviour when
       typed interactively you'll need to wrap each job in a shell script that
       looks like the following:

         #!/bin/bash --login
         commands_for_your_job

       and submit that script

   ENVIRONMENT
     RQ_Q: full path to queue

       the queue argument to all commands may be omitted if, and only if, the
       environment variable 'RQ_Q' contains the full path to the q. eg.

         ~ > export RQ_Q=/full/path/to/my/q

       this feature can save a considerable amount of typing for those weak of wrist

   DIAGNOSTICS
    success => $? == 0
    failure => $? != 0

   AUTHOR
    ara.t.howard@noaa.gov

   BUGS
    1 < bugno && bugno <= 42

   OPTIONS

     -f, --feed=appetite
     -p, --priority=priority
         --name
     -d, --daemon
     -q, --quiet
     -e, --select
     -i, --infile=infile
     -M, --max_sleep=seconds
     -m, --min_sleep=seconds
     -l, --log=path
     -v=0-4|debug|info|warn|error|fatal
         --verbosity
         --log_age=log_age
         --log_size=log_size
     -c, --config=path
         --template=template
     -h, --help

so far it looks like the solution to my problem was to close the database
after forking (if it was open) but i'm still testing this approach.
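for reference, that change amounts to dropping the inherited handle on the
child side of the fork before anything else runs - a minimal sketch, assuming
the handle is reachable as @db (shell selection elided):

   @cid =
     Util::fork do
       @db.close if @db   # the child must not share the parent's sqlite fds
       @w.close
       STDIN.reopen @r
       exec [@shell, "__rq_job__#{ @jid }__#{ File.basename(@shell) }__"], '--login'
     end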

kind regards.

-a
--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
A flower falls, even though we love it;
and a weed grows, even though we do not love it. --Dogen

===============================================================================

actually both are options since the code in question simply manages a queue of
jobs and the cost is about 1/1000th of the actual work. i've used valgrind and
purify before with some success. i had a really hard to track down bug about
a year ago and ended up needing valgrind, purify, and dmalloc to track it
down. these are good suggestions as i'd forgotten about them. it'll be
pretty tough to set up but possible.

this is getting a bit OT now so any responders should probably ping me offline
unless anyone has anything specific to ruby regarding closing all file
descriptors after a fork and related bugs.

kind regards.

-a

···

On Fri, 17 Sep 2004, Ruben wrote:

> Sounds like the same thing valgrind does (for free). It might be
> interesting to try valgrind on this, if it's a memory related bug.

--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
A flower falls, even though we love it;
and a weed grows, even though we do not love it. --Dogen

===============================================================================

Hello Ruben,

> Sounds like the same thing valgrind does (for free). It might be
> interesting to try valgrind on this, if it's a memory related bug. The
> downside is that running the code through valgrind will slow it down by a
> factor of 30 to 60 (from personal experience). So, it's not really an
> option if the bug only shows up after a couple of days...

And now you see the difference between good working (and expensive)
commercial tools and freeware tools like valgrind.

But to be honest you make valgrind sound worse than it is. The slowdown
should be a factor of 10 to 20.

···

--
Best regards, emailto: scholz at scriptolutions dot com
Lothar Scholz http://www.ruby-ide.com
CTO Scriptolutions Ruby, PHP, Python IDE 's

Ah...yeah, I suspected I was just stating the obvious :-)

Good luck with the solution, though.

   Kevin

···

Ara.T.Howard@noaa.gov wrote:

>> You might want to look at section 7 on http://www.sqlite.org/faq.html.
>
> i have. ;-)

Lothar,

> And now you see the difference between good working (and expensive)
> commercial tools and freeware tools like valgrind.

I've heard before that Purify is good, but I don't have any experience
with it myself, and it might not be an option for everyone because of
the cost.

(besides, I don't think that commercial tools are necessarily bad and
free tools are necessarily good, or the other way around...)

> But to be honest you make valgrind sound worse than it is. The slowdown
> should be a factor of 10 to 20.

Ah.. that's probably because I used 'callgrind' recently which is also
a skin for valgrind and probably more expensive than the memcheck
skin. I guess it also depends on the kind of code that's run.

Ruben

>>> You might want to look at section 7 on http://www.sqlite.org/faq.html.
>>
>> i have. ;-)
>
> Ah...yeah, I suspected I was just stating the obvious :-)

better to assume nothing when debugging though - i AM grasping at straws so
i'm overlooking nothing. i went back and re-read the docs at your suggestion
- now i'm re-reading the sqlite_close code.

> Good luck with the solution, though.

luck would be nice.

regards.

-a

···

On Wed, 22 Sep 2004, Kevin McConnell wrote:


--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
A flower falls, even though we love it;
and a weed grows, even though we do not love it. --Dogen

===============================================================================

Just had one other suggestion (hopefully more useful than the last :-)

Could you separate out the db-related code into a little 'proxy' app, to run
on the same machine as where the db files are, and have your clients connect
to it (to read the job, submit the pid etc)? It might help solve any
potential locking hassles (if that's even the problem), since the only thing
touching the database would be local. And hey, if nothing else, it could be
interesting to find out which side of the code coredumps :-)

Cheers,
  Kevin

i'm now looking at using detach.rb, which creates a drb object out of any
existing object. basically it would be a little servlet for the daemon's use
only. i think this may be the way to go. thanks for the idea.
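the bones of such a proxy in stdlib drb look something like this - a minimal
sketch with hypothetical method names (detach.rb automates the same
machinery):

   require 'drb'

   class QProxy
     def initialize db
       @db = db
     end
     def next_job      # runs on the host that owns the db files,
       @db.next_job    # so sqlite's locking is purely local-disk
     end
   end

   # on the db host:
   #   DRb.start_service 'druby://0.0.0.0:4242', QProxy.new(db)
   #   DRb.thread.join
   # on each feeder:
   #   DRb.start_service
   #   q = DRbObject.new nil, 'druby://dbhost:4242'
   #   job = q.next_job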

regards.

-a

···

On Wed, 22 Sep 2004, Kevin McConnell wrote:

> Could you separate out the db-related code into a little 'proxy' app, to
> run on the same machine as where the db files are, and have your clients
> connect to it?

--

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
A flower falls, even though we love it;
and a weed grows, even though we do not love it. --Dogen

===============================================================================