Strange logger rotation failure

Hi all,

Ruby 1.8.2
Solaris 10

I have a couple Ruby scripts running via cron that connect to an Oracle database and execute some stored procedures - nothing fancy. On Saturday, July 2, 2005, the logger package spewed a strange error at me:

Your "cron" job on itwsp193910wss
log_cleanup_dev.rb

produced the following output:

/usr/local/lib/ruby/1.8/logger.rb:516:in `write': Shifting failed. 'log_cleanup_dev.log.20050630' already exists. (Logger::ShiftingError)
  from /usr/local/lib/ruby/1.8/logger.rb:318:in `add'
  from /usr/local/lib/ruby/1.8/logger.rb:399:in `fatal'
  from log_cleanup_dev.rb:68

First, a bit of background. There are two nearly identical scripts that run - one is log_cleanup.rb and the other is log_cleanup_dev.rb. The only difference between them is the database connect string (i.e. one connects to a production database, and the script in question connects to a dev database).

The first curious thing to me is that this error was raised on July *2nd* (it also happened on July 8th, btw). The second (less) curious thing is that this was only raised by the dev script. It's less curious, because the reason the script failed in both instances was because the DBA was doing some maintenance on the dev database on those mornings.

# Relevant snippet from log_cleanup_dev.rb (line 68 noted):

# Logger instance
log = Logger.new("log_cleanup_dev.log", "monthly")

# and later...
begin
    sth = dbh.prepare("begin my_proc; end;")
    sth.execute
rescue DBI::Error => e
    msg = "Error executing stored procedure 'my_proc': #{e.errstr}"
    log.fatal{ msg } # Line 68
    raise
rescue => e
    msg = "Unexpected error executing stored procedure 'my_proc': #{e}"
    log.fatal{ msg }
    raise
ensure
    sth.finish if sth
end

Any ideas? It looks to me like it's trying to rotate when it shouldn't or something, but I haven't dug too deeply into this yet.

Regards,

Dan

probably what happened is that

   first cron starts, then hangs since db is being maintained

   second cron job starts, connects, begins logging and rolls

   first cron jobs finally blows up/times out. raises error, attempts to roll
   log, and then fails to roll it since second already did.

i've seen things like this many, many times in our production system whe
things are cron'd. basically cron jobs can stack if one or more hangs and
code rarely deals with this gracfully. of course, this may not be your issue
but looking at timestamps in the log file in question, the mtime of the log
file itself, and the cron interval should give you a clue. if it's cron'd
with any frequency i'd put money on this.

the way i've worked around this for some of my stuff, like rq and dirwatch, is
to make the code run as a daemon. the daemon aquires a lockfile and refuses
to start if one is already running. then i simply attempt to start the daemon
via cron every fifteen minutes. that one only one instance ever runs at a
time, which greatly simplifies making robust systems. it's also nice in the
case of things which are supposed to run at, say, 7pm - but don't since the
system crashed at 6:59. the cron would simply restart the daemon when the
machine comes up and then whatever work gets done gets done at 7:01 or
whatever.

another thought is nfs. mtime cannot be trusted on nfs unless you sysads are
completely stellar and totally understand nfs setups wrt inode caching -
which, in my experience, is extremely rare.

for what it's worth i use the builtin logger in around 200 production
applications and have never seen an error - even on nfs - so i'm guessing
the logger code is o.k. then again i don't use monthly rolling and don't
exercise that code. i looked at the monthly code - it look fine to me
but...

hth and good luck.

-a

···

On Tue, 12 Jul 2005, Daniel Berger wrote:

Hi all,

Ruby 1.8.2
Solaris 10

I have a couple Ruby scripts running via cron that connect to an Oracle database and execute some stored procedures - nothing fancy. On Saturday, July 2, 2005, the logger package spewed a strange error at me:

Your "cron" job on itwsp193910wss
log_cleanup_dev.rb

produced the following output:

/usr/local/lib/ruby/1.8/logger.rb:516:in `write': Shifting failed. 'log_cleanup_dev.log.20050630' already exists. (Logger::ShiftingError)
  from /usr/local/lib/ruby/1.8/logger.rb:318:in `add'
  from /usr/local/lib/ruby/1.8/logger.rb:399:in `fatal'
  from log_cleanup_dev.rb:68

First, a bit of background. There are two nearly identical scripts that run - one is log_cleanup.rb and the other is log_cleanup_dev.rb. The only difference between them is the database connect string (i.e. one connects to a production database, and the script in question connects to a dev database).

The first curious thing to me is that this error was raised on July *2nd* (it also happened on July 8th, btw). The second (less) curious thing is that this was only raised by the dev script. It's less curious, because the reason the script failed in both instances was because the DBA was doing some maintenance on the dev database on those mornings.

# Relevant snippet from log_cleanup_dev.rb (line 68 noted):

# Logger instance
log = Logger.new("log_cleanup_dev.log", "monthly")

# and later...
begin
  sth = dbh.prepare("begin my_proc; end;")
  sth.execute
rescue DBI::Error => e
  msg = "Error executing stored procedure 'my_proc': #{e.errstr}"
  log.fatal{ msg } # Line 68
  raise
rescue => e
  msg = "Unexpected error executing stored procedure 'my_proc': #{e}"
  log.fatal{ msg }
  raise
ensure
  sth.finish if sth
end

Any ideas? It looks to me like it's trying to rotate when it shouldn't or something, but I haven't dug too deeply into this yet.

Regards,

Dan

--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
My religion is very simple. My religion is kindness.
--Tenzin Gyatso

===============================================================================