Debugging intermittent FastCGI failure

I've made a FastCGI application which, on the face of it, works well:
it's fast, responsive, and reasonably light on the processors.

However, at seemingly random intervals (usually a few hours apart), it
just stops responding; a (graceful) restart of Apache can bring it
back to life.

Software versions are:
Fedora Core 4 i386
Apache 2.0.54
Ruby 1.8.2 or 1.8.3 (both exhibit the same problem)
ruby-fcgi-0.8.6
FastCGI 2.4.0

Before a failure occurs, the httpd error log contains a lot of messages such as:
FastCGI: incomplete headers (0 bytes) received from server "/.../foo.fcgi", referer: ...
FastCGI: comm with (dynamic) server "/.../foo.fcgi" aborted: (first read) idle timeout (60 sec), referer: ...

After a failure, many stale fcgi handler scripts remain in the process
table. They can't be killed without using SIGKILL.

I'd be very grateful for any pointers in debugging this: things I
should look at, and things I could try. Combinations of FastCGI
parameters that work for you would be really helpful, too. The
application handles around 80,000 requests per day, and the hardware
really should be adequate (Quad Xeon 2.4 GHz).

Any useful suggestions will be gladly received!

Paul.

when the processes begin to hang, attach to them using strace/pstrace. this
ought to give you a good indication of the problem. if that fails, configure
an fcgi external handler (one you start yourself) and run it using something
like

   ~ > screen
   ~ > gdb --args `which ruby` dispatch.fcgi

in other words, run it under screen (so you can detach and come back in a few
days) and then run your fastcgi process under ruby in the debugger. next you'd
want to set a breakpoint and set up a command-line loop that did something like

   while true; do curl http://localhost/mypage/dispatch.fcgi; done

so hammer the things with requests in a loop - but have a breakpoint set once
per loop so you can stop and look at things.
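
a bounded variant of that loop can be sketched as below. this is only a rough
sketch: `hammer` is a made-up helper name, and the curl flags in the usage
comment (-s quiet, -f fail on http errors, --max-time to give up on a hung
request) are my choice, not something your setup requires.

```shell
# hammer COUNT CMD [ARGS...]
# hypothetical helper: runs CMD COUNT times and reports how many runs
# exited non-zero (for curl -f --max-time, that covers errors and hangs).
hammer() {
    count=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "failures: $failures of $count"
}

# usage (url taken from the loop above):
#   hammer 1000 curl -s -f --max-time 10 http://localhost/mypage/dispatch.fcgi
```

a non-zero failure count tells you roughly when to go and look at the
debugger session.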

another option would just be to make the process dump core every now and again
and examine the core dump.
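
one way to get those dumps without killing anything is gcore, which ships
with gdb and writes PREFIX.PID for a running process. a minimal sketch,
assuming gcore is installed; `core_cmds` is a made-up helper that only prints
the commands so you can check them before piping them to sh.

```shell
# core_cmds OUTDIR PID...
# hypothetical helper: prints one gcore command per PID (a dry run).
# gcore -o PREFIX PID writes the dump to PREFIX.PID.
core_cmds() {
    outdir=$1
    shift
    for pid in "$@"; do
        echo "gcore -o $outdir/core $pid"
    done
}

# usage (assumes the handlers show up as foo.fcgi in the process table):
#   core_cmds /tmp $(pgrep -f foo.fcgi) | sh
#   gdb `which ruby` /tmp/core.<pid>    # then 'bt' for a backtrace
```

whether gcore can attach to the stale handlers is something you'd have to
try; a process stuck in uninterruptible sleep may not cooperate.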

my experience with fastcgi has led me to say that, if you installed any of the
above using packages, you might want to compile by hand.

hth.

regards.

-a

On Fri, 16 Dec 2005, Paul Battley wrote:

--

ara [dot] t [dot] howard [at] noaa [dot] gov
all happiness comes from the desire for others to be happy. all misery
comes from the desire for oneself to be happy.
-- bodhicaryavatara

===============================================================================

when the processes begin to hang, attach to them using strace/pstrace. this
ought to give you a good indication of the problem. if that fails, configure
an fcgi external handler (one you start yourself) and run it using something
like

Thanks for the ideas. strace seems very helpful, and I'm trying it
out now, trying to catch a failure in progress. Unfortunately, the
problem is anything but repeatable: I've fed the app tens of thousands
of requests (both in one thread and split over several to increase the
transient load) in a short period of time without bringing it down. I
just have to wait!
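
Rather than watching by hand, the waiting could probably be automated. A
minimal sketch, assuming a POSIX shell: `capture_on_hang` is a hypothetical
helper that runs a probe command and, only if the probe fails, runs a capture
command. The curl, pgrep, timeout and strace invocations in the usage comment
are illustrative assumptions, not a tested recipe for this server.

```shell
# capture_on_hang PROBE CAPTURE
# Hypothetical helper. PROBE and CAPTURE are shell command strings.
# Returns 0 if PROBE succeeded; otherwise runs CAPTURE and returns 1.
capture_on_hang() {
    probe=$1
    capture=$2
    if sh -c "$probe" >/dev/null 2>&1; then
        return 0
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') probe failed, capturing" >&2
    sh -c "$capture"
    return 1
}

# Usage: poll once a minute and stop after the first failed probe,
# leaving ten seconds of strace output per handler in /tmp.
#   while capture_on_hang \
#       'curl -s -f --max-time 30 http://localhost/mypage/dispatch.fcgi' \
#       'for p in $(pgrep -f foo.fcgi); do
#            timeout 10 strace -tt -p "$p" -o "/tmp/strace.$p"
#        done'
#   do sleep 60; done
```

The loop exits on the first failure, so the traces reflect the state of the
handlers close to the moment the application stopped responding.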

It looks like I'll need to make quite a few changes to run the code
externally, but it ought to be informative if I can do it.

Thanks again,
Paul.