Rails applications that use FCGI have been exhibiting some strange behavior.
I have a hypothesis regarding the cause, but I'd like some feedback as to
whether it is a reasonable hypothesis, and any solutions/workarounds that
people might have.
on which platforms?
Sometimes (and some apps experience this more frequently than others) an FCGI
process that is not currently handling a request will fail to respond to a
signal (specifically USR1 or HUP) until a request is received.
just to clarify - a fcgi process is __always__ handling a request. for
instance, if i run this code as a fcgi process:
[ahoward@localhost html]$ cat ./env.fcgi
#! /usr/local/bin/ruby
require 'fcgi'
loaded, pid = Time::now, Process::pid
FCGI.each_cgi do |cgi|
env = cgi.env_table.sort.map{|kv| kv.join " = "}.join " <br>\n"
content = <<-html
LOADED @ #{ loaded } <br>\n
PID @ #{ pid } <br>\n
<hr><hr>
#{ env }
html
cgi.out{ content }
end
[ahoward@localhost html]$ links -dump http://localhost/env.fcgi |grep PID
PID @ 12568
and then check that process
[root@localhost ahoward]# strace -p 12568
Process 12568 attached - interrupt to quit
select(1, [0], NULL, NULL, NULL ...
i see it's waiting for a request, blocked in select to multiplex io.
checking os_unix.c in the fcgi lib source we see
void OS_ShutdownPending()
{
    shutdownPending = TRUE;
}
static void OS_Sigusr1Handler(int signo)
{
    OS_ShutdownPending();
}
...
int OS_Accept(int listen_sock, int fail_on_intr, const char *webServerAddrs)
{
    int socket = -1;
    union {
        struct sockaddr_un un;
        struct sockaddr_in in;
    } sa;
    for (;;) {
        if (AcquireLock(listen_sock, fail_on_intr))
            return -1;
        for (;;) {
            do {
#ifdef HAVE_SOCKLEN
                socklen_t len = sizeof(sa);
#else
                int len = sizeof(sa);
#endif
                if (shutdownPending) break;
                /* There's a window here */
                socket = accept(listen_sock, (struct sockaddr *)&sa, &len);
            } while (socket < 0
                     && errno == EINTR
                     && ! fail_on_intr
                     && ! shutdownPending);
...
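for comparison, here is roughly the same shape in plain ruby: the handler only
records that a shutdown was requested, and the main loop checks the flag between
units of work. this is a minimal sketch, not the fcgi library's code - the loop
body is a stand-in for "accept and serve one request", and the process signals
itself only so the example is self-contained.

```ruby
# the handler does nothing but set a flag; the loop decides when to act on it
shutdown_pending = false
trap('USR1') { shutdown_pending = true }

handled = 0
until shutdown_pending
  # stand-in for "accept and serve one request" (hypothetical workload)
  handled += 1
  # simulate the signal arriving while the third request is in flight
  Process.kill('USR1', Process.pid) if handled == 3
  sleep 0.01 # gives the interpreter a chance to run the handler
end

puts "served #{handled} requests before shutting down"
```

the point of the pattern is that a request in flight is never cut off half way;
the cost is that the flag is only noticed when the loop comes back around to
check it.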
so it seems that the signal handler sets a global flag which is checked at
appropriate times. we can send a signal to the process and see what happens:
[root@localhost html]# kill -HUP 12568
and, back in our strace window we see:
--- SIGHUP (Hangup) @ 0 (0) ---
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigaction(SIGINT, {SIG_DFL}, {0x80a4884, [], SA_RESTART}, 8) = 0
exit_group(1) = ?
Process 12568 detached
looks fine - so it does, in fact, receive and handle the signal asap. but
wait a minute.... it exited with 1 for failure. checking the apache logs we
see :
[root@localhost ahoward]# tail -2 /var/log/httpd/error_log
[Thu Sep 15 10:10:42 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" (pid 12568) terminated by calling exit with status '1'
[Thu Sep 15 10:10:42 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" restarted (pid 12614)
seems __ok__. but let's do it a few times:
[root@localhost html]# echo `links -dump http://localhost/env.fcgi |grep PID|sed 's/[^0-9]//g'`
12614
[root@localhost html]# kill -HUP `links -dump http://localhost/env.fcgi |grep PID|sed 's/[^0-9]//g'`
now we check the logs:
[root@localhost ahoward]# tail -2 /var/log/httpd/error_log
[Thu Sep 15 10:15:34 2005] [error] [client 127.0.0.1] FastCGI: incomplete headers (0 bytes) received from server "/var/www/html/env.fcgi"
[Thu Sep 15 10:15:34 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" has failed to remain running for 30 seconds given 3 attempts, its restart interval has been backed off to 600 seconds
so now the bloody thing won't run for ten minutes! the apache process manager
is preventing rapid startup/shutdown by buggy fcgi processes and this makes
sense since thousands of them could hose a system.
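to make the log line above concrete, here is a toy model of that policy. the
numbers (30 seconds, 3 attempts, 600 seconds) are taken from the log; everything
else - the class name, the method name, the lifetimes fed in, and the
simplification that a normal restart happens with no delay - is an assumption
for illustration and not the process manager's real implementation.

```ruby
# toy model of a restart policy; names and structure are hypothetical
class RestartPolicy
  MIN_LIFE     = 30  # seconds a process must survive to count as healthy
  MAX_ATTEMPTS = 3   # tries allowed to reach MIN_LIFE
  BACKOFF      = 600 # restart interval once the tries are used up

  attr_reader :failures

  def initialize
    @failures = 0
  end

  # record one process lifetime (in seconds) and return how long the manager
  # would wait before starting the replacement
  def restart_delay_after(lifetime)
    if lifetime >= MIN_LIFE
      @failures = 0
      0
    else
      @failures += 1
      @failures >= MAX_ATTEMPTS ? BACKOFF : 0
    end
  end
end

policy = RestartPolicy.new
# three short-lived processes in a row (made-up lifetimes, in seconds)
delays = [5, 4, 4].map { |lifetime| policy.restart_delay_after(lifetime) }
p delays
```

in this model the first two early exits are restarted right away and the third
one trips the 600 second backoff.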
but, let's assume we sometimes want to shutdown nicely and know what we are
doing. we run this:
[ahoward@localhost html]$ cat env2.fcgi
#! /usr/local/bin/ruby
require 'fcgi'
trap('USR2'){ exit 0 }
loaded, pid = Time::now, Process::pid
FCGI.each_cgi do |cgi|
env = cgi.env_table.sort.map{|kv| kv.join " = "}.join " <br>\n"
content = <<-html
LOADED @ #{ loaded } <br>\n
PID @ #{ pid } <br>\n
<hr><hr>
#{ env }
html
cgi.out{ content }
end
[ahoward@localhost html]$ lynx -dump http://localhost/env2.fcgi |grep PID
PID @ 12690
note that this one exits immediately with success, doing no cleanup, if it gets
USR2. let's test it out:
[root@localhost html]# kill -USR2 `links -dump http://localhost/env2.fcgi |grep PID|sed 's/[^0-9]//g'`
checking the log
[root@localhost ahoward]# tail -2 /var/log/httpd/error_log
[Thu Sep 15 10:40:06 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12865) terminated by calling exit with status '0'
[Thu Sep 15 10:40:06 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" restarted (pid 12877)
[Thu Sep 15 10:40:11 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12877) terminated by calling exit with status '0'
[Thu Sep 15 10:40:11 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" restarted (pid 12883)
[Thu Sep 15 10:40:15 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12883) terminated by calling exit with status '0'
[Thu Sep 15 10:40:15 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" has failed to remain running for 30 seconds given 3 attempts, its restart interval has been backed off to 600 seconds
so this is better - by exiting with zero we at least got a few restarts out of
it - the process manager thought this was ok and just logged it. however,
restarting too rapidly still caused us to be backed off into oblivion. there
are config options to control this, but consider setting them to NOT back off -
a typo in a script would cause a loop in the webserver where it just tried over
and over to restart the app. a bunch of these could easily bring a system to
its knees. so i'm thinking that 'fixing' this problem would create a far worse
one with system crashing implications.
so i'm not sure what to do, but adding a signal handler that exits with success
may be a start in the right direction. this would allow nice restarts so long
as you didn't do them too quickly. if you are doing them too quickly you
really shouldn't be hitting the fcgi page anyhow so maybe this is good enough.
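a quick way to check that such a handler really produces exit status 0, without
involving apache at all, is to run a child with the same trap and signal it.
this is a sketch under assumptions: the child is a plain ruby process rather
than an fcgi one, and the 'ready' handshake exists only so the signal is not
sent before the handler is installed.

```ruby
require 'rbconfig'

# child: same trap as env2.fcgi, minus the fcgi loop. it installs the handler,
# reports that it is ready, then idles until a signal arrives.
child_script = <<-'RUBY'
  trap('USR2') { exit 0 }
  $stdout.sync = true
  puts 'ready'
  sleep
RUBY

io = IO.popen([RbConfig.ruby, '-e', child_script])
io.gets                      # block until the child says the trap is in place
Process.kill('USR2', io.pid) # same signal the kill -USR2 above sends
io.close                     # waits for the child and sets $?
status = $?

puts "exited=#{status.exited?} status=#{status.exitstatus}"
```

a process that exits this way shows up to its parent as a normal exit with
status 0, which matches the "terminated by calling exit with status '0'" lines
in the log.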
so... all that is totally *nix/apache specific and i'd imagine none of it would
work in windows. but maybe it's a start.
please let me know if you end up learning more - i'll apply anything i find to
my acgi package since all the same things apply there.
cheers.
-a
On Thu, 15 Sep 2005, Jamis Buck wrote:
--
email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells among the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna
===============================================================================