FCGI not responding to signals

Rails applications that use FCGI have been observing some strange behavior. I have a hypothesis regarding the cause, but I'd like some feedback as to whether it is a reasonable hypothesis, and any solutions/workarounds that people might have.

Sometimes (and some apps experience this more frequently than others) a FCGI process that is not currently handling a request will fail to respond to a signal (specifically USR1 or HUP) until a request is received. This is problematic when updating an application, because you typically want to gracefully terminate all existing FCGI processes and start up some new ones pointing at your updated code. But some (or many) of the processes don't respond until a request is received, meaning the user can get anything from a stale version of your app, to a 500 error, depending on how well-behaved (or ill-behaved) the FCGI process is.

Currently, Rails uses a "nudge" approach (EXTREMELY hacky) to handle this. When an application is restarted, you send _n_ requests to the application with the assumption that those requests will be sufficient to trigger the sleeping processes and let them gracefully terminate. The problem is, it doesn't work very well, especially in the case of Apache- or Lighttpd-managed FCGI processes. And even independently-managed FCGI processes will sometimes croak with this approach.

My hypothesis regarding the cause of the unresponsiveness is this (and please feel free to gently debunk it--I'm not ashamed to admit that I'm in somewhat over my head here): the processes in question are stuck on some IO-bound process (like listening on a socket), and Ruby is blocking until that finishes. This prevents Ruby from invoking the signal handler callback until the IO finishes. Sounds reasonable? If not, any other ideas what might be causing it?
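
If the hypothesis holds, one possible workaround is the self-pipe trick: the signal handler only writes a byte to a pipe, and the main loop waits on the listening socket and the pipe together, so a signal wakes it even when no request arrives. The sketch below is an illustration only. It assumes the accept loop can be written in Ruby on top of a plain TCPServer, and the class name InterruptibleAcceptor is made up for this example; it is not how the fcgi library itself is structured.

```ruby
require 'socket'

# Hypothetical helper, not part of the fcgi library.
class InterruptibleAcceptor
  def initialize(server)
    @server = server
    @wake_r, @wake_w = IO.pipe
  end

  # Meant to be called from a trap block: a single non-blocking write.
  def request_shutdown
    @wake_w.write_nonblock('.')
  rescue IO::WaitWritable, Errno::EPIPE
    # Pipe is full or closed, so a wake-up is already pending.
  end

  # Returns an accepted socket, or nil once a shutdown has been requested.
  def accept
    ready, = IO.select([@server, @wake_r])
    return nil if ready.include?(@wake_r)
    @server.accept
  end
end

# Usage sketch:
#   acceptor = InterruptibleAcceptor.new(TCPServer.new('127.0.0.1', 9000))
#   trap('USR1') { acceptor.request_shutdown }
#   while (sock = acceptor.accept)
#     # ... handle the request on sock ...
#     sock.close
#   end
```

Because the shutdown request is stored in the pipe rather than in a flag, it is not lost if the signal arrives just before the process starts waiting.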

And even more importantly, is there a sane way to work around (or better yet, _fix_) this problem? It's a rather nasty stumbling block to automated application deployment.

Thanks for any help,

Jamis

Rails applications that use FCGI have been observing some strange behavior.
I have a hypothesis regarding the cause, but I'd like some feedback as to
whether it is a reasonable hypothesis, and any solutions/workarounds that
people might have.

on which platforms?

Sometimes (and some apps experience this more frequently than others) a FCGI
process that is not currently handling a request will fail to respond to a
signal (specifically USR1 or HUP) until a request is received.

just to clarify - a fcgi process is __always__ handling a request. for
instance, if i run this code as a fcgi process:

   [ahoward@localhost html]$ cat ./env.fcgi
   #! /usr/local/bin/ruby
   require 'fcgi'
   loaded, pid = Time::now, Process::pid
   FCGI.each_cgi do |cgi|
     env = cgi.env_table.sort.map{|kv| kv.join " = "}.join " <br>\n"
     content = <<-html
       LOADED @ #{ loaded } <br>\n
       PID @ #{ pid } <br>\n
       <hr><hr>
       #{ env }
     html
     cgi.out{ content }
   end

   [ahoward@localhost html]$ links -dump http://localhost/env.fcgi |grep PID
      PID @ 12568

and then check that process

   [root@localhost ahoward]# strace -p 12568
   Process 12568 attached - interrupt to quit
   select(1, [0], NULL, NULL, NULL ...

i see it's waiting for a request and blocked in select to io multiplex.
checking os_unix.c in the fcgi lib source we see

   void OS_ShutdownPending()
   {
       shutdownPending = TRUE;
   }
   static void OS_Sigusr1Handler(int signo)
   {
       OS_ShutdownPending();
   }

   ...

   int OS_Accept(int listen_sock, int fail_on_intr, const char *webServerAddrs)
   {
       int socket = -1;
       union {
           struct sockaddr_un un;
           struct sockaddr_in in;
       } sa;

       for (;;) {
           if (AcquireLock(listen_sock, fail_on_intr))
               return -1;

           for (;;) {
               do {
   #ifdef HAVE_SOCKLEN
                   socklen_t len = sizeof(sa);
   #else
                   int len = sizeof(sa);
   #endif
                   if (shutdownPending) break;
                   /* There's a window here */

                   socket = accept(listen_sock, (struct sockaddr *)&sa, &len);
               } while (socket < 0
                        && errno == EINTR
                        && ! fail_on_intr
                        && ! shutdownPending);

   ...

so it seems that the signal handler sets a global flag which is checked at
appropriate times. we can send a signal to the process and see what happens:

   [root@localhost html]# kill -HUP 12568

and, back in our strace window we see:

   --- SIGHUP (Hangup) @ 0 (0) ---
   rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
   rt_sigaction(SIGINT, {SIG_DFL}, {0x80a4884, [], SA_RESTART}, 8) = 0
   exit_group(1) = ?
   Process 12568 detached

looks fine - so it does, in fact, receive and handle the signal asap. but
wait a minute.... it exited with 1 for failure. checking the apache logs we
see :

   [root@localhost ahoward]# tail -2 /var/log/httpd/error_log
   [Thu Sep 15 10:10:42 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" (pid 12568) terminated by calling exit with status '1'
   [Thu Sep 15 10:10:42 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" restarted (pid 12614)

seems __ok__. but let's do it a few times:

   [root@localhost html]# echo `links -dump http://localhost/env.fcgi |grep PID|sed 's/[^0-9]//g'`
   12614
   [root@localhost html]# kill -HUP `links -dump http://localhost/env.fcgi |grep PID|sed 's/[^0-9]//g'`

now we check the logs:

   [root@localhost ahoward]# tail -2 /var/log/httpd/error_log
   [Thu Sep 15 10:15:34 2005] [error] [client 127.0.0.1] FastCGI: incomplete headers (0 bytes) received from server "/var/www/html/env.fcgi"
   [Thu Sep 15 10:15:34 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env.fcgi" has failed to remain running for 30 seconds given 3 attempts, its restart interval has been backed off to 600 seconds

so now the bloody thing won't run for ten minutes! the apache process manager
is preventing rapid startup/shutdown by buggy fcgi processes and this makes sense
since thousands of them could hose a system.

but, let's assume we sometimes want to shutdown nicely and know what we are
doing. we run this:

   [ahoward@localhost html]$ cat env2.fcgi
   #! /usr/local/bin/ruby
   require 'fcgi'
   trap('USR2'){ exit 0 }
   loaded, pid = Time::now, Process::pid
   FCGI.each_cgi do |cgi|
     env = cgi.env_table.sort.map{|kv| kv.join " = "}.join " <br>\n"
     content = <<-html
       LOADED @ #{ loaded } <br>\n
       PID @ #{ pid } <br>\n
       <hr><hr>
       #{ env }
     html
     cgi.out{ content }
   end

   [ahoward@localhost html]$ lynx -dump http://localhost/env2.fcgi |grep PID
      PID @ 12690

note that this one exits, doing no cleanup, immediately with success if it gets
USR2. let's test it out:

   [root@localhost html]# kill -USR2 `links -dump http://localhost/env2.fcgi |grep PID|sed 's/[^0-9]//g'`

checking the log

   [root@localhost ahoward]# tail -2 /var/log/httpd/error_log
   [Thu Sep 15 10:40:06 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12865) terminated by calling exit with status '0'
   [Thu Sep 15 10:40:06 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" restarted (pid 12877)
   [Thu Sep 15 10:40:11 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12877) terminated by calling exit with status '0'
   [Thu Sep 15 10:40:11 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" restarted (pid 12883)
   [Thu Sep 15 10:40:15 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" (pid 12883) terminated by calling exit with status '0'
   [Thu Sep 15 10:40:15 2005] [warn] FastCGI: (dynamic) server "/var/www/html/env2.fcgi" has failed to remain running for 30 seconds given 3 attempts, its restart interval has been backed off to 600 seconds

so this is better - at least we got a few restarts out of it once by exiting
with zero - the process manager thought this was ok and just logged it.
however, restarting too rapidly caused us to be backed off into oblivion.
there are config options to control this, but consider setting them to NOT
back off - a typo in a script would cause a loop in the webserver where it just
tried over and over to restart the app. a bunch of these could easily bring a
system to its knees. so i'm thinking that 'fixing' this problem would create
a far worse one with system crashing implications.

so i'm not sure what to do, but adding a signal handler that exits with success
may be a start in the right direction. this would allow nice restarts so long
as you didn't do them too quickly. if you are doing them too quickly you
really shouldn't be hitting the fcgi page anyhow so maybe this is good enough.
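
A minimal sketch of that idea, extended so that a signal arriving in the middle of a request does not cut the response short. The GracefulExit class is hypothetical and not part of the fcgi library. It also assumes ruby actually gets to run the trap block while the process is idle, which is the very thing in question in this thread.

```ruby
# Hypothetical helper: exit at once when idle, otherwise remember the
# signal and exit with status 0 after the current request is finished.
class GracefulExit
  def initialize
    @busy = false
    @exit_pending = false
  end

  # Meant to be called from a trap block.
  def request_exit
    if @busy
      @exit_pending = true
    else
      exit 0
    end
  end

  # Wrap the handling of a single request in this method.
  def handling
    @busy = true
    yield
  ensure
    @busy = false
    exit 0 if @exit_pending
  end
end

# Usage sketch (assumes the fcgi library used in env2.fcgi above):
#   guard = GracefulExit.new
#   trap('USR2') { guard.request_exit }
#   FCGI.each_cgi { |cgi| guard.handling { cgi.out { 'hello' } } }
```

Exiting with status 0 in both cases keeps the behaviour seen in the log above, where the process manager logged the exit and restarted the app.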

so... all that is totally nix/apache specific and i'd imagine none of it would
work in windows. but maybe it's a start ;-)

please let me know if you end up learning more - i'll apply anything i find to
my acgi package since all the same things apply there.

cheers.

-a

On Thu, 15 Sep 2005, Jamis Buck wrote:
--

email :: ara [dot] t [dot] howard [at] noaa [dot] gov
phone :: 303.497.6469
Your life dwells among the causes of death
Like a lamp standing in a strong breeze. --Nagarjuna

===============================================================================

<snip trouble restarting fcgi apps>

how about a completely different approach to restarting - restart without
exiting using the 'exec' system call. this won't return an exit status to the
fcgi pm and, i think, therefore won't cause trouble:

     [ahoward@localhost html]$ cat reloadable.fcgi
     #! /usr/local/bin/ruby
     require 'fcgi'
     loaded, pid = Time::now, Process::pid

     FCGI.each_cgi do |cgi|
       env = cgi.env_table.sort.map{|kv| kv.join " = "}.join " <br>\n"
       content = <<-html
         command_line : #{ $command_line } <br>
         loaded : #{ loaded } <br>
         pid : #{ pid } <br>
         <hr><hr>
         #{ env }
       html
       cgi.out{ content }
     end

     BEGIN {
       require 'rbconfig'
       $config = ::RbConfig::CONFIG  # RbConfig replaces the Config constant, which newer rubies removed
       $ruby = File::join($config['bindir'], $config['ruby_install_name']) + $config['EXEEXT']
       $this = $0
       $command_line = [$ruby, $this, ARGV].flatten.join(' ')
       trap('USR2'){ exec $command_line }
     }

     [ahoward@localhost html]$ lynx -dump http://localhost/reloadable.fcgi |egrep 'loaded|pid'
        loaded : Thu Sep 15 12:46:36 MDT 2005
        pid : 16018

     [ahoward@localhost html]$ lynx -dump http://localhost/reloadable.fcgi |egrep 'loaded|pid'
        loaded : Thu Sep 15 12:46:36 MDT 2005
        pid : 16018

so we are running in fastcgi mode, the process has been loaded only once. force a restart:

     [ahoward@localhost html]$ sudo kill -USR2 16018

     [ahoward@localhost html]$ lynx -dump http://localhost/reloadable.fcgi |egrep 'loaded|pid'
        loaded : Thu Sep 15 12:47:33 MDT 2005
        pid : 16018

and it works!

     [ahoward@localhost html]$ lynx -dump http://localhost/reloadable.fcgi |egrep 'loaded|pid'
        loaded : Thu Sep 15 12:47:33 MDT 2005
        pid : 16018

and sticks.

     [ahoward@localhost html]$ sudo kill -USR2 16018

     [ahoward@localhost html]$ lynx -dump http://localhost/reloadable.fcgi |egrep 'loaded|pid'
        loaded : Thu Sep 15 12:47:43 MDT 2005
        pid : 16018

     [ahoward@localhost html]$ lynx -dump http://localhost/reloadable.fcgi |egrep 'loaded|pid'
        loaded : Thu Sep 15 12:47:43 MDT 2005
        pid : 16018

and works again.

checking the log

     [ahoward@localhost html]$ sudo tail -3 /var/log/httpd/error_log
     [Thu Sep 15 12:47:25 2005] [warn] (32)Broken pipe: FastCGI: write() to PM failed (ignore if a restart or shutdown is pending)
     [Thu Sep 15 12:47:40 2005] [warn] (32)Broken pipe: FastCGI: write() to PM failed (ignore if a restart or shutdown is pending)
     [Thu Sep 15 12:47:49 2005] [warn] (32)Broken pipe: FastCGI: write() to PM failed (ignore if a restart or shutdown is pending)

so i guess i'll ignore it.
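
One caveat with the script above: joining the command line into a single string means exec may hand it to the shell, so a script path or argument containing spaces would be split. A hedged variation is to keep the pieces as an array and pass them to exec separately. The helper name reexec_argv is made up for this sketch.

```ruby
require 'rbconfig'

# Hypothetical helper: build the argument list for re-executing a script
# with the interpreter that is currently running it.
def reexec_argv(script, args)
  [RbConfig.ruby, script, *args]
end

# Usage sketch: capture the arguments at startup, before anything
# modifies ARGV, then exec without going through the shell.
#   RESTART_ARGV = reexec_argv($0, ARGV.dup)
#   trap('USR2') { exec(*RESTART_ARGV) }
```

With more than one argument, exec runs the program directly, so each element reaches the new process unchanged.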

-a

On Thu, 15 Sep 2005, Jamis Buck wrote:

===============================================================================

Hi,

  In mail "FCGI not responding to signals"

Sometimes (and some apps experience this more frequently than others)
a FCGI process that is not currently handling a request will fail to
respond to a signal (specifically USR1 or HUP) until a request is
received. This is problematic when updating an application, because

And even more importantly, is there a sane way to work around (or
better yet, _fix_) this problem? It's a rather nasty stumbling block
to automated application deployment.

If you are using the pure Ruby fcgi.rb, try the latest CVS version.
I have been running a FastCGI process for some months, and I have
not experienced any signal problems with the latest version. You can
get it from my repository:

  $ cvs -d :pserver:anonymous@cvs.loveruby.net:/src co bitchannel

fcgi.rb is in lib/.

Regards,
Minero Aoki

Jamis Buck <jamis@37signals.com> wrote: