Hello there -
I manage some servers that run Ruby on Rails on top of Apache 2.0 and
FastCGI. I've been running them for a little over a year now, and recently
upgraded OS/Ruby/FastCGI. Since then I have hit an unexpected problem that
I've been having trouble tracking down.
The servers have 8 cores and 8GB of memory, and run about a dozen Ruby
on Rails applications. Each application has a dedicated Apache
instance with dedicated FastCGI.
Original configuration (stable, despite some memory leaks):
Fedora Core 4 32-bit
Apache 2.0.54
mod_fastcgi 2.4.2
Ruby 1.8.4
- a wide assortment of approx 25 gems and other ruby add-ons
(I can provide a list if needed)
The applications are all in-house apps. I didn't write them, I
just manage the environment. In this configuration the servers
are rock solid, though the most active application leaks
memory like crazy and is auto-restarted once or twice a day.
New configuration:
CentOS 4.5 32-bit
Apache 2.0.52
mod_fastcgi 2.4.2
Ruby 1.8.5
- same wide assortment of approx 25 gems and other add-ons that
were recompiled against the above version of ruby.
The problem is that after random periods of time (3-20 hours) the
two busiest Ruby apps on the boxes (they account for 80% of the
load) start going into uninterruptible sleep ("D") for no
apparent reason. There is no I/O problem: the boxes run at
~20% CPU and have really fast local disk subsystems, and most of
what these apps do is talk to databases (on other systems). No
errors in the log. One of the apps does a lot of work over
NFS, but the NFS server is fine (it has been running straight for
the past 6 months without a glitch). I don't suspect the
infrastructure, since the problem only happens on the updated
software, and not on every server at the same time: one server
will go bad at one point, then maybe an hour or two later another
one will go. All 4 systems are load balanced and get an even
distribution of hits.
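For anyone who wants to check for the same symptom, here is a minimal sketch of how processes stuck in "D" could be listed together with the kernel function they are waiting in. It assumes a Linux /proc filesystem that exposes /proc/<pid>/stat and /proc/<pid>/wchan, and a current Python 3. The function names (parse_stat, d_state_processes) are made up for illustration and are not part of any tool mentioned above.

```python
import os


def parse_stat(text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def read_text(path):
    """Read a small /proc file, returning "" if the process has gone away."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return ""


def d_state_processes(proc_root="/proc"):
    """List (pid, command, wait channel) for processes in state "D"."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        text = read_text(os.path.join(proc_root, entry, "stat"))
        if not text:
            continue
        try:
            pid, comm, state = parse_stat(text)
        except (ValueError, IndexError):
            continue  # unexpected format, skip rather than crash
        if state == "D":
            wchan = read_text(os.path.join(proc_root, entry, "wchan"))
            found.append((pid, comm, wchan or "?"))
    return found


if __name__ == "__main__":
    for pid, comm, wchan in d_state_processes():
        print(pid, comm, wchan)
```

Run from cron or a loop while the problem is happening, the wait channel column may hint at whether the processes are blocked in NFS, disk, or something else entirely.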
One of the four systems hasn't had a problem in about 60
hours, since I fixed a mod_fastcgi issue: our automation
system (CFEngine) was pushing out a binary of mod_fastcgi
that was compiled against Fedora Core 4 to the CentOS systems.
Once I fixed that and pushed a mod_fastcgi compiled against
CentOS, that seemed to stabilize two of the systems (one has been
running fine ever since, the other ran for about 18 hours
before having an issue). Prior to that mod_fastcgi change
both systems were going out at least once every 6 hours.
So, two systems have the symptoms really badly, one is really
not bad at all, and one hasn't shown anything in more than
2 days. All are identical (logic suggests there must be
something different, but our installation processes are so
automated that there really is not much chance of
anything being different).
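One way to test the "all are identical" assumption is to checksum the same set of files on each box and compare the results. The sketch below is only an illustration under stated assumptions: it needs a current Python 3, the function names are invented, and the file paths in the usage note are hypothetical examples rather than the real layout of these servers.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Map each path to its checksum, or to "unreadable" if it can't be read."""
    manifest = {}
    for path in paths:
        try:
            manifest[path] = sha256_of(path)
        except OSError:
            manifest[path] = "unreadable"
    return manifest


def diff_manifests(first, second):
    """Return the sorted paths whose entries differ between two manifests."""
    every_path = set(first) | set(second)
    return sorted(p for p in every_path if first.get(p) != second.get(p))
```

For example, build_manifest(["/usr/lib/httpd/modules/mod_fastcgi.so", "/usr/bin/ruby"]) could be run on each host (both paths are guesses), the resulting dictionaries saved, and diff_manifests used to list any file that is not byte-for-byte the same on two hosts.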
I don't know if it's mod_fastcgi, maybe it's Ruby, or maybe
it's one of the modules. We've been running the same config
in a test environment for more than a month without the
slightest problem. But at the same time, we ran 64-bit Ruby
in the same test environment for more than a month and it
lasted less than an hour in production before we had to
roll back (the system performed a good 60-75% slower under
load than in 32-bit).
I've worked on this a good 6-8 hours over the past week,
trying to troubleshoot it and searching the net for anyone
else who might have had the same problem, with no luck.
In the meantime I've been playing with mod_fcgid. I got
it running, but some of the basic 'smoke tests' that run
against our app fail on a particular part of the app with
mod_fcgid but not with mod_fastcgi. I haven't had much time
to look at it in depth; Ruby just reports 'transaction
aborted' for some reason (even after raising timeouts in
mod_fcgid).
I'm sure this is a lot to take in, but I'm grasping at
straws here, hoping that maybe someone has some insight.
Of all platforms, I would have thought the newer one would
have been more solid than the older one.
While I've been managing the systems that run these Ruby
apps, I'm by no means a Ruby developer, though I have
been using Linux and stuff for about 12 years and
consider myself fairly adept at it. I haven't encountered
this type of situation before myself, nor has my other
very experienced system admin.
And before anyone suggests ditching mod_fastcgi, I've already
thought about it and we probably will at some point,
though until now it's worked surprisingly well (I've
read lots of complaints about setting it up; for
us it's a breeze). Worst case I can roll back to the
older OS and config in a couple of hours, but I really
want to ditch that old distribution.
I plan to post to the fastcgi mailing list as well, still
waiting to get subscribed to it ..
thanks in advance for any insights..
nate