Hello,
We've recently encountered a problem with the Ruby runtime. I hope
someone will be interested in understanding the situation and offering
a solution.
We use fluentd as a logging agent inside a Docker container with a
limited amount of memory, and some time ago we noticed that these
containers started to crash with OOM. The system logs say that the
container exceeded its memory limit and was killed by the oom-killer.
The amount of memory actually used and needed by fluentd is small, but
after some time memory consumption grows beyond the cgroup limit.
After some time investigating, we narrowed our problem down to the
following line. If we can make this line work as we expect (without
OOMs), then our task is complete:
docker run -ti -m 209715200 ruby:2.1 ruby -e 'while true do array =
[]; 3000000.times do array << "hey" end; puts array.length; end;'
After one iteration this process contains ~150MB of data. After
several iterations it's killed because of OOM. The funny thing is that
it works without OOMs on some systems. We weren't able to deduce why;
maybe in the process of working out what to do about this problem you
can come up with an explanation.
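
For anyone who wants to watch the growth rather than just wait for the
kill, here is a rough sketch of an instrumented variant of the
one-liner above. It is only an illustration, not something we claim
fixes anything. The helper names (rss_kb, run_iteration) and the
ITERATIONS and SIZE environment variables are made up for this
example, and reading VmRSS from /proc/self/status assumes Linux.

```ruby
# Instrumented variant of the one-liner (a sketch, not a fix).
# rss_kb is a hypothetical helper: it reads VmRSS from /proc/self/status,
# which only exists on Linux, and returns nil when it is unavailable.
def rss_kb
  status = File.read("/proc/self/status")
  value = status[/^VmRSS:\s+(\d+)\s+kB/, 1]
  value && value.to_i
rescue SystemCallError
  nil
end

# One pass of the original loop body: build an array of short strings,
# then report its length together with the GC run count and resident memory.
def run_iteration(size)
  array = []
  size.times { array << "hey" }
  { length: array.length, gc_count: GC.count, rss_kb: rss_kb }
end

# ITERATIONS and SIZE are assumptions for illustration; the original
# one-liner loops forever with 3_000_000 elements per pass.
iterations = Integer(ENV.fetch("ITERATIONS", "3"))
size = Integer(ENV.fetch("SIZE", "3000000"))

iterations.times do |i|
  stats = run_iteration(size)
  puts "iter=#{i} length=#{stats[:length]} " \
       "gc_count=#{stats[:gc_count]} rss_kb=#{stats[:rss_kb].inspect}"
end
```

If rss_kb keeps climbing from pass to pass while gc_count barely
moves, that would at least suggest the collector is not being
triggered often enough before the cgroup limit is reached.
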
Systems where it works (though slowly) without OOMs: Fedora 23/Docker
1.12, OS X El Capitan/Docker 1.12. Systems where it crashes after a
couple of seconds: Ubuntu 14.10/Docker 1.12 and 1.9, CentOS 7/Docker
1.12 and 1.9.
A couple of remarks. Since it's not our application, but rather an OSS
project with community plugins, we cannot drastically change the code,
insert manual GCs and so on. Another thing is that we use CRuby and
cannot move to JRuby, because some of the gems used by fluentd are not
compatible with it. And we don't want to just increase the memory
limit, because that doesn't seem like a solution, rather an attempt
to avoid finding one. Just throwing memory at the container
won't prevent this from happening again. Additionally, 200 MB of memory
for a log aggregator already seems like a lot.
We've already tried changing some Ruby GC environment variables, but
so far nothing has worked. Though if you have some combination of these
parameters that makes the line above consistently work, that'd be
really great!
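
When trying combinations, it may help to first confirm that the
variables actually reach the Ruby process inside the container. The
sketch below just lists whatever RUBY_GC_* variables the process sees;
the helper name gc_env_settings is made up for this example, and the
script does not suggest any particular values.

```ruby
# Lists the RUBY_GC_* environment variables visible to this process.
# gc_env_settings is a hypothetical helper; it takes any hash-like
# environment so it can be checked without touching the real ENV.
def gc_env_settings(env = ENV)
  env.select { |name, _value| name.start_with?("RUBY_GC_") }
end

settings = gc_env_settings
if settings.empty?
  puts "no RUBY_GC_* variables are set for this process"
else
  settings.sort.each { |name, value| puts "#{name}=#{value}" }
end

# GC.count is the number of GC runs so far in this process.
puts "GC.count=#{GC.count}"
```

With docker run, the variables have to be passed explicitly with the
-e option, otherwise the process in the container never sees them.
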
We would really appreciate some help in finding a root cause of these
OOMs and fixing it.
I'm ready to answer any questions you may have.
Thanks in advance,
Mik
P.S. We've also asked a question on StackOverflow; if you're more
comfortable with it, I would be happy to see your thoughts there: