Ruby OOM in container

Hello,

We've recently encountered a problem with the Ruby runtime. I hope
someone is interested in understanding the situation and offering a
solution :slight_smile:

We use fluentd as a logging agent inside a Docker container with a
limited amount of memory, and some time ago we noticed that these
containers started to crash with OOMs. The system logs say the
container exceeded its memory limit and was killed by the oom-killer.
The amount of memory actually used and needed by fluentd is small, but
over time memory consumption grows beyond the cgroup limit.

After some investigation we narrowed the problem down to the following
line. If we can make this line work as we expect (without OOMs), our
task is complete :slight_smile:

docker run -ti -m 209715200 ruby:2.1 ruby -e 'while true do array =
[]; 3000000.times do array << "hey" end; puts array.length; end;'

After one iteration this process holds ~150MB of data. After several
iterations it is killed because of OOM. The funny thing is, it works
without OOMs on some systems, and we weren't able to deduce why; maybe
in the process of working out what to do with this problem you can
come up with an explanation.

Systems where it works (though slowly) without OOMs: Fedora 23/Docker
1.12, OS X El Capitan/Docker 1.12. Systems where it crashes after a
couple of seconds: Ubuntu 14.10/Docker 1.12 and 1.9, CentOS 7/Docker
1.12 and 1.9.

A couple of remarks. Since it's not our application, but rather OSS
with community plugins, we cannot drastically change the code, insert
manual GCs, and so on. Another thing: we use CRuby, and we cannot move
to JRuby because some of the gems used by fluentd are not compatible
with it. And we don't want to simply increase the memory limit,
because that doesn't seem like a solution, rather an attempt to avoid
finding one; throwing memory at the container won't prevent this from
happening again. Besides, 200 MB of memory for a log aggregator
already seems like a lot.

We've already tried changing some Ruby GC environment variables, but
so far nothing has worked. If you have a combination of these
parameters that makes the line above work consistently, that'd be
really great!
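For reference, here is a scaled-down version of the failing loop that prints GC statistics after each iteration (300k appends instead of 3M; the GC.stat key names below are the Ruby >= 2.2 spellings, and a couple of them differ slightly on the 2.1 image we use):

```ruby
# Scaled-down repro of the OOMing one-liner, printing GC.stat after
# each iteration to see whether live slots keep growing across
# iterations or the collector reclaims the previous array's strings.
3.times do |i|
  array = []
  300_000.times { array << "hey" }
  s = GC.stat
  puts "iter #{i}: len=#{array.length} " \
       "live_slots=#{s[:heap_live_slots]} gc_runs=#{s[:count]}"
end
```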

We would really appreciate some help in finding a root cause of these
OOMs and fixing it.

I'm ready to answer any questions you may have.

Thanks in advance,
Mik

P.S. We've asked this question on StackOverflow; if you're more
comfortable with it, I'd be happy to see your thoughts there:
http://stackoverflow.com/questions/40268749/ruby-oom-in-container

I'm pretty sure the OOM policy is controlled and set by the OS kernel. Figure out how the Ubuntu 14.10 and CentOS 7 machines differ from the Fedora 23 configuration.

···

On 27 Oct 2016, at 10:22, Mik Vyatskov wrote:

> Systems where it works (though slowly) without OOMs: Fedora 23/Docker
> 1.12, OS X El Capitan/Docker 1.12. Systems where it crashes after a
> couple of seconds: Ubuntu 14.10/Docker 1.12 and 1.9, CentOS 7/Docker
> 1.12 and 1.9.

I'm not a guru, but would freezing the string save you some memory?

(Replying from phone, so apologies for no link, sample, etc).

-James

···

On Thu, Oct 27, 2016, 07:22 Mik Vyatskov <vmik@google.com> wrote:

Unsubscribe: <mailto:ruby-talk-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-talk>

I think it could help in the test sample I attached, but our problem
is with fluentd, not with the quoted code per se.

We would love to find a general solution that does not require
changing the code.

···

On Thu, Oct 27, 2016 at 4:32 PM, James Pacheco <james.pacheco@gmail.com> wrote:

> I'm not a guru, but would freezing the string save you some memory?
>
> (Replying from phone, so apologies for no link, sample, etc).
>
> -James


> docker run -ti -m 209715200 ruby:2.1 ruby -e 'while true do array =
> []; 3000000.times do array << "hey" end; puts array.length; end;'

What happens if you put "array.clear" at the end of the loop?
That should free up the 3000000 slots in the array right away
once you know you're done with it.

If you're working with large strings/arrays/hashes, calling
.clear on them once you're done should correspond to calling
free(3) to release memory from the malloc implementation.

And changing "hey" to "hey".freeze will save object allocation
in Ruby 2.1 and later (as James suggested); but I guess that
won't affect real-world use.
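Scaled down, the two suggestions together look like this (.clear drops the slots immediately; the frozen literal makes every append share a single string object):

```ruby
array = []
300_000.times { array << "hey".freeze }  # literal.freeze is deduplicated
                                         # in Ruby 2.1+: one shared object
puts array.length                        # => 300000
puts array.first.equal?(array.last)      # => true: same frozen string
array.clear                              # release all slots right away,
                                         # instead of waiting for the GC
puts array.length                        # => 0
```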

> Systems where it works (though slowly) without OOMs: Fedora 23/Docker
> 1.12, OS X El Capitan/Docker 1.12. Systems where it crashes after a
> couple of seconds: Ubuntu 14.10/Docker 1.12 and 1.9, CentOS 7/Docker
> 1.12 and 1.9.

For the glibc systems, are they all running different versions
of glibc? There could be glibc malloc changes, or different
malloc implementations (such as jemalloc) which could also be in
play. I'd check the malloc implementation and version
differences, as well as vendor-specific patches/tweaks.

For performance reasons, malloc implementations do not release
memory back to the kernel often (or at all). That seems to
interact badly with GC implementations which delay and batch
memory release to malloc (also for performance reasons).

So, IMHO, having some degree of manual memory control (via
.clear) helps greatly. You will suffer a performance loss if
it's overused, but I'd rather have slower but consistent
performance and memory usage than unpredictable behavior (or
OOM).
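The Ruby-side half of that interaction is visible in GC.stat (the :heap_allocated_pages key assumes Ruby >= 2.2; whether malloc then hands anything back to the kernel is a separate question):

```ruby
# Grow a large array, then clear it and force a full GC, and compare
# how many heap pages CRuby has allocated before and after. Even with
# every object dead, the interpreter (and malloc underneath it) tends
# to keep most of that memory around rather than return it promptly.
array = []
300_000.times { array << "hey" }
before = GC.stat[:heap_allocated_pages]
array.clear
GC.start(full_mark: true, immediate_sweep: true)
after = GC.stat[:heap_allocated_pages]
puts "heap pages: #{before} before clear+GC, #{after} after"
```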

(For that reason, I still choose Perl5 in many projects, as I'd
rather design around circular references than
difficult-to-predict GC+malloc interactions and stalls)

> P.S. We've asked this question on StackOverflow; if you're more
> comfortable with it, I'd be happy to see your thoughts there:
> http://stackoverflow.com/questions/40268749/ruby-oom-in-container

I do not exist outside of email :>
Feel free to link/copy this response in other forums, though.

···

Mik Vyatskov <vmik@google.com> wrote:

> We use fluentd as logging agent inside a docker container with limited

Which version? msgpack v0.4 had fragmentation issues, but it's quite old
now.

> memory actually used and needed by the fluentd is small, but after
> some time memory consumption grows beyond cgroup limit.

> Couple of remarks. Since it's not our application, but rather some OSS
> with community plugins, we cannot drastically change the code, insert

Are these fluentd extensions or purely plugins for the OSS? Are you using
any fluentd plugins?

Paul Mak

···

Mik Vyatskov <vmik@google.com> wrote:

> but I guess that won't affect real-world use.

Yes, thanks for the advice, but we want to solve this problem without
changing the code.

> For the glibc systems, are they all running different versions
> of glibc? There could be glibc malloc changes, or different
> malloc implementations (such as jemalloc) which could also be in
> play. I'd check the malloc implementation and version
> differences, as well as vendor-specific patches/tweaks.
>
> For performance reasons, malloc implementations do not release
> memory back to the kernel often (or at all). That seems to
> interact badly with GC implementations which delay and batch
> memory release to malloc (also for performance reasons).
>
> So, IMHO, having some degree of manual memory control (via
> .clear) helps greatly. You will suffer a performance loss if
> it's overused, but I'd rather have slower but consistent
> performance and memory usage than unpredictable behavior (or
> OOM).

Unfortunately, the glibc version on all systems was the same, 2.19,
and we didn't use any specific malloc implementation in our tests, so
that doesn't explain the difference :frowning: But thanks for the
suggestion! The solution probably lies somewhere in this area.

···

On Thu, Oct 27, 2016 at 9:55 PM, Eric Wong <e@80x24.org> wrote:


> Which version? msgpack v0.4 had fragmentation issues, but it's quite old now.

Fluentd v0.12.29, which uses msgpack v0.5.11 and above. Thanks for the
suggestion though!

> Are these fluentd extensions or purely plugins for the OSS? Are you using any fluentd plugins?

Plugins for the OSS: elasticsearch and google-cloud, both of which
fail with OOM. We also use fluent-plugin-record-reformer for changing
tags and fluent-plugin-systemd for reading from journald.

···

On Fri, Oct 28, 2016 at 2:52 AM, Paul McKibbin <pmckibbin@gmail.com> wrote:
