[PATCH] mwrap 2.0.0 mwrap - LD_PRELOAD malloc wrapper for Ruby


(Eric Wong) #1

mwrap is designed to answer the question:

   Which lines of Ruby are hitting malloc the most?

mwrap wraps all malloc-family calls to trace the Ruby source
location of such calls and bytes allocated at each callsite.
As of mwrap 2.0.0, it can also function as a leak detector
and show live allocations at every call site. Depending on
your application and workload, the overhead is roughly a 50%
increase in memory and runtime.

It works best for allocations under GVL, but tries to track
numeric caller addresses for allocations made without GVL so you
can get an idea of how much memory usage certain extensions and
native libraries use.

It requires the concurrent lock-free hash table from the
Userspace RCU project: https://liburcu.org/

It does not require recompiling or rebuilding Ruby, but only
supports Ruby trunk (2.6.0dev+) on a few platforms:

* GNU/Linux
* FreeBSD (tested 11.1)

It may work on NetBSD, OpenBSD and DragonFly BSD.

Changes in 2.0.0:

This release includes significant changes to track live
allocations and frees. It can find memory leaks from malloc
with less overhead than valgrind's leak checker, and there is a
new Rack endpoint (MwrapRack) which can display live allocation
stats.

API additions:

* Mwrap#[] - https://80x24.org/mwrap/Mwrap.html#method-c-5B-5D
* Mwrap::SourceLocation - https://80x24.org/mwrap/Mwrap/SourceLocation.html
* MwrapRack - https://80x24.org/mwrap/MwrapRack.html
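
A rough sketch of how the per-location API might be used to spot a call site that keeps memory alive. It relies only on the `allocations` and `frees` accessors of Mwrap::SourceLocation; the helper name `outstanding` and the location string "app.rb:42" are made up for illustration, and the `Mwrap[...]` lookup is assumed to return nil for an unknown location. The lookup only runs when the process is actually started under mwrap.

```ruby
# Hypothetical helper: number of allocations made at a call site that
# have not been freed yet. Works with any object responding to
# #allocations and #frees, e.g. a Mwrap::SourceLocation.
def outstanding(loc)
  loc.allocations - loc.frees
end

# Only meaningful when the process runs under mwrap.
if defined?(Mwrap) && Mwrap.respond_to?(:[])
  # "app.rb:42" is a placeholder; use a location reported by Mwrap.dump.
  loc = Mwrap["app.rb:42"]
  puts "outstanding allocations: #{outstanding(loc)}" if loc
end
```

A count that keeps growing across requests at the same location is a hint, not proof, that the site leaks.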

Incompatible changes:

* Mwrap.clear is now an alias to Mwrap.reset, as it's unsafe
  to implement the new Mwrap#[] API otherwise:
  https://80x24.org/mwrap-public/20180716211933.5835-12-e@80x24.org/

26 changes since v1.0.0:

      README: improve usage example
      MANIFEST: add .document
      add benchmark
      use __attribute__((weak)) instead of dlsym
      Mwrap.dump: do not segfault on invalid IO arg
      bin/mwrap: support LISTEN_FDS env from systemd
      support per-allocation headers for per-alloc tracking
      mwrap: use malloc to do our own memalign
      hold RCU read lock to insert each allocation
      realloc: do not copy if allocation failed
      internal_memalign: do not assume real_malloc succeeds
      ensure ENOMEM is preserved in errno when appropriate
      memalign: check alignment on all public functions
      reduce stack usage from file names
      resolve real_malloc earlier for C++ programs
      allow analyzing live allocations via Mwrap[location]
      alias Mwrap.clear to Mwrap.reset
      implement accessors for SourceLocation
      mwrap_aref: quiet -Wshorten-64-to-32 warning
      fixes for FreeBSD 11.1...
      use memrchr to extract address under glibc
      do not track allocations for constructor and Init_
      disable memalign tracking by default
      support Mwrap.quiet to temporarily disable allocation tracking
      mwrap_rack: Rack app to track live allocations
      documentation updates for 2.0.0 release


(Eric Wong) #2

Oops, forgot to include links, and title is [ANN] :x

Mailing list:

  https://80x24.org/mwrap-public/
  nntp://80x24.org/inbox.comp.lang.ruby.mwrap
  mailto:mwrap-public@80x24.org (no HTML mail, please)

git clone https://80x24.org/mwrap.git

homepage + rdoc: https://80x24.org/mwrap/


#3

I am using mwrap to debug a little leak at the moment, one feature
request I do have though is a tally of totals.

It would be nice if it could keep track of total allocated and total
released. That way if my RSS is bloating I can tell if it is due to
fragmentation or if it is due to a genuine leak really quick.


#4

Just to clarify here, I mean 2 single global totals, not a per row
kind of thing.

On Thu, Jul 26, 2018 at 11:36 AM, Sam Saffron <sam.saffron@gmail.com> wrote:

I am using mwrap to debug a little leak at the moment, one feature
request I do have though is a tally of totals.

It would be nice if it could keep track of total allocated and total
released. That way if my RSS is bloating I can tell if it is due to
fragmentation or if it is due to a genuine leak really quick.


(Eric Wong) #5

Something like the patch below? (Barely tested)

Since mwrap doesn't track its own memory usage, this might be
useful if you have a lot of cold code paths doing allocations,
since RSS might not stabilize quickly in that case.

Also, if there's a leaker using a malloc wrapper like Ruby's
xmalloc (e.g. https://bugs.ruby-lang.org/issues/14929 ), mwrap
won't make it easy to track down, since it can only safely see
one level up the call stack (using GCC's __builtin_return_address
with a non-zero level isn't safe).

diff --git a/ext/mwrap/mwrap.c b/ext/mwrap/mwrap.c
index acc8960..9bb44d0 100644
--- a/ext/mwrap/mwrap.c
+++ b/ext/mwrap/mwrap.c
@@ -32,6 +32,8 @@ extern size_t __attribute__((weak)) rb_gc_count(void);
extern VALUE __attribute__((weak)) rb_cObject;
extern VALUE __attribute__((weak)) rb_yield(VALUE);

+static size_t total_bytes_inc, total_bytes_dec;
+
/* true for glibc/dlmalloc/ptmalloc, not sure about jemalloc */
#define ASSUMED_MALLOC_ALIGNMENT (sizeof(void *) * 2)

@@ -327,6 +329,8 @@ static struct src_loc *update_stats_rcu_lock(size_t size, uintptr_t caller)
   if (caa_unlikely(!totals)) return 0;
   if (locating++) goto out; /* do not recurse into another *alloc */

+  uatomic_add(&total_bytes_inc, size);
+
   rcu_read_lock();
   if (has_ec_p()) {
     int line;
@@ -390,6 +394,7 @@ void free(void *p)
     if (l) {
       size_t age = generation - h->as.live.gen;

+      uatomic_add(&total_bytes_dec, h->size);
       uatomic_set(&h->size, 0);
       uatomic_add(&l->frees, 1);
       uatomic_add(&l->age_total, age);
@@ -710,12 +715,16 @@ static VALUE mwrap_dump(int argc, VALUE * argv, VALUE mod)
   return Qnil;
}

+/* The whole operation is not remotely atomic... */
static void *totals_reset(void *ign)
{
   struct cds_lfht *t;
   struct cds_lfht_iter iter;
   struct src_loc *l;

+  uatomic_set(&total_bytes_inc, 0);
+  uatomic_set(&total_bytes_dec, 0);
+
   rcu_read_lock();
   t = rcu_dereference(totals);
   cds_lfht_for_each_entry(t, &iter, l, hnode) {
@@ -1033,6 +1042,16 @@ static VALUE mwrap_quiet(VALUE mod)
   return rb_ensure(rb_yield, SIZET2NUM(cur), reset_locating, 0);
}

+static VALUE total_inc(VALUE mod)
+{
+  return SIZET2NUM(total_bytes_inc);
+}
+
+static VALUE total_dec(VALUE mod)
+{
+  return SIZET2NUM(total_bytes_dec);
+}
+
/*
  * Document-module: Mwrap
  *
@@ -1084,6 +1103,8 @@ void Init_mwrap(void)
   rb_define_singleton_method(mod, "each", mwrap_each, -1);
   rb_define_singleton_method(mod, "[]", mwrap_aref, 1);
   rb_define_singleton_method(mod, "quiet", mwrap_quiet, 0);
+  rb_define_singleton_method(mod, "total_bytes_allocated", total_inc, 0);
+  rb_define_singleton_method(mod, "total_bytes_freed", total_dec, 0);
   rb_define_method(cSrcLoc, "each", src_loc_each, 0);
   rb_define_method(cSrcLoc, "frees", src_loc_frees, 0);
   rb_define_method(cSrcLoc, "allocations", src_loc_allocations, 0);
diff --git a/test/test_mwrap.rb b/test/test_mwrap.rb
index 8425c35..d112b4e 100644
--- a/test/test_mwrap.rb
+++ b/test/test_mwrap.rb
@@ -272,4 +272,15 @@ class TestMwrap < Test::Unit::TestCase
       res == :foo or abort 'Mwrap.quiet did not return block result'
     end;
   end
+
+  def test_total_bytes
+    assert_separately(+"#{<<~"begin;"}\n#{<<~'end;'}")
+    begin;
+      require 'mwrap'
+      Mwrap.total_bytes_allocated > 0 or abort 'nothing allocated'
+      Mwrap.total_bytes_freed > 0 or abort 'nothing freed'
+      Mwrap.total_bytes_allocated > Mwrap.total_bytes_freed or
+        abort 'freed more than allocated'
+    end;
+  end
end
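
A sketch of how the two totals from the patch above might be used to tell a leak from fragmentation. It assumes the methods `Mwrap.total_bytes_allocated` and `Mwrap.total_bytes_freed` added by this patch; the helper names `live_bytes` and `growth_kind` and the 0.5 cut-off are illustrative choices, not part of mwrap. RSS values are expected to come from elsewhere (for example the OS), since mwrap does not track its own memory usage.

```ruby
# Bytes handed out by malloc that have not been freed yet.
def live_bytes(allocated, freed)
  allocated - freed
end

# Hypothetical heuristic: compare how much the live byte count grew with
# how much RSS grew between two samples. The 0.5 ratio is an arbitrary
# cut-off chosen for this sketch.
def growth_kind(live_before, live_after, rss_before, rss_after)
  rss_growth = rss_after - rss_before
  return :stable if rss_growth <= 0

  live_growth = live_after - live_before
  live_growth >= rss_growth * 0.5 ? :likely_leak : :likely_fragmentation
end

# Only runs when the process is started under a mwrap build with the patch.
if defined?(Mwrap) && Mwrap.respond_to?(:total_bytes_allocated)
  live = live_bytes(Mwrap.total_bytes_allocated, Mwrap.total_bytes_freed)
  puts "live malloc bytes: #{live}"
end
```

If live bytes stay roughly flat while RSS keeps climbing, fragmentation is the more likely explanation; if both climb together, a genuine leak is.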


#6

Yes, this patch looks right to me.

Even if we don't have perfect fidelity here, it will give absolute
clarity on "leak" vs "fragmentation related bloat". Even though
jemalloc tries hard to compensate for fragmentation, bloat is still
possible.

For full context, here is a dump from when I started the process (it was at 500 MB RSS):

https://transfer.sh/Q9zQS/start.txt

Here is how it looks now (1.2 GB RSS):

https://transfer.sh/14fokY/now.txt

The script I use to generate this stuff is:


(Eric Wong) #7

> Yes, this patch looks right to me.

OK, I've just pushed it out to RubyGems.org as a prerelease:

    mwrap-2.0.0.4.gd1ea.gem

> Even if we don't have perfect fidelity here, it will give absolute
> clarity on "leak" vs "fragmentation related bloat". Even though
> jemalloc tries hard to compensate for fragmentation, bloat is still
> possible.

Just wondering, are you still on jemalloc 3.6.0 or one of the
newer versions? I seem to remember 3.6.0 interacting badly with
cross-thread frees (from another project years ago), and mwrap
relies on call_rcu to free memory, which happens in another thread...

Maybe narenas:1 with jemalloc, or even glibc malloc with
MALLOC_ARENA_MAX=1, might make it easier to discern a real leak
from fragmentation.
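
A minimal sketch of starting a child process with glibc malloc limited to a single arena. `MALLOC_ARENA_MAX` is the glibc setting named above; the child command here only echoes the variable to show it was passed through, and would be replaced by the real application command.

```ruby
require 'rbconfig'
require 'open3'

# Environment for the child only; the parent process is unaffected.
env = { 'MALLOC_ARENA_MAX' => '1' }

# Placeholder child: prints the variable it received.
out, status = Open3.capture2(env, RbConfig.ruby, '-e',
                             'print ENV["MALLOC_ARENA_MAX"]')
puts "child saw MALLOC_ARENA_MAX=#{out} (exit #{status.exitstatus})"
```

With jemalloc, the corresponding setting would presumably be passed the same way through its `MALLOC_CONF` variable (`narenas:1`).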

In any case, one technique I've used in the past which never
required special debugging tools (aside from source access) was
to use a bisection search over the code path. I disabled or skipped
half of the remaining code at each step, checking whether the leak
still reproduced, to narrow down where it happened.
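
The bisection idea can be sketched roughly as below. This is an illustration only: it assumes the code path can be split into independently skippable steps, that exactly one step leaks, and that the caller supplies a block reporting whether the leak reproduces with a given set of steps enabled. The name `bisect_leak` and the step names are hypothetical.

```ruby
# Find the single leaking step by repeatedly skipping half of the
# remaining suspects. The block receives the steps left enabled and
# must return true if the leak still reproduces.
def bisect_leak(steps)
  suspects = steps.dup
  while suspects.size > 1
    skipped = suspects.first(suspects.size / 2)
    if yield(steps - skipped)
      suspects -= skipped # leak still there: culprit was not skipped
    else
      suspects = skipped  # leak gone: culprit is among the skipped steps
    end
  end
  suspects.first
end

# Stand-in for a real reproduction run: pretend :cache is the leaker.
steps = %i[parse render cache log]
culprit = bisect_leak(steps) { |enabled| enabled.include?(:cache) }
puts "leak narrowed down to: #{culprit}"
```

Each round halves the suspect list, so the number of reproduction runs grows with the logarithm of the number of steps.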

···

Sam Saffron <sam.saffron@gmail.com> wrote: