Deadlock in large vm_xmalloc while doing backtrace

I work on a fairly large Ruby application that runs as a daemon,
accepting requests for test jobs and distributing them. It runs on Ruby
1.9.1-p429 currently on Linux Redhat 5. The job scheduler ("foreman")
occasionally locks up in one site or another with a stack trace for the
main thread that looks like the following (many levels of stack elided):

Thread 4 (process 5488):
#0 0x00d4d410 in __kernel_vsyscall ()
#1 0x00d1a0b9 in __lll_lock_wait () from /lib/libpthread.so.0
#2 0x00d15834 in _L_lock_92 () from /lib/libpthread.so.0
#3 0x00d1534a in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0x0811d82e in native_mutex_lock (lock=0xfffffffc) at
thread_pthread.c:36
#5 0x081226a6 in rb_thread_call_with_gvl (func=0x8062b40 <gc_with_gvl>,
data1=0x81954e0) at thread.c:984
#6 0x0806282e in garbage_collect_with_gvl (objspace=0x81954e0) at
gc.c:598
#7 0x080628a6 in vm_xmalloc (objspace=0x81954e0, size=160796684) at
gc.c:624
#8 0x080de09e in rb_str_buf_new (capa=128) at string.c:722
#9 0x080d29bb in rb_enc_vsprintf (enc=0x0, fmt=0x81552ac "%s:%d:in
`%s'", ap=0xbf8f95e4 "X!?\t\"") at sprintf.c:1125
#10 0x080d2a50 in rb_vsprintf (fmt=0x81552ac "%s:%d:in `%s'",
ap=0xbf8f95e4 "X!?\t\"") at sprintf.c:1154
#11 0x080d2a7b in rb_sprintf (format=0x81552ac "%s:%d:in `%s'") at
sprintf.c:1164
#12 0x0810ef01 in vm_backtrace_each (th=<value optimized out>,
limit_cfp=0xb7fbea98, cfp=0xb7fbed98, file=0x9a42158
"/local/pete/current/com.rb", line_no=34, ary=174937756) at vm.c:714
#13 0x0810fd32 in rb_f_caller (argc=0, argv=0xb7f3f1f8) at vm.c:746
#14 0x0810dea6 in call_cfunc (func=0x810fca0 <rb_f_caller>,
recv=165378108, len=128, argc=-4, argv=0xb7f3f1f8) at
vm_insnhelper.c:286

The foreman's apparently trying to lock the Global Vm Lock to do a
malloc().
Several questions:
1) How to tell which thread owns the global_vm_lock? In gdb, printing
th->blocking_region_buffer for the main thread returns a pointer to
something, while another thread returns a zero value for
blocking_region_buffer. This suggests that the other thread owns the
global_vm_lock. On the other hand, if I dump the actual global_vm_lock,
the __data.__owner is 5488, the thread id for the main thread. I'm
thinking that this thread is trying to lock the GVL when it already has
it, but I'm not sure.
2) in frame 8, the rb_str_buf_new should be requesting 129 bytes, yet
the size parameter in frame 7 is more than 160 Megabytes! This huge
allocation appears to be what triggers garbage collection with gvl. How
this happens is a complete mystery.
3) within gdb, is it possible to figure out what the content of the
backtrace would be if it didn't hang? I know that ruby's backtrace stack
is somewhat complex - any pointers to simple-ish ways to find out where
the ruby interpreter is?

Thanks,
Bill Paulson

···

--
Posted via http://www.ruby-forum.com/.

Bill Paulson писал 05.06.2012 22:52:

I work on a fairly large Ruby application that runs as a daemon,
accepting requests for test jobs and distributing them. It runs on Ruby
1.9.1-p429 currently on Linux Redhat 5. The job scheduler ("foreman")
occasionally locks up in one site or another with a stack trace for the
main thread that looks like the following (many levels of stack elided):

1.9.1 is something of a danger sign for me: it never worked good enough for
production, at least for me.

Thread 4 (process 5488):
#0 0x00d4d410 in __kernel_vsyscall ()
#1 0x00d1a0b9 in __lll_lock_wait () from /lib/libpthread.so.0
#2 0x00d15834 in _L_lock_92 () from /lib/libpthread.so.0
#3 0x00d1534a in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0x0811d82e in native_mutex_lock (lock=0xfffffffc) at
thread_pthread.c:36
#5 0x081226a6 in rb_thread_call_with_gvl (func=0x8062b40 <gc_with_gvl>,
data1=0x81954e0) at thread.c:984
#6 0x0806282e in garbage_collect_with_gvl (objspace=0x81954e0) at
gc.c:598
#7 0x080628a6 in vm_xmalloc (objspace=0x81954e0, size=160796684) at
gc.c:624
#8 0x080de09e in rb_str_buf_new (capa=128) at string.c:722
#9 0x080d29bb in rb_enc_vsprintf (enc=0x0, fmt=0x81552ac "%s:%d:in
`%s'", ap=0xbf8f95e4 "X!?\t\"") at sprintf.c:1125
#10 0x080d2a50 in rb_vsprintf (fmt=0x81552ac "%s:%d:in `%s'",
ap=0xbf8f95e4 "X!?\t\"") at sprintf.c:1154
#11 0x080d2a7b in rb_sprintf (format=0x81552ac "%s:%d:in `%s'") at
sprintf.c:1164
#12 0x0810ef01 in vm_backtrace_each (th=<value optimized out>,
limit_cfp=0xb7fbea98, cfp=0xb7fbed98, file=0x9a42158
"/local/pete/current/com.rb", line_no=34, ary=174937756) at vm.c:714
#13 0x0810fd32 in rb_f_caller (argc=0, argv=0xb7f3f1f8) at vm.c:746
#14 0x0810dea6 in call_cfunc (func=0x810fca0 <rb_f_caller>,
recv=165378108, len=128, argc=-4, argv=0xb7f3f1f8) at
vm_insnhelper.c:286

The foreman's apparently trying to lock the Global Vm Lock to do a
malloc().
Several questions:
1) How to tell which thread owns the global_vm_lock? In gdb, printing
th->blocking_region_buffer for the main thread returns a pointer to
something, while another thread returns a zero value for
blocking_region_buffer. This suggests that the other thread owns the
global_vm_lock. On the other hand, if I dump the actual global_vm_lock,
the __data.__owner is 5488, the thread id for the main thread. I'm
thinking that this thread is trying to lock the GVL when it already has
it, but I'm not sure.
2) in frame 8, the rb_str_buf_new should be requesting 129 bytes, yet
the size parameter in frame 7 is more than 160 Megabytes! This huge
allocation appears to be what triggers garbage collection with gvl. How
this happens is a complete mystery.
3) within gdb, is it possible to figure out what the content of the
backtrace would be if it didn't hang? I know that ruby's backtrace stack
is somewhat complex - any pointers to simple-ish ways to find out where
the ruby interpreter is?

Take a look at ruby's gdbinit:
   http://rxr.whitequark.org/mri/source/.gdbinit

For example, try rb_backtrace or rb_ps at gdb's prompt.

···

Thanks,
Bill Paulson

--
   WBR, Peter Zotov.

In some 1.9.X versions of Ruby there *was* a public funtion in thread.c:

  int rb_has_gvl_p(void)

or maybe:

  int has_gvl_p(void)

Unfortunately it's not present anymore in 1.9.3p0 (but it was in some
1.9.2pXXX version, sure, I used it).

···

2012/6/5 Bill Paulson <lists@ruby-forum.com>:

1) How to tell which thread owns the global_vm_lock?

--
Iñaki Baz Castillo
<ibc@aliax.net>

Bill Paulson писал 05.06.2012 22:52:

I work on a fairly large Ruby application that runs as a daemon,
accepting requests for test jobs and distributing them. It runs on Ruby
1.9.1-p429 currently on Linux Redhat 5. The job scheduler ("foreman")
occasionally locks up in one site or another with a stack trace for the
main thread that looks like the following (many levels of stack elided):

1.9.1 is something of a danger sign for me: it never worked good enough for
production, at least for me.

I'll look into changing to a later release.

Thread 4 (process 5488):
#0 0x00d4d410 in __kernel_vsyscall ()
#1 0x00d1a0b9 in __lll_lock_wait () from /lib/libpthread.so.0
#2 0x00d15834 in _L_lock_92 () from /lib/libpthread.so.0
#3 0x00d1534a in pthread_mutex_lock () from /lib/libpthread.so.0
#4 0x0811d82e in native_mutex_lock (lock=0xfffffffc) at
thread_pthread.c:36
#5 0x081226a6 in rb_thread_call_with_gvl (func=0x8062b40 <gc_with_gvl>,
data1=0x81954e0) at thread.c:984
#6 0x0806282e in garbage_collect_with_gvl (objspace=0x81954e0) at
gc.c:598
#7 0x080628a6 in vm_xmalloc (objspace=0x81954e0, size=160796684) at
gc.c:624
#8 0x080de09e in rb_str_buf_new (capa=128) at string.c:722
#9 0x080d29bb in rb_enc_vsprintf (enc=0x0, fmt=0x81552ac "%s:%d:in
`%s'", ap=0xbf8f95e4 "X!?\t\"") at sprintf.c:1125
#10 0x080d2a50 in rb_vsprintf (fmt=0x81552ac "%s:%d:in `%s'",
ap=0xbf8f95e4 "X!?\t\"") at sprintf.c:1154
#11 0x080d2a7b in rb_sprintf (format=0x81552ac "%s:%d:in `%s'") at
sprintf.c:1164
#12 0x0810ef01 in vm_backtrace_each (th=<value optimized out>,
limit_cfp=0xb7fbea98, cfp=0xb7fbed98, file=0x9a42158
"/local/pete/current/com.rb", line_no=34, ary=174937756) at vm.c:714
#13 0x0810fd32 in rb_f_caller (argc=0, argv=0xb7f3f1f8) at vm.c:746
#14 0x0810dea6 in call_cfunc (func=0x810fca0 <rb_f_caller>,
recv=165378108, len=128, argc=-4, argv=0xb7f3f1f8) at
vm_insnhelper.c:286

The foreman's apparently trying to lock the Global Vm Lock to do a
malloc().
Several questions:
1) How to tell which thread owns the global_vm_lock? In gdb, printing
th->blocking_region_buffer for the main thread returns a pointer to
something, while another thread returns a zero value for
blocking_region_buffer. This suggests that the other thread owns the
global_vm_lock. On the other hand, if I dump the actual global_vm_lock,
the __data.__owner is 5488, the thread id for the main thread. I'm
thinking that this thread is trying to lock the GVL when it already has
it, but I'm not sure.
2) in frame 8, the rb_str_buf_new should be requesting 129 bytes, yet
the size parameter in frame 7 is more than 160 Megabytes! This huge
allocation appears to be what triggers garbage collection with gvl. How
this happens is a complete mystery.
3) within gdb, is it possible to figure out what the content of the
backtrace would be if it didn't hang? I know that ruby's backtrace stack
is somewhat complex - any pointers to simple-ish ways to find out where
the ruby interpreter is?

Take a look at ruby's gdbinit:
http://rxr.whitequark.org/mri/source/.gdbinit

For example, try rb_backtrace or rb_ps at gdb's prompt.

The ruby we're running doesn't seem to have debug.c compiled into it. The .gdbinit file fails with a message
No symbol "ruby_dummy_gdb_enums" in current context.
Do you know what compile flags cause debug.c to be included?
Thanks
Bill Paulson

While not completely certain, this bug appears to be due to using the 'caller' method inside an object finalizer. I've written a bug report on a simpler program (Bug #6558: Crash in garbage collection - using caller inside finalizer method - Ruby master - Ruby Issue Tracking System) which I think is related. Removing 'caller()' from the finalizer in our real code seems to have stopped the problems (knocks on wood and crosses fingers.)

Bill Paulson

···

On Jun 5, 2012, at 5:11 PM, Peter Zotov <whitequark@whitequark.org> wrote:

Bill Paulson писал 05.06.2012 22:52:

I work on a fairly large Ruby application that runs as a daemon,
accepting requests for test jobs and distributing them. It runs on Ruby
1.9.1-p429 currently on Linux Redhat 5. The job scheduler ("foreman")
occasionally locks up in one site or another with a stack trace for the
main thread that looks like the following (many levels of stack

William Paulson писал 06.06.2012 17:42:

Bill Paulson писал 05.06.2012 22:52:

Take a look at ruby's gdbinit:
http://rxr.whitequark.org/mri/source/.gdbinit

For example, try rb_backtrace or rb_ps at gdb's prompt.

The ruby we're running doesn't seem to have debug.c compiled into it.
The .gdbinit file fails with a message
No symbol "ruby_dummy_gdb_enums" in current context.
Do you know what compile flags cause debug.c to be included?
Thanks
Bill Paulson

I compile my Rubies with rvm[1], and it inserts debug flags automatically.
I'd say you should look into options of ./configure script: something like
--with-debug=yes should make the trick.

  [1]: https://rvm.io/

···

--
   WBR, Peter Zotov.