Extremely strange segfault

Hi all,

I’m trying to use ruby on AIX 5.1 to generate Nagios configuration files.
It’s all been working fine until today, when I started having very, very
odd segfaults.

The thing that makes them odd is that they appear and disappear depending
on lines I add or remove in my scripts, and (here’s the weird part) those
lines can be comments or “puts” statements.

I tried adding output statements to track down the segfaults, and they
went away. So, I tried commenting the statements out, and the faults were
still gone. Okay, so I delete them, and now the faults are back.
Depending on where my print statements are and where the comments are and
other completely random things, the segfaults also appear at different
lines in the script.

This is in a 1200-line script, using an ldap.so compiled against OpenLDAP
2.something and digest/md5. Ooh, except I just discovered that I’m not
actually using digest/md5, but removing it causes another segfault.

Yes, I’m relatively convinced that this is something I’m doing, but, well,
you’ve seen where my debugging has gotten me: very confused.

The ruby is one I compiled myself using gcc 3.3.1 on AIX 5.1. The only
configure flag I used was -qmaxmem=32768, but I did run the following
after configuring:

find . -name Makefile -exec perl -pi -e 's/ -brtl$//' {} \;

perl -pi -e 's/^.+RSTRING.+$//' ext/syck/emitter.c

These were necessary to compile on AIX. The -brtl fix is because
apparently -brtl must appear at the beginning of a DLDFLAGS declaration on
AIX. I know I should have submitted this bug, but I (obviously) have not.
How should I?

The emitter change was because emitter refused to compile based on a
problem with that line. The strange thing? It was commented with ‘//’.
It mentioned (in the comment) a variable not used anywhere else, so gcc
(stupidly?) complained about the missing declaration. I removed the
comment, and all went ok.

Let me guess: My build environment is very broken, so there’s not much
help for me, right?

Well, if anyone has any pointers on how I might track this problem down,
I’d appreciate it. I’m willing to just upgrade, if necessary, but I would
prefer to actually understand what is happening.

I am also willing to share the code, but it’s long enough that I didn’t
want to just spam the list with it.

Thanks,
Luke

···


I don’t want the world, I just want your half.

Sorry, forgot to mention: I’m using ruby 1.8.0.

···


Meeting, n.:
An assembly of people coming together to decide what person or
department not represented in the room must solve a problem.

Hi,

These were necessary to compile on AIX. The -brtl fix is because
apparently -brtl must appear at the beginning of a DLDFLAGS declaration on
AIX. I know I should have submitted this bug, but I (obviously) have not.
How should I?

Posting the patch to the ruby-core mailing list is most convenient for me.

I am also willing to share the code, but it’s long enough that I didn’t
want to just spam the list with it.

If you can put your script (and data) to reproduce error on the web,
it’s the best way. Otherwise, send me directly.

						matz.
···

In message “extremely strange segfault” on 03/12/16, “Luke A. Kanies” luke@madstop.com writes:

I tried adding output statements to track down the segfaults, and they
went away. So, I tried commenting the statements out, and the faults were
still gone. Okay, so I delete them, and now the faults are back.
Depending on where my print statements are and where the comments are and
other completely random things, the segfaults also appear at different
lines in the script.

Try to give a backtrace when it segfaults.

Guy Decoux

These were necessary to compile on AIX. The -brtl fix is because
apparently -brtl must appear at the beginning of a DLDFLAGS declaration on
AIX. I know I should have submitted this bug, but I (obviously) have not.
How should I?

Posting the patch to the ruby-core mailing list is most convenient for me.

Okay, I’ll do that.

I am also willing to share the code, but it’s long enough that I didn’t
want to just spam the list with it.

If you can put your script (and data) to reproduce error on the web,
it’s the best way. Otherwise, send me directly.

Hmmm. I can probably do that (I’ll have to check my employers) but I have
to warn you: It’s about 370 hosts in LDAP using a custom schema.

Before I get to that, let me try to spend some time further isolating the
problem. It’s not exactly straightforward testing, but let me at least
see if I can reproduce the problem without the LDAP data.

Luke

···

On Tue, 16 Dec 2003, Yukihiro Matsumoto wrote:

In message “extremely strange segfault” on 03/12/16, “Luke A. Kanies” luke@madstop.com writes:


Should I say “I believe in physics”, or “I know that physics is true”?
– Ludwig Wittgenstein, On Certainty, 602.

(Here I delve into unknown territory…)

wzd4845@naadmd02(134) $ gdb /usr/local/bin/ruby
GNU gdb 6.0
Copyright 2003 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "powerpc-ibm-aix5.1.0.0"...
(gdb) run -S /home/wzd4845/bin/naghosts
Starting program: /usr/local/bin/ruby -S /home/wzd4845/bin/naghosts

Program received signal SIGSEGV, Segmentation fault.
0x1003a09c in rb_gc_mark ()
(gdb) bt
#0 0x1003a09c in rb_gc_mark ()
#1 0x1003a580 in rb_gc_mark_children ()
#2 0x1003a168 in rb_gc_mark ()
#3 0x1003a778 in rb_gc_mark_children ()
#4 0x1003a168 in rb_gc_mark ()
#5 0x10039d20 in mark_locations_array ()
#6 0x10039db0 in rb_gc_mark_locations ()
#7 0x1003baa0 in rb_gc ()
#8 0x1003960c in rb_newobj ()
#9 0x100147d4 in new_blktag ()
#10 0x1001b098 in rb_eval ()
#11 0x10023ae0 in rb_call0 ()
#12 0x10024100 in rb_call ()
#13 0x100244ec in rb_funcall2 ()
#14 0x1002838c in rb_obj_call_init ()
#15 0x10048f00 in rb_class_new_instance ()
#16 0x10036a48 in call_cfunc ()
#17 0x1002347c in rb_call0 ()
#18 0x10024100 in rb_call ()
#19 0x1001c530 in rb_eval ()
#20 0x10023ae0 in rb_call0 ()
#21 0x10024100 in rb_call ()
#22 0x1001c530 in rb_eval ()
#23 0x1001d4c8 in rb_eval ()
#24 0x10020834 in rb_yield_0 ()
#25 0x10020ca0 in rb_yield ()
#26 0x10022580 in rb_ensure ()
#27 0xd1ae7f90 in rb_ldap_conn_search_b ()
from /usr/local/lib/ruby/site_ruby/1.8/powerpc-aix5.1.0.0/ldap.so
#28 0x10022580 in rb_ensure ()
---Type <return> to continue, or q <return> to quit---
#29 0xd1ae80e8 in rb_ldap_conn_search_s ()
from /usr/local/lib/ruby/site_ruby/1.8/powerpc-aix5.1.0.0/ldap.so
#30 0x10036a48 in call_cfunc ()
#31 0x1002347c in rb_call0 ()
#32 0x10024100 in rb_call ()
#33 0x1001c530 in rb_eval ()
#34 0x1001b1d8 in rb_eval ()
#35 0x1001b808 in rb_eval ()
#36 0x10015c34 in eval_node ()
#37 0x10016708 in ruby_exec ()
#38 0x10016850 in ruby_run ()
#39 0x10000570 in main ()

Hopefully that tells you something…

Do you need anything else?

Luke

···

On Tue, 16 Dec 2003, ts wrote:

Try to give a backtrace when it segfaults.


"The leader of Jamestown was “John Smith” (not his real name), under
whose direction the colony engaged in a number of activities,
primarily related to starving." – Dave Barry, “Dave Barry Slept Here”

Is it feasible to GC.disable in your app? That would at least tell you
if it is a mark/free related bug.
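
A minimal sketch of that test, assuming the script's main work can be wrapped in a block. The helper name without_gc and the stand-in workload are hypothetical and not taken from the script in question; GC.disable and GC.enable are the standard calls.

```ruby
# Hypothetical helper: run a block with the garbage collector switched off,
# then restore the previous GC state.
def without_gc
  was_disabled = GC.disable   # returns true if GC was already disabled
  yield
ensure
  GC.enable unless was_disabled
end

result = without_gc do
  # Stand-in for the real work (LDAP query, object creation, grouping).
  (1..1000).map { |i| "host#{i}" }
end
```

If the crash disappears with the collector off, that points at an object being marked or freed incorrectly rather than at the script's own logic, although it does not yet say which extension is responsible.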

#4 0x1003a168 in rb_gc_mark ()
#5 0x10039d20 in mark_locations_array ()
#6 0x10039db0 in rb_gc_mark_locations ()
#7 0x1003baa0 in rb_gc ()

You have a problem with the GC; it probably finds an invalid object on the
stack.

The best is probably to first verify the extensions that you use, one of
these extensions can have a bug.

Guy Decoux

#29 0xd1ae80e8 in rb_ldap_conn_search_s ()
   from /usr/local/lib/ruby/site_ruby/1.8/powerpc-aix5.1.0.0/ldap.so

What is your version of ruby_ldap ?

If it's 0.8.1 try to use 0.7.2 : there is a bug in rb_ldap_conn_search_s()
for 0.8.1 (see [ruby-talk:85228])

Guy Decoux

I can’t test it until I get back to work, but how would I do that? There
certainly is a decent amount of memory shunting involved, since I’m doing
an ldap query, creating a bunch of objects based on the results, and then
storing the objects in various and sundry groups, along with some
self-rolled autovivification.

Basically, I’m pulling a host list from ldap, converting each host into an
object, using some basic logic to store that host in groups for which I’ve
set up pseudo-autovivification, and then looping across each host to do some
other stuff.
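
For readers unfamiliar with the term, here is a rough sketch of what autovivified groups can look like in Ruby. This is not the code from the script under discussion; the Host struct, its fields, and the group names are made up for illustration. It relies on Hash.new with a default block, which creates a group the first time it is referenced.

```ruby
# Hypothetical host record; the real script builds its objects from LDAP entries.
Host = Struct.new(:name, :os, :site)

# Groups spring into existence on first access: the default block stores
# a fresh array under the missing key and returns it.
groups = Hash.new { |hash, key| hash[key] = [] }

hosts = [
  Host.new("web1", "aix",   "nyc"),
  Host.new("web2", "aix",   "sfo"),
  Host.new("db1",  "linux", "nyc"),
]

hosts.each do |host|
  groups["os-#{host.os}"]     << host
  groups["site-#{host.site}"] << host
end

aix_names = groups["os-aix"].map { |h| h.name }
```

Every host ends up referenced from several arrays at once, which is the kind of object graph the garbage collector has to walk during its mark phase.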

I’ve had some strange issues with ldap, and I miss the all-perl Net::LDAP
with SSL, but it basically worked until I started with the autovivification
and adding the objects to lots of groups. That’s why I figure I can
isolate the problem some, but I got sidetracked into trying to make my
lexer perform better (it’s taking about 12 seconds just to tokenize a 99k
file, which seemed high, so…).

Hopefully tomorrow I’ll be able to isolate this problem at least into a
specific component, but so far it’s been, um, strange.

Luke

···

On Tue, 16 Dec 2003, Joel VanderWerf wrote:

Is it feasible to GC.disable in your app? That would at least tell you
if it is a mark/free related bug.


You can’t have everything. Where would you put it?
– Stephen Wright

Well, I can’t precisely say that it was a problem with GC, but I can’t
reproduce the problem with GC disabled.

This is just about the strangest problem I’ve ever had, because it will
appear if I comment a print statement out, but then disappear if I just
delete the print statement. That’s why I can’t clearly say it was a
problem with GC, even though disabling GC seems to fix it: It could be the
extra line in the file that fixes it or something silly like that.

The segfault consistently comes around an .each iteration I have
associated with some LDAP entries and some class definitions. I get
different line numbers for the segfault every time, but it is consistently
somewhere in my processing of the LDAP information. This makes me think
it is a problem with the ldap.so somehow, although I don’t know if
loaded libraries can kill Ruby – I assume so.
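
One way to pin down a fault that moves between line numbers is to force a collection at every step of the loop, so that a bad object is hit as soon as possible after it is created instead of whenever the collector happens to run. The sketch below assumes this approach; the entries array is a stand-in for real LDAP search results and the variable names are hypothetical.

```ruby
# Stand-in for the entries returned by an LDAP search.
entries = Array.new(5) { |i| { "cn" => ["host#{i}"] } }

processed = []
entries.each_with_index do |entry, i|
  GC.start                          # force a full collection before each entry
  processed << entry["cn"].first
  $stderr.puts "processed entry #{i}" if $DEBUG
end
```

If the segfault then shows up on the same iteration every run, the entry handled just before it is a good place to start looking. This makes the loop much slower, so it is only meant for debugging.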

If there are any other tests you would like me to try, please let me know.

Luke

···

On Tue, 16 Dec 2003, Joel VanderWerf wrote:

Is it feasible to GC.disable in your app? That would at least tell you
if it is a mark/free related bug.


Due to circumstances beyond your control, you are master of your fate
and captain of your soul.

Yes, it’s 0.8.1. I’ll try 0.7.2 when I get a chance.

Luke

···

On Wed, 17 Dec 2003, ts wrote:

#29 0xd1ae80e8 in rb_ldap_conn_search_s ()
from /usr/local/lib/ruby/site_ruby/1.8/powerpc-aix5.1.0.0/ldap.so

What is your version of ruby_ldap ?

If it’s 0.8.1 try to use 0.7.2 : there is a bug in rb_ldap_conn_search_s()
for 0.8.1 (see [ruby-talk:85228])


My favorite was a professor at a University I Used To Be Associated With
who claimed that our requirement of a non-alphabetic character in our
passwords was an abridgement of his freedom of speech.
– Jacob Haller

“Luke A. Kanies” luke@madstop.com writes:

···

On Tue, 16 Dec 2003, Joel VanderWerf wrote:

Is it feasible to GC.disable in your app? That would at least tell you
if it is a mark/free related bug.

Well, I can’t precisely say that it was a problem with GC, but I can’t
reproduce the problem with GC disabled.

This is just about the strangest problem I’ve ever had, because it will
appear if I comment a print statement out, but then disappear if I just
delete the print statement.

If this is the strangest bug you ever had, you are doing pretty good.

What you have come across is a “Heisenbug”:

http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?heisenbug

Happy hunting!

d.k.


Daniel Kelley - San Jose, CA
For email, replace the first dot in the domain with an at.

That’s what this was, though. I would encounter the bug, so then I’d add
some print statements to try to bracket the bug and it would go away.
Okay, then I’d just comment the statements out, maybe their execution
fixed it; still no bug. Okay, delete the statements entirely; now the bug
is back.

The Heisenberg nature of the bug did not stand up to deeper scrutiny, but
it certainly fooled my initial (usually sufficient) debugging.

Luke

···

On Wed, 17 Dec 2003, Daniel Kelley wrote:

“Luke A. Kanies” luke@madstop.com writes:

This is just about the strangest problem I’ve ever had, because it will
appear if I comment a print statement out, but then disappear if I just
delete the print statement.

If this is the strangest bug you ever had, you are doing pretty good.

What you have come across is a “Heisenbug”:

http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?heisenbug


"I think that’s how Chicago got started. A bunch of people in New York
said, ‘Gee, I’m enjoying the crime and the poverty, but it just isn’t
cold enough. Let’s go west.’ "
–Richard Jeni