[BUG?] apparent hang (win32 and Linux), have stack trace

Hi Ruby interals gurus !

I'm hoping anyone can tell me, based on the attached stack
trace (from CVS 1.8.2preview2, [i386-mswin32]), whether this
is indicitave of a bug in Ruby.

I have a fairly small script now, which hangs on both
win32 (1.8.2preview2) and linux (1.8.0). It takes anywhere
from a few minutes to a few hours to lock up. (You can
just start it at let it run, no interaction required.)

On Linux, I am able to ^C the process and get a ruby
stack trace. On Windows ^C doesn't abort (common enough
when stuck in Windows IO routines), but I was able to
attach the debugger to the process and get a stack trace.

The script is at:
http://bwk.homeip.net/ftp/buffered-io-test4a.rb

The ruby stack trace I get when interrupting the hung
process on Linux, is always at or immediately following
the Thread.stop in ConditionVariable#wait:

/usr/local/lib/ruby/1.8/thread.rb:192:in `stop': Interrupt from /usr/local/lib/ruby/1.8/thread.rb:192:in `wait'
    from /usr/local/lib/ruby/1.8/thread.rb:190:in `exclusive_unlock'
    from /usr/local/lib/ruby/1.8/thread.rb:140:in `exclusive'
    from /usr/local/lib/ruby/1.8/thread.rb:140:in `exclusive_unlock'
    from /usr/local/lib/ruby/1.8/thread.rb:190:in `wait'
    from buffered-io-test4a.rb:59:in `global_wait'
    from buffered-io-test4a.rb:58:in `synchronize'
    from buffered-io-test4a.rb:58:in `global_wait'
    from buffered-io-test4a.rb:87
    from buffered-io-test4a.rb:72:in `loop'
    from buffered-io-test4a.rb:88

The windows stack trace (attached) always looks about the
same each time, too. (rb_thread_stop -> rb_thread_schedule
-> rb_w32_select).

Thanks for your help !!

Regards,

Bill

hang-1.8.2preview2-stacktrace.txt (5.5 KB)

Hi again,

I wondered if I could check an assumption or two here...
I have a slightly tweaked version of the prior script,
which for whatever reason seems to hang even more quickly.
(Seconds to minutes....)
http://bwk.homeip.net/ftp/buffered-io-test4b.rb

I've recompiled ruby without optimization so I can see
what's going on in the debugger more easily.

One assumption I'd like to check, is whether a
(delay == DELAY_INFTY) select should *ever* occur in
rb_thread_schedule while my script is running.

I am thinking this should not occur, because my main
thread *only* ever sleeps in the following situation:

  begin
    alarm_th = Thread.start {
      sleep timeout_secs
      $global_signal.broadcast # %%WAS: .signal
    }
    $global_mutex.synchronize {
      print "{"
      $global_signal.wait($global_mutex)
      print "}"
    }
  ensure
    alarm_th.kill if alarm_th and alarm_th.alive?
  end

...that is, my main thread only goes to sleep when
another thread has specifically been created with
a non-INFTY delay (sleep timeout_secs) in order to
wake my main thread up. So, I'd think that there
should be no way to get a (delay == DELAY_INFTY) in
rb_thread_schedule.

...UNLESS... . . . The other assumption I'd like to
check:

My main thread also reads and writes to a socket:

  if select(nil, [$cl_sock], nil, 0)
    print "W"
    sent = $cl_sock.send(" #$frame ", 0)
  end

  if select([$cl_sock], nil, nil, 0)
    print "R"
    dat = $cl_sock.recv(65536)
  end

... my assumption here is that send/recv WILL NOT
block, because select just told me the socket is
readable or writable. (No other thread is touching
this socket.)

According to what I'm seeing in the debugger, what
is apparently happening is that I eventually get
a (delay == DELAY_INFTY) select, and it waits
forever.

Also - it would appear this hanging condition only
occurs when the main thread is in the:

      print "{"
-> $global_signal.wait($global_mutex)
      print "}"

...and never in the send/recv part of the code.

So it seems a (delay == DELAY_INFTY) is happening
there, when I think it shouldn't.... Am I wrong?

Thanks for your help,

Regards,

Bill

Hi yet again, :slight_smile:

I think my script is simple enough to post now...
After running it for a few minutes, I get:

w{-}{}{}S+w{-}{}{}S+w{-}{}{}S+w{-}{
deadlock 0x2a74888: sleep:- - c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:195
deadlock 0x2a85628: sleep:- (main) - c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:195
c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:195:in `stop': Thread(0x2a85628): deadlock (fatal)
        from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:195:in `wait'
        from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:193:in `exclusive_unlock'
        from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:142:in `exclusive'
        from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:142:in `exclusive_unlock'
        from c:/dev/ruby-dbg/lib/ruby/1.8/thread.rb:193:in `wait'
        from buffered-io-test4d.rb:46:in `global_wait'
        from buffered-io-test4d.rb:44:in `synchronize'
        from buffered-io-test4d.rb:44:in `global_wait'
        from buffered-io-test4d.rb:66
        from buffered-io-test4d.rb:60:in `loop'
        from buffered-io-test4d.rb:60

I believe this script is simple enough that based on
inspection we should be able to say there is no way
it could deadlock.

I'll intersperse some comments to try to support this
assertion:

01:
02: require 'socket'
03: require 'thread'
04: require 'timeout'
05:
06: $frame = 0
07:
08: $global_mutex = Mutex.new
09: $global_signal = ConditionVariable.new
10:
11: $background_mutex = Mutex.new
12: $background_signal = ConditionVariable.new
13:
14: $sock = TCPSocket.new("curmudgeon", 27999)

# "curmudgeon:27999" is a server that will merely
# send back a prompt then wait for us to respond.
# We are not going to respond - so after we've read
# the initial prompt, we expect line 18 to wait
# forever in the select. This is by intention, as
# it seems to be important to have a thread doing
# a time-unlimited select to get these problems to
# occur.

···

#
# (Note that we can replace line 14 with:
# $server = TCPServer.new(12345)
# $sock = TCPSocket.new("localhost", 12345)
# and get similar erroneous behavior. But for
# whatever reason the localhost version exhibits
# the "hanging" behavior, while the "curmudgeon:27999"
# version prefers to deadlock.)

# So - once background_read has read the initial
# data it receives from the server, we expect it
# to just wait forever on line 18.

15:
16: def background_read
17: loop do
18: if select([$sock], nil, nil, nil)
19: print "r"
20: dat = $bg_rd_sock.recv(65536)
21: $global_signal.signal
22: end
23: end
24: end
25:

# background_write is the only code owning the
# $background_mutex... so there should never be
# any deadlock contention there. So all we do
# is repeatedly wait for the main thread to
# signal us. Then print chars to stdout and go
# back to sleep.

26: def background_write
27: loop do
28: $background_mutex.synchronize {
29: print "-"
30: $background_signal.wait($background_mutex)
31: print "+"
32: }
33:
34: print "w"
35: end
36: end
37:

# global_wait creates a secondary alarm thread to
# awaken the primary thread doing the wait after
# the timeout elapses. Line 44 is the only code
# owning the $global_mutex, so there should not be
# any deadlock contention there.
# Line 46 is where the deadlocks and hangs always
# occur.

38: def global_wait(timeout_secs)
39: begin
40: alarm_th = Thread.start {
41: sleep timeout_secs
42: $global_signal.broadcast # %%WAS: .signal
43: }
44: $global_mutex.synchronize {
45: print "{"
46: $global_signal.wait($global_mutex)
47: print "}"
48: }
49: ensure
50: alarm_th.kill if alarm_th and alarm_th.alive?
51: end
52: end
53:
54:
55: bg_read_th = Thread.new { background_read }
56: bg_write_th = Thread.new { background_write }
57:
58: $stdout.sync = true
59:

# Our main loop does nothing but periodically signal
# the background_write thread, and sleep in global_wait.

60: loop do
61: $frame += 1
62: if ($frame % 3) == 0
63: print "S"
64: $background_signal.signal
65: end
66: global_wait(0.003)
67: end
68:

This script will either deadlock, or hang, on both
Windows (1.8.2preview2) and Linux (1.8.0).

I believe it is simple enough to claim that by inspection
it should never deadlock or hang. (Unless I am just
being dense.)

I will begin delving into rb_thread_shedule to try to
determine what is going awry. Any tips from the
experts would be most welcome. :slight_smile:

Regards,

Bill

15:
16: def background_read
17: loop do
18: if select([$sock], nil, nil, nil)
19: print "r"
20: dat = $bg_rd_sock.recv(65536)
21: $global_signal.signal
22: end
23: end
24: end
25:

Whoops ---- line 20 has a typo. That thread was
exiting anyway. <blush> .... However, that just
makes the script simpler... I've removed the
background_read thread entirely, and its associated
socket. I'm still getting the deadlock...

Hopefully have this figured out soon.

Regards,

Bill

Hi,

At Fri, 27 Aug 2004 05:18:27 +0900,
Bill Kelly wrote in [ruby-talk:110584]:

This script will either deadlock, or hang, on both
Windows (1.8.2preview2) and Linux (1.8.0).

1.9 doesn't seem to deadlock, I'll investigate it more.

···

--
Nobu Nakada

At Fri, 27 Aug 2004 05:18:27 +0900,
Bill Kelly wrote in [ruby-talk:110584]:
> This script will either deadlock, or hang, on both
> Windows (1.8.2preview2) and Linux (1.8.0).

1.9 doesn't seem to deadlock, I'll investigate it more.

Hi Nobu - Thanks!

I'm just writing another email right now, the problem
seems to be in my global_wait function.. =(

Thanks for investigating !!! Sorry to waste your time...

Regards,

Bill

Hello once again,

Hopefully have this figured out soon.

Er. Well I was hoping all this wouldn't turn out to
be ENTIRELY in my code... <rueful grin>

It appears the problem is with my wait/timeout code:

def global_wait(timeout_secs)
  begin
    alarm_th = Thread.start {
      sleep timeout_secs
      $global_signal.broadcast
    }
    $global_mutex.synchronize {
############## if alarm_th's broadcast happens
############## before we get to the wait here,
############## then, obviously, we may never wake up
      $global_signal.wait($global_mutex)
    }
  ensure
    alarm_th.kill if alarm_th and alarm_th.alive?
  end
end

As a workaround, I've changed it to:

def global_wait(timeout_secs)
  begin
    awake = false
    alarm_th = Thread.start {
      sleep timeout_secs
      while not awake
        $global_signal.broadcast
        sleep(0.001) # WAS: Thread.pass
      end
    }
    $global_mutex.synchronize {
      $global_signal.wait($global_mutex)
      awake = true
    }
  ensure
    alarm_th.kill if alarm_th and alarm_th.alive?
  end
end

...So the alarm thread sticks around until it's
*sure* the primary thread has awakened.

I guess it's pretty cheesy, with the sleep(0.001)
and all... But as long as it's foolproof, it will
work for my purposes.

I'm now wondering if I should add any begin/rescue
in the alarm thread, in case #broadcast could ever
raise an exception.... This code will be in a
server that I want to run indefinitely, with no
problems... :slight_smile:

Thanks everyone who followed the thread; sorry if
I've wasted your time . . .

Regards,

Bill