Request For Comments: exception safe ConditionVariable#wait

Hi everyone,

I'd recently run into some trouble trying to use timeout
with ConditionVariable#wait. (Then I tried some custom
solutions, which turned out to be unsafe themselves and
led me down that "unexpected deadlock" wild goose chase.)

I'd like to propose making ConditionVariable#wait itself
exception safe, so that using it with timeout should be
as simple as:

    begin
      timeout(timeout_secs) { condition_var.wait(mutex) }
    rescue Timeout::Error
    end

This is the original #wait from thread.rb (1.8.2preview2)

  def wait(mutex)
    begin
      mutex.exclusive_unlock do
        @waiters.push(Thread.current)
        Thread.stop
      end
    ensure
      mutex.lock
    end
  end

It has two problems that I know of. The most serious
is that I believe an exception can occur between the
"begin" and the "mutex.exclusive_unlock" line, such
that, when the "ensure" tries to reacquire the lock,
the mutex had not been unlocked yet, and the thread
hangs trying to get a lock on a mutex it already owns.
(I can get it to hang at the mutex.lock, and I presume
that's the reason.)

The second is that Thread.current is not removed from
@waiters when an exception occurs. This became a
problem in conjunction with timeout, because the
@waiters array was filling up with hundreds of
Thread.current references over time.

My attempt to address these two issues looks like:

  def wait(mutex)
    unlocked = false
    begin
      mutex.exclusive_unlock do
        unlocked = true
        @waiters.push(Thread.current)
        Thread.stop
      end
    rescue # Timeout::Error
      @waiters.delete Thread.current # Q: is Array#delete an atomic operation?
      raise
    ensure
      mutex.lock if unlocked
    end
  end

Does this seem reasonable? Have I made any obvious errors?
Is this something that would best be proposed as an RCR?

I was wondering whether Array#delete is an atomic operation?
I would not want competing threads to both be trying to
delete from the same array simultaneously, if that's not safe.
(Do I need to put Thread.exclusive around it?)

Also - Have I done that rescue/raise properly, to re-raise
the exception? I'm asking because in testing with this
modified #wait, I have to hit ^C many times to interrupt
my process (both win32 and linux). It's as though the
Interrupt exception is being gobbled, which is not what I
would want.

   ... Hmm... I see that Timeout::Error's ancestors
   include Interrupt, SignalException ... I am intentionally
   catching Timeout::Error when i call #wait, like:

      begin
        timeout(timeout_secs) { condition_var.wait(mutex) }
      rescue Timeout::Error
      end

   ...but this shouldn't catch Interrupt exceptions
   should it?

Anyway any comments/criticism/suggestions/feedback would be
most welcome . . . Is RCR the correct process for a proposal
like this?

Thanks,

Regards,

Bill

IMO, this isn't so much an exception-safety issue as it is a
thread-safety issue. To write exception-safe code, it is necessary to
guarantee that certain code will never raise an exception. For example,
suppose that I have implemented my own string class:

  class MyString
    def initialize(characters, encoding)
      @characters = characters.dup
      @encoding = encoding
    end

    attr_reader :characters, :encoding
  end

now suppose I want to write MyString#replace. I can easily do this if I
can guarantee that swap() never raises an exception:

  class MyString
    def replace(other)
      copy = other.dup
      swap(copy)
    end

    def swap(other)
      tmp_characters = other.characters
      tmp_encoding = other.encoding
      other.characters = self.characters
      other.encoding = self.encoding
      self.characters = tmp_characters
      # better not get an exception here!
      self.encoding = tmp_encoding
    end
  end

Now if I write:

  s1 = MyString.new([?a, ?b, ?c], :ASCII)
  s2 = MyString.new([?1, ?2, ?3], :ASCII)
  timeout(x) do
    s.replace(s2)
  end

I'm in trouble, because suddenly I can get a TimeoutError from inside
MyString#swap.

Your solution for ConditionVariable#wait works for that one case, but
it's not generally applicable to other cases like this one. Instead,
for each case, I either have to set Thread.critical or I have to write
code to rollback any changes I make after an exception anywhere I
non-atomically change the state of an object. What a pain (and a
performance bottleneck!)

I'm not sure what a good general solution the problem is, but consider
this:

  - a keyword (or method) is added to Ruby that indicates a method (or
    block) never raises an exception
  - if Thread#raise is called while the thread is executing that method
    (or block), then set a flag and wait for the block to exit before
    actually raising the exception
  - If the user still wants to abort execution of that block, the user
    can call Thread#force_raise or Thread#kill.

So now MyString#swap looks like this:

  def swap(other)
    no_raise do
      tmp_characters = other.characters
      tmp_encoding = other.encoding
      other.characters = self.characters
      other.encoding = self.encoding
      self.characters = tmp_characters
      # better not get an exception here!
      self.encoding = tmp_encoding
    end
  end

A similar solution can be applied to your revised version of
ConditionVariable#wait (consider what happens if you time out twice,
e.g.):

  def foo(timeout_secs)
    begin
      timeout(timeout_secs) { @condition_var.wait(@mutex) }
    rescue Timeout::Error
    end
  end

  timeout(5.01) do
    foo(5)
  end

Paul

Hi Paul,

IMO, this isn't so much an exception-safety issue as it is a
thread-safety issue.

With timeout, I'm thinking it's sort of a combination of
exceptions and threads. Timeout can raise an exception at
"any point" within the timeout block; but it uses a secondary
thread to perform the raise. Which means, things like
Thread.exclusive { } can be used to protect blocks of code
against the exception that would be raised by the timeout thread.

I hadn't thought about the more general exception safety
issues you bring up:

To write exception-safe code, it is necessary to
guarantee that certain code will never raise an exception. For example,
suppose that I have implemented my own string class:

  class MyString
    def initialize(characters, encoding)
      @characters = characters.dup
      @encoding = encoding
    end

    attr_reader :characters, :encoding
  end

now suppose I want to write MyString#replace. I can easily do this if I
can guarantee that swap() never raises an exception:

  class MyString
    def replace(other)
      copy = other.dup
      swap(copy)
    end

    def swap(other)
      tmp_characters = other.characters
      tmp_encoding = other.encoding
      other.characters = self.characters
      other.encoding = self.encoding
      self.characters = tmp_characters
      # better not get an exception here!
      self.encoding = tmp_encoding
    end
  end

Now if I write:

  s1 = MyString.new([?a, ?b, ?c], :ASCII)
  s2 = MyString.new([?1, ?2, ?3], :ASCII)
  timeout(x) do
    s.replace(s2)
  end

I'm in trouble, because suddenly I can get a TimeoutError from inside
MyString#swap.

Your solution for ConditionVariable#wait works for that one case, but
it's not generally applicable to other cases like this one. Instead,
for each case, I either have to set Thread.critical or I have to write
code to rollback any changes I make after an exception anywhere I
non-atomically change the state of an object. What a pain (and a
performance bottleneck!)

I agree - but this seems to be in the nature of timeout
itself. My proposed changes to ConditionVariable#wait were
indeed intended to only ensure that that one method, #wait,
would function properly when used inside a timeout block.
(Because I happen to be in need of a #wait that I can time
out on. :slight_smile:

I see what you mean now (exception safety vs. thread
safety.) I guess the email subject should more properly
be, "timeout-safe ConditionVariable#wait". :slight_smile:

My sense is that currently, one has to be very careful
where one uses a timeout. I don't use timeout very often,
so a lack of a general solution isn't necessarly giving me
too much grief. But I do need to be sure that whatever I
do use timeout with, is safe. (At least that's my current
mindset.)

I'm not sure what a good general solution the problem is, but consider
this:

  - a keyword (or method) is added to Ruby that indicates a method (or
    block) never raises an exception
  - if Thread#raise is called while the thread is executing that method
    (or block), then set a flag and wait for the block to exit before
    actually raising the exception
  - If the user still wants to abort execution of that block, the user
    can call Thread#force_raise or Thread#kill.

So now MyString#swap looks like this:

  def swap(other)
    no_raise do
      tmp_characters = other.characters
      tmp_encoding = other.encoding
      other.characters = self.characters
      other.encoding = self.encoding
      self.characters = tmp_characters
      # better not get an exception here!
      self.encoding = tmp_encoding
    end
  end

Interesting. Kind of like an interrupt mask for exceptions. :slight_smile:

If you only wanted to protect against timeout, and not
exceptions in general, I think (as you've already
observed) you could:

   def swap(other)
     Thread.exclusive do
       tmp_characters = other.characters
       tmp_encoding = other.encoding
       other.characters = self.characters
       other.encoding = self.encoding
       self.characters = tmp_characters
       # better not get an exception here!
       self.encoding = tmp_encoding
     end
   end

A similar solution can be applied to your revised version of
ConditionVariable#wait (consider what happens if you time out twice,
e.g.):

  def foo(timeout_secs)
    begin
      timeout(timeout_secs) { @condition_var.wait(@mutex) }
    rescue Timeout::Error
    end
  end

  timeout(5.01) do
    foo(5)
  end

I'm sorry, I'm not sure I'm following. Does the nested
timeout here violate any safeguards inside the revised
ConditionVariable#wait implementation? I realize the
rescue Timeout::Error there is not sufficient to
squelch timeout exceptions from any nested/outer timeout
block. Your outer timeout would need to rescue its own
exceptions as well. . . . Is that what you were referring
to? Or am I not understanding your example?

Thanks for your feedback !!

Regards,

Bill

Your revised ConditionVariable#wait:

  def wait(mutex)
    unlocked = false
    begin
      mutex.exclusive_unlock do
        unlocked = true
        @waiters.push(Thread.current)
        Thread.stop
      end
    rescue # Timeout::Error
      @waiters.delete Thread.current # Q: is Array#delete an atomic operation?
      raise
    ensure
      mutex.lock if unlocked
    end
  end

What happens if you get a timeout first inside the exclusive_unlock
block, then inside the rescue (before the delete occurs)?

(I believe that this is a pretty rare race condition, but is still
possible).

Paul

···

On Sat, Aug 28, 2004 at 06:08:39AM +0900, Bill Kelly wrote:

I'm sorry, I'm not sure I'm following. Does the nested
timeout here violate any safeguards inside the revised
ConditionVariable#wait implementation? I realize the
rescue Timeout::Error there is not sufficient to
squelch timeout exceptions from any nested/outer timeout
block. Your outer timeout would need to rescue its own
exceptions as well. . . . Is that what you were referring
to? Or am I not understanding your example?

Hi Paul,

Your revised ConditionVariable#wait:

  def wait(mutex)
    unlocked = false
    begin
      mutex.exclusive_unlock do
        unlocked = true
        @waiters.push(Thread.current)
        Thread.stop
      end
    rescue # Timeout::Error
      @waiters.delete Thread.current # Q: is Array#delete an atomic operation?
      raise
    ensure
      mutex.lock if unlocked
    end
  end

What happens if you get a timeout first inside the exclusive_unlock
block, then inside the rescue (before the delete occurs)?

(I believe that this is a pretty rare race condition, but is still
possible).

Wow, nice catch! Thanks - !

Hmm... Would that mean it's possible for

     ensure
# -------> %exception here%
       mutex.lock if unlocked
     end

potentially an exception firing once we're in the
ensure block as well? Or is ensure implemented to
have the properties of your no_raise style block,
I wonder?

Well, the reason I had proposed the timeout-safe
wait, or wanted to implement it using timeout,
is because I thought it would be easier than the
signal-based timed_wait I had also implemented.

I wrote what I believed to be a thread-safe (and
hopefully exception safe) timed_wait using signals,
to sidestep the "timeout" exception throwing issue.
But it came out to a disconcerting 36 lines...

01: class ConditionVariable
02:
03: def timed_wait(mutex, timeout_secs)
04: waiter_th = Thread.current
05: alarm_th = nil
06: unlocked = false
07: begin
08: awake = false
09:
10: alarm_th = Thread.start do
11: sleep timeout_secs
12: Thread.exclusive do
13: unless awake
14: awake = true
15: @waiters.delete waiter_th
16: waiter_th.wakeup
17: end
18: end
19: end
20:
21: Thread.exclusive do
22: unless awake
23: mutex.exclusive_unlock do
24: unlocked = true
25: @waiters.push(Thread.current)
26: begin
27: Thread.stop
28: ensure
29: awake = true
30: end
31: end
32: end
33: end
34: ensure
35: mutex.lock if unlocked
36: alarm_th.kill if alarm_th and alarm_th.alive?
37: end
38: end
39:
40: end

I say disconcerting because the code in thread.rb is
so beautifully simple. (Although, as we have seen,
it is not necessarily 100% well-protected, on the other
hand.) But this routine above seemed more complicated
than I'd like. [Note - I didn't care if it worked with
timeout, since it doesn't use timeout.]

. . .

Wow your observation has really got me curious now:
Can the exception raised by timeout occur anywhere
in:
  - a rescue block?
  - an ensure block?

That would seem to make it really hard to guard
against...

Thanks for your thoughts !

Regards,

Bill

Wow, nice catch! Thanks - !

Hmm... Would that mean it's possible for

     ensure
# -------> %exception here%
       mutex.lock if unlocked
     end

potentially an exception firing once we're in the
ensure block as well? Or is ensure implemented to
have the properties of your no_raise style block,
I wonder?

Actually the exception has the potential to occur when calling the
Mutex#lock method, if Ruby decides to switch threads just at the point
of the call (so it's not really between the ensure and the next
statement).

Wow your observation has really got me curious now:
Can the exception raised by timeout occur anywhere
in:
  - a rescue block?
  - an ensure block?

I'm not an expert on the Ruby interpreter, but if I understand eval.c
correctly, it can occur any time Ruby can switch threads, which
includes but is not limited to:
  - each time a node is evaluated
  - each time (just before) a method is called
  - each time an extension waits for IO using rb_thread_select

That would seem to make it really hard to guard
against...

I agree. I've never really liked the thread-based timeout for exactly
that reason (that and my C++ extensions don't always play nicely with
Ruby threads, since they keep data on the stack). Instead I use an
event loop, and let the event loop handle my timeouts; this guarantees
that timeout exceptions are only raised from well-defined points in the
code.

Paul

···

On Sat, Aug 28, 2004 at 05:38:53PM +0900, Bill Kelly wrote: