Rinda: start_keeper bug? Looking for second opinion

I believe I've found a race condition in Rinda. But it seems a little
"too easy", so I wonder if I'm missing something. I'd like a second
opinion from someone more familiar with the implementation before I file
a bug report.

The issue is in Rinda::TupleSpace#move (and also #write and #read). Each
of those methods calls Rinda::TupleSpace#start_keeper to start the
"keeper" thread that checks periodically for expired tuples.
Unfortunately, I believe the thread is started too early. As a
consequence, it can terminate prematurely and leave the TupleSpace with
expiring tuples and no thread to clear them out. (e.g. calls fail to
time out properly, etc.)

Here is the sequence of events, based on ruby 1.8.6p0 (but it seems that
trunk has not changed since then).

ts = Rinda::TupleSpace.new(1) # Check interval of 1 second
# Don't write anything into the TupleSpace before...
tup = ts.take([:foo], 1) # <- hangs instead of timing out

Why? In tuplespace.rb, line 442 (TupleSpace#move), start_keeper is
called. It starts the thread (line 568). If the thread is scheduled
immediately, it will immediately check need_keeper? (line 579) and,
discovering that the TupleSpace is empty, it promptly terminates. Later,
the main thread pushes the WaitTemplateEntry (with the 1 second timeout)
onto @take_waiter (line 454), expecting that the keeper thread will
collect it after the entry expires. But the keeper thread is already
gone. Thus, the wait template never dies, and the take call never
returns-- at least until someone else comes in and starts another keeper
thread.

Shouldn't start_keeper be called AFTER pushing the wait template onto
@take_waiter? Similarly, in #read, shouldn't line 479 come AFTER line
486, and similarly also in #write? Am I missing something?

Furthermore, it doesn't seem that start_keeper should care about
remaining outside the synchronize sections in those methods. In fact, if
anything, maybe it should be INSIDE the monitor lock to prevent
interleaved calls to start_keeper (and resultant spawning of multiple
keeper threads).

Puzzled,
Daniel Azuma
dazuma@gmail.com

···

--
Posted via http://www.ruby-forum.com/.

I believe I've found a race condition in Rinda. But it seems a little
"too easy", so I wonder if I'm missing something. I'd like a second
opinion from someone more familiar with the implementation before I file
a bug report.

The issue is in Rinda::TupleSpace#move (and also #write and #read). Each
of those methods calls Rinda::TupleSpace#start_keeper to start the
"keeper" thread that checks periodically for expired tuples.
Unfortunately, I believe the thread is started too early. As a
consequence, it can terminate prematurely and leave the TupleSpace with
expiring tuples and no thread to clear them out. (e.g. calls fail to
time out properly, etc.)

Here is the sequence of events, based on ruby 1.8.6p0 (but it seems that
trunk has not changed since then).

ts = Rinda::TupleSpace.new(1) # Check interval of 1 second
# Don't write anything into the TupleSpace before...
tup = ts.take([:foo], 1) # <- hangs instead of timing out

Why? In tuplespace.rb, line 442 (TupleSpace#move), start_keeper is
called. It starts the thread (line 568). If the thread is scheduled
immediately, it will immediately check need_keeper? (line 579) and,
discovering that the TupleSpace is empty, it promptly terminates. Later,
the main thread pushes the WaitTemplateEntry (with the 1 second timeout)
onto @take_waiter (line 454), expecting that the keeper thread will
collect it after the entry expires. But the keeper thread is already
gone. Thus, the wait template never dies, and the take call never
returns-- at least until someone else comes in and starts another keeper
thread.

Shouldn't start_keeper be called AFTER pushing the wait template onto
@take_waiter? Similarly, in #read, shouldn't line 479 come AFTER line
486, and similarly also in #write? Am I missing something?

I think you are right. I have CC'd Masatoshi SEKI.

Furthermore, it doesn't seem that start_keeper should care about
remaining outside the synchronize sections in those methods. In fact, if
anything, maybe it should be INSIDE the monitor lock to prevent
interleaved calls to start_keeper (and resultant spawning of multiple
keeper threads).

At worst, a separate synchronize block could be put in start_keeper.

···

On Aug 9, 2007, at 23:48, Daniel Azuma wrote:

--
Poor workers blame their tools. Good workers build better tools. The
best workers get their tools to do the work for them. -- Syndicate Wars