Checking for race conditions with Ruby threads

I've got an untrustworthy legacy app that seems to have a nasty race
condition when a certain number of users click on a link at just the
right time.
I'm trying to write a Ruby unit test to trigger this, but I'm a bit stumped.

Here's my code:
require 'test/unit'
require 'open-uri'
class TestStream < Test::Unit::TestCase
  def test_work_unit_uniqueness
    ids = []
    threads = []
    100.times do |i|
      threads[i] = Thread.new do
        Thread.current[:appid] = URI("http://the_test_url/something").read
      end
    end
    ids = threads.collect {|t| t.join ; t[:appid]}
    assert_equal(ids, ids.uniq)
  end
end

The test URL spits back a single number, in plain text. The idea is
to hit it 100 times, and make sure that each 'worker' thread got a
unique number. However, it seems to me that with green threads, the
OS running on the Ruby machine will actually serialize these requests.
Do I need to switch to Linux and use 'fork', or is there another way
to handle this?

Thanks,
--Wilson.

Quoting Wilson Bilkovich <wilsonb@gmail.com>:

I've got an untrustworthy legacy app that seems to have a nasty
race condition when a certain number of users click on a link at
just the right time.

I'm trying to write a Ruby unit test to trigger this, but I'm a
bit stumped.

You cannot test for race conditions, only prove them.

-mental

Thanks. This code exposed the problem successfully. Not sure if it's
the cleanest way, though.

require 'test/unit'
require 'open-uri'
class TestStream < Test::Unit::TestCase
  def test_work_unit_uniqueness
    ids = []
    pipes = []
    test_cmd = %q{ ruby -ropen-uri -e"print URI('the_url').read" }
    100.times do |i|
        pipes[i] = IO::popen(test_cmd)
    end
    pipes.each do |p|
      p.each_line {|ln| ids << ln}
    end
    assert_equal(ids, ids.uniq)
  end
end

···

On 11/8/05, mental@rydia.net <mental@rydia.net> wrote:

Quoting Wilson Bilkovich <wilsonb@gmail.com>:

> I've got an untrustworthy legacy app that seems to have a nasty
> race condition when a certain number of users click on a link at
> just the right time.

> I'm trying to write a Ruby unit test to trigger this, but I'm a
> bit stumped.

You cannot test for race conditions, only prove them.

-mental

Quoting Wilson Bilkovich <wilsonb@gmail.com>:

Thanks. This code exposed the problem successfully. Not sure if
it's the cleanest way, though.

require 'test/unit'
require 'open-uri'
class TestStream < Test::Unit::TestCase
  def test_work_unit_uniqueness
    ids = []
    pipes = []
    test_cmd = %q{ ruby -ropen-uri -e"print URI('the_url').read"
}
    100.times do |i|
        pipes[i] = IO::popen(test_cmd)
    end
    pipes.each do |p|
      p.each_line {|ln| ids << ln}
    end
    assert_equal(ids, ids.uniq)
  end
end

Using Process.fork and IO.pipe might be "cleaner" than this, but
probably isn't necessary here.

However, be aware that even if this test case passes, it doesn't
reliably demonstrate the absence of a race condition.

-mental

Unfortunately, this is a SQL Server application being used as a 'work
queue', and SQL Server 2000 does not allow exclusive locks on rows.
Various hoops have been jumped through to try to shoot down the
locking problems, but the lack of a test case has made them hard to
compare, before today.
Luckily, I've rewritten the whole thing in Rails against Oracle, but
that system is still a couple of weeks away.

Are you saying that race conditions are in general impossible to
prove/disprove, or just that my code isn't the way to do it? Heh.

Thanks,
--Wilson.

···

On 11/8/05, mental@rydia.net <mental@rydia.net> wrote:

Quoting Wilson Bilkovich <wilsonb@gmail.com>:

> Thanks. This code exposed the problem successfully. Not sure if
> it's the cleanest way, though.
>
> require 'test/unit'
> require 'open-uri'
> class TestStream < Test::Unit::TestCase
> def test_work_unit_uniqueness
> ids = []
> pipes = []
> test_cmd = %q{ ruby -ropen-uri -e"print URI('the_url').read"
> }
> 100.times do |i|
> pipes[i] = IO::popen(test_cmd)
> end
> pipes.each do |p|
> p.each_line {|ln| ids << ln}
> end
> assert_equal(ids, ids.uniq)
> end
> end

Using Process.fork and IO.pipe might be "cleaner" than this, but
probably isn't necessary here.

However, be aware that even if this test case passes, it doesn't
reliably demonstrate the absence of a race condition.

-mental

Are you sure? In the article "Locking architecture", MSDN says:

  Locks can be acquired on rows, pages, keys, ranges of keys, indexes,
tables, or databases.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/architec/8_ar_sa2_2sit.asp

What exactly is the locking problem you're having? I'm interested because
we're using SQL Server 2000 to provide a work queue and we haven't had
any problems of the kind you're talking about.

Regards,

Sean

···

On 11/9/05, Wilson Bilkovich <wilsonb@gmail.com> wrote:

Unfortunately, this is a SQL Server application being used as a 'work
queue', and SQL Server 2000 does not allow exclusive locks on rows.

You may be able to demonstrate a race condition with a test case, but when the test passes you cannot be certain that the race condition no longer exists. It may simply mean that the race was "won" instead of "lost".

Instead you need to prove your program has no race conditions.

···

On Nov 8, 2005, at 4:14 PM, Wilson Bilkovich wrote:

On 11/8/05, mental@rydia.net <mental@rydia.net> wrote:

However, be aware that even if this test case passes, it doesn't
reliably demonstrate the absence of a race condition.

Are you saying that race conditions are in general impossible to
prove/disprove, or just that my code isn't the way to do it? Heh.

--
Eric Hodel - drbrain@segment7.net - http://segment7.net
FEC2 57F1 D465 EB15 5D6E 7C11 332A 551C 796C 9F04

Writing unit tests is not the way to go for race conditions -- depending
on system load, phase of the moon, CPU speed, etc, you will get false
negatives sometimes (or sometimes a lot).

There's value in constructing stuff like this to show up such a bug for
debugging, but race conditions just aren't something you can reliably
unit test for.

To prove the absence of a race condition, there's no real alternative[1]
but to look at the code and reason about it.

It's possible that I'm wrong in the practical sense, if the race
condition is extremely egregious, but unless you have precise control
over execution order, if there is a chance for the wrong code to win a
race condition, there is always a chance for the right code to win it
occasionally.

-mental

[1] I think David Overton from Mercury has been working on an automated
checker, but it has inherent limitations.

···

On Wed, 2005-11-09 at 09:14 +0900, Wilson Bilkovich wrote:

Are you saying that race conditions are in general possible to
prove/disprove, or just that my code isn't the way to do it?

Look up ROWLOCK in SQL Books Online. You might also want the READPAST
locking hint to skip locked rows held by other processes.

Regards,

Sean

···

On 11/9/05, Sean O'Halpin <sean.ohalpin@gmail.com> wrote:

On 11/9/05, Wilson Bilkovich <wilsonb@gmail.com> wrote:
> Unfortunately, this is a SQL Server application being used as a 'work
> queue', and SQL Server 2000 does not allow exclusive locks on rows.

With enough control over the components under test, it is often possible
to 'rig the race', such that a specific error in serialisation is forced
to occur (i.e. artificially blocking one of the threads for the duration
of the test).

(Probably not possible in this instance though.)

dave

···

On Wed, Nov 09, 2005 at 01:06:32PM +0900, MenTaLguY wrote:

There's value in constructing stuff like this to show up such a bug for
debugging, but race conditions just aren't something you can reliably
unit test for.

--
http://david.holroyd.me.uk/

Sorry to pollute the mailing list with SQL Server talk, folks. I beg
forgiveness in advance. :slight_smile:

The application is (amongst other things) a data entry system for
volunteers working on hurricane disaster recovery.
They click the "Give me something new to work on" button, and the
system needs to find the first pending item of work, assign it to
them, and redirect them to a page where they can see scanned PDF
images, etc, etc. The data is then hammered into a mainframe system
by distributed (Ruby) clients that hit a web page to find out which id
number to work next.

The original logic was:
Begin Transaction
@id = select min(id) from some_table where status_code = 'P' --(for pending)
update some_table set status_code = 'W' where id = @id
Commit

With default locking, two requests might select the same record before
the first update is issued. Because this happens inside a transaction,
the other transactions don't 'see' this pending change until they
themselves have committed.

READPAST, ROWLOCK, and XLOCK are the usual suggestions for this kind of thing.
However, a careful study of the SQL Server docs shows that XLOCK and
ROWLOCK are not compatible. Only pages and tables can be locked with
XLOCK. That would be OK, except the READPAST only works with row
locks, not with anything larger. Heh.

Because this system is live 24/7 right now, we can't make drastic
changes. However, it seems like (READPAST, UPDLOCK) on the select
statement, and then checking to see how many rows were updated in the
update statement has fixed it. If no rows were updated (due to the
lock), then we just restart the transaction. That seems incredibly
ugly to me, but luckily this system is going away soon.

That being said, I'm definitely not a SQL Server expert (I prefer
Oracle and Postgres), so if you can recommend a better solution, I'd
be in your debt.

Thanks,
--Wilson.

···

On 11/8/05, Sean O'Halpin <sean.ohalpin@gmail.com> wrote:

On 11/9/05, Sean O'Halpin <sean.ohalpin@gmail.com> wrote:
> On 11/9/05, Wilson Bilkovich <wilsonb@gmail.com> wrote:
> > Unfortunately, this is a SQL Server application being used as a 'work
> > queue', and SQL Server 2000 does not allow exclusive locks on rows.

Look up ROWLOCK in SQL Books Online. You might also want the READPAST
locking hint to skip locked rows held by other processes.

Regards,

Sean