Something is wrong with my threads

Would any of you smart folks like to tell me what stupid thing I am doing here?

At this point I have a test program that just uses plain MRI, no gems. If I keep running it, it will eventually fail (after a dozen runs at most). Code at the end of this post.

What I'm doing is creating a lot of identical threads. In each one I stop immediately, and then after the thread is re-entered, set a thread variable to say the thread is done then stop again. I run them all, and then check they are all done.

The problem is, sometimes they aren't all done. Which makes no sense to me. Either `thread#run` sometimes doesn't wake a thread, or doesn't pass control to the thread; or something is wrong with thread variables?

Code:

require "time"

def make_thread(idx)
  Thread.new do
    # Set things up and wait
    Thread.current[:idx] = idx
    Thread.stop

    # do the work, then say we're done
    Thread.current[:done] = true
    Thread.stop
  end
end

threads = []; 1.upto(200){|i| threads << make_thread(i) }
starttm = Time.now

# Do the work
threads.each{|t| t.run }

# Wait until all work is done or we time out
sleep 0.1 until (threads.all?{|t| t[:done] } || Time.now >= starttm + 10)

# If all work is not done, something has gone very wrong?
fails = threads.reject{|t| t[:done] }
if fails.size > 0
  puts "FAIL"
  fails.each{|e| puts "#{e[:idx]}: #{e.alive?}" }
end

Click here to view Company Information and Confidentiality Notice.<http://www.jameshall.co.uk/index.php/small-print/email-disclaimer>

Please note that we have updated our privacy policy in line with new data protection regulations. Please refer to our website to view the ways in which we handle your data.

The problem is, sometimes they aren't all done. Which makes no sense
to me. Either `thread#run` sometimes doesn't wake a thread,

Quote from the docs for Thread#run, emphasis mine:

Wakes up `thr', making it **eligible** for scheduling.

Thread#run does not necessaryly run the thread. It only makes it
eligible for scheduling. Whether or not its execution is continued, the
Ruby interpreter decides.

What you are doing looks like re-inventing Thread#join. Can't you just
use that method? If you want a a timeout for the wait, Thread#join does
take an argument (`limit') for that purpose. See the documentation for
Thread#join.

Marvin

···

Am 17. April 2019 um 15:31 Uhr +0000 schrieb Andy Jones:

--
Blog: https://mg.guelker.eu

Would any of you smart folks like to tell me what stupid thing I am doing here?

SO MANY THINGS!

First off, you’re coding in a proportional font, and that always leads to bugs. :stuck_out_tongue:

More seriously, I’ll comment inline and then provide my counter example:

require “time"

You don’t need this require to use Time.now.

def make_thread(idx)

I would avoid having a make_threads method. Methods are opaque, and often you create threads to do work from a queue. You can pass everything in to your make_threads method, or you can just use blocks.

  Thread.new do
    # Set things up and wait
    Thread.current[:idx] = idx
    Thread.stop

Why call Thread.stop, at all? What do you care? This smells of a control issue.

    # do the work, then say we're done
    Thread.current[:done] = true

I generally don’t store state in my threads. Threads are usually workers, not containers. They usually get data from some source, do work, and then put resultant data into another source.

    Thread.stop

Ditto to #stop above. This thread would have been done if this line wasn’t here at all and it got to the end of it’s block. Control issues. Let go.

  end
end

threads = ; 1.upto(200){|i| threads << make_thread(i) }

container = ; something.each { container << thing }

is the implementation for:

container = something.map { thing }

starttm = Time.now

# Do the work
threads.each{|t| t.run }

They would be running if you didn’t stop them.

# Wait until all work is done or we time out
sleep 0.1 until (threads.all?{|t| t[:done] } || Time.now >= starttm + 10)

I don’t get this. Leave them alone. Use Thread#join to let them finish their work and stop on their own.

# If all work is not done, something has gone very wrong?
fails = threads.reject{|t| t[:done] }
if fails.size > 0
  puts "FAIL"
  fails.each{|e| puts "#{e[:idx]}: #{e.alive?}" }
end

Don’t poke at your threads. It makes them angry.

···

On Apr 17, 2019, at 08:31, Andy Jones <Andy.Jones@jameshall.co.uk> wrote:

-----

Here’s my counter:

t0 = Time.now

threads = 200.times.map { |i|
  Thread.new do
    Thread.current[:idx] = i # not necessary

    sleep 1 # lots of "work"
  end
}

threads.each(&:join) # finish working

p Time.now - t0

# => 1.013938

One post to reply to both Ryan and Marvin:

What you are doing looks like re-inventing Thread#join. Can't you just use that method?

Good question. I don't _think_ I can. I want the code in the thread to be executed IN that thread, not the main one? Isn't that what #join does? (In MRI this doesn't really matter, but in jRuby it does, and this code has to work in both. See below.)

But it's a moot point in the end, because AFAICS you can't call #join on a stopped thread? If I swap `run` for `join` I just get "No live threads left. Deadlock? (fatal)"

Why call Thread.stop, at all? What do you care? This smells of a control issue.

The _real_ code, the one I want to use, is inside an rspec test. It's testing a race condition between multiple threads. I start 50 threads, give them a time, and then when I restart them they all try and call the same method at that time. Then I test whether the method screwed up or not. So, yes, it has control issues. Testing is all about that :wink:

(Obviously this makes no sense at all under MRI because of the GIL. This is a library that needs to work in jRuby, though. Perversely, it works fine in jRuby but screws up in MRI...)

Click here to view Company Information and Confidentiality Notice.<http://www.jameshall.co.uk/index.php/small-print/email-disclaimer&gt;

Please note that we have updated our privacy policy in line with new data protection regulations. Please refer to our website to view the ways in which we handle your data.

Good question. I don't _think_ I can. I want the code in the thread to be executed IN that thread, not the main one? Isn't that what #join does? (In MRI this doesn't really matter, but in jRuby it does, and this code has to work in both. See below.)

That is NOT what join does. Join runs the thread (in the thread, not main) and waits for it to finish. But it sounds like that isn’t what you want based on this:

The _real_ code, the one I want to use, is inside an rspec test. It's testing a race condition between multiple threads. I start 50 threads, give them a time, and then when I restart them they all try and call the same method at that time. Then I test whether the method screwed up or not. So, yes, it has control issues. Testing is all about that :wink:

You don’t test race conditions by simulating non-deterministic conditions.

You test a race condition by deterministically faking the race collision.

You do this by designing your code to allow for your test to interact with it deterministically… I just helped someone with EXACTLY this problem by suggesting that they either refactor their code so the race creation condition is in it’s own method and thus overridable for testing OR by adding a `yield if block_given?` in the middle so the test running the method can force the race condition to happen via a block.

OR you don’t test it at all, because that way lies madness.

Or you design your code to not have the race condition at all (better).

···

On Apr 18, 2019, at 00:45, Andy Jones <Andy.Jones@jameshall.co.uk> wrote:

You don’t test race conditions by simulating non-deterministic conditions.
You test a race condition by deterministically faking the race collision.

I rather thought that I was doing the latter. I'm attempting to arrange for fifty threads to call a method at the same time.

That in itself is not deterministic -- I can't guarantee to be successful in doing that every time, unfortunately, or I could just make two threads instead of fifty. But I'm not "simulating a non-deterministic condition". The condition is deterministic; if I manage to call the same method twice at exactly the same time, then there will always be a problem. (Except the code uses a Mutex, so I'm testing that there is not one.)

I'm really not sure what point you are making here.

Or you design your code to not have the race condition at all (better).

I have. I've used a Mutex to eliminate the possibility of a race condition. Now I need to test that that works.

Click here to view Company Information and Confidentiality Notice.<http://www.jameshall.co.uk/index.php/small-print/email-disclaimer&gt;

Please note that we have updated our privacy policy in line with new data protection regulations. Please refer to our website to view the ways in which we handle your data.

> You don’t test race conditions by simulating non-deterministic conditions.
> You test a race condition by deterministically faking the race collision.

I rather thought that I was doing the latter. I'm attempting to arrange for fifty threads to call a method at the same time.

But you are aware that this is not reliably possible no matter what
you attempt, are you?

That in itself is not deterministic -- I can't guarantee to be successful in doing that every time, unfortunately, or I could just make two threads instead of fifty. But I'm not "simulating a non-deterministic condition". The condition is deterministic; if I manage to call the same method twice at exactly the same time, then there will always be a problem. (Except the code uses a Mutex, so I'm testing that there is not one.)

I am not sure I am following here. What is the goal you want to achieve?

> Or you design your code to not have the race condition at all (better).

I have. I've used a Mutex to eliminate the possibility of a race condition. Now I need to test that that works.

A mutex guarantees that there are no two threads in the critical
section. You do not need to prove that it works that way. Others have
done that for you already. If you want to do something like that you
could do it like in the attached script.

Kind regards

robert

p.rb (357 Bytes)

···

On Thu, Apr 18, 2019 at 10:50 AM Andy Jones <Andy.Jones@jameshall.co.uk> wrote:

--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can
- without end}
http://blog.rubybestpractices.com/

But you are aware that this is not reliably possible no matter what you attempt, are you?

It's literally not possible in MRI because of the GIL. Is that what you mean? Yes, but jRuby doesn't have one, and two threads can genuinely call the same method at the same time. My test works fine in jRuby, but I need to ensure I've not broken the code in MRI, too, so the test has to run there as well.

Or do you mean that I can't guarantee for certain that two calls will happen at exactly the same time? Unfortunately true. That's why I create 50 threads, each calling the method, instead of two. In the code you attach you do exactly the same thing: call the method multiple times in threads hoping to get a collision.

A mutex guarantees that there are no two threads in the critical section. You do not need to prove that it works that way. > Others have done that for you already.

Well, there goes the entire of TDD... How do I know I've used a mutex correctly? That I've synchronised the right bit of code? That I haven't introduced other problems? This is why we test. Or, why I test, anyway; YMMV.

For the record, I've re-read the docs on Thread and I think you are right: #run doesn't guarantee that execution will switch to the thread. (That's a hell of a gotcha.) I've modified the test so that it counts the number of threads that _did_ get woke, and uses that to work out what the expected results are. It seems to work fine now. Thanks for your help.

Click here to view Company Information and Confidentiality Notice.<http://www.jameshall.co.uk/index.php/small-print/email-disclaimer&gt;

Please note that we have updated our privacy policy in line with new data protection regulations. Please refer to our website to view the ways in which we handle your data.

> But you are aware that this is not reliably possible no matter what you attempt, are you?

It's literally not possible in MRI because of the GIL. Is that what you mean?

No. This has nothing to do with GIL. The reason is that the kernel is
free to schedule threads at its discretion. So even if 50 threads are
ready to run at the same time there are no guarantees as to when they
get CPU - at least on a non realtime OS.

Yes, but jRuby doesn't have one, and two threads can genuinely call the same method at the same time. My test works fine in jRuby, but I need to ensure I've not broken the code in MRI, too, so the test has to run there as well.

They *can* but you cannot force them to do so!

Or do you mean that I can't guarantee for certain that two calls will happen at exactly the same time? Unfortunately true. That's why I create 50 threads, each calling the method, instead of two. In the code you attach you do exactly the same thing: call the method multiple times in threads hoping to get a collision.

Exactly.

> A mutex guarantees that there are no two threads in the critical section. You do not need to prove that it works that way. > Others have done that for you already.

Well, there goes the entire of TDD... How do I know I've used a mutex correctly? That I've synchronised the right bit of code? That I haven't introduced other problems? This is why we test. Or, why I test, anyway; YMMV.

Well, yes, but for a single method the test is pretty uninteresting.
If you want to do some multi threading stress testing you would have
to invoke several different methods / code paths that are all supposed
to use the same mutex. Even then you are not guaranteed to get the
Scott free card for multithreading - it will only tell you of issues
if they come up. Multi threading is one of the areas where I would say
testing can only be of limited help. Code / design reviews are
important here.

For the record, I've re-read the docs on Thread and I think you are right: #run doesn't guarantee that execution will switch to the thread. (That's a hell of a gotcha.) I've modified the test so that it counts the number of threads that _did_ get woke, and uses that to work out what the expected results are. It seems to work fine now. Thanks for your help.

Thread#run is not a method that should be used. Rather you use a
condition variable or other thread synchronization means. A barrier
could work well, too. (Not sure I have seen one in Ruby libraries
though but you could implement one using a ConditionVariable.)

Cheers

robert

···

On Tue, Apr 23, 2019 at 9:39 AM Andy Jones <Andy.Jones@jameshall.co.uk> wrote:

--
[guy, jim, charlie].each {|him| remember.him do |as, often| as.you_can
- without end}
http://blog.rubybestpractices.com/