Ruby 2.3.3 Deadlocks from Open3

I've started to encounter an increasing number of "No live threads left.
Deadlock?" errors thrown from Open3 as load increases on our system, i.e.
as the consumers do more work. The system is a typical competing-consumer
model with a single producer and multiple consumers running across a
number of VMs, where multiple consumers can reside on the same VM. The
consumers are running Ruby 2.3.3 on Ubuntu 14.04 in EC2. The consumers use
Open3 to spawn a child process and read from stdout and stderr, executing
business logic based on the output.

I've seen these deadlock errors from Open3#popen2e, Open3#capture3, and
Open3#popen3. In the case of Open3#popen2e and Open3#popen3 I'm using
separate threads to read from stdout and stderr and calling Thread#value
on the wait thread to determine the exit status of the child process.
Open3#capture3 already reads from stdout and stderr in separate threads,
so the fact that I'm seeing these errors from Open3#capture3 as well
suggests the issue may not be in our application code.
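
For reference, the capture3 calls are along these lines ('some_command' is
a stand-in for the real command):

    require 'open3'

    # capture3 spawns its own threads internally to drain stdout and stderr.
    stdout_str, stderr_str, status = Open3.capture3('some_command')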
Based on the stacktrace, it seems the call to Thread#value is where the
Ruby runtime is detecting a potential deadlock, i.e. all threads are
either waiting on the same resource, dead, or sleeping. However, I'm
simply reading from stdout and stderr on separate threads and having the
main thread join the wait thread from Open3. There is no thread contention
for shared resources in my application code; at least not on purpose.
Having looked at dmesg and the VM metrics, there do not appear to have
been any abnormal memory or CPU issues around the time of the errors.
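
For context, the error itself comes from MRI's deadlock detector: when
every thread in the process is sleeping and none of them can ever be
woken, the VM aborts with a fatal error. A minimal illustration, unrelated
to our code:

    require 'thread'

    queue = Queue.new

    # This thread sleeps forever waiting for an item that never arrives.
    worker = Thread.new { queue.pop }

    # The main thread now sleeps in #join as well. With every thread asleep
    # and nothing left to wake them, MRI aborts:
    #   fatal: No live threads left. Deadlock?
    worker.join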
I have a workaround using Thread#join with a timeout, checking whether the
return value from Thread#join is nil to determine whether the join timed
out. In the event of a timeout I can execute some business logic and
retry. However, I'd like to determine the root cause of the issue. I'm
wondering if anyone else on the mailing list has encountered similar
issues and has any suggestions or guidance for debugging.

Here is a distilled version of what I'm doing, nothing complicated, and a
potential solution.
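
In the sketch below, 'some_command', process_line, TIMEOUT_SECONDS, and
handle_timeout_and_retry are stand-ins for the real command and business
logic:

    require 'open3'

    Open3.popen3('some_command') do |stdin, stdout, stderr, wait_thr|
      stdin.close

      # Read stdout and stderr on separate threads so neither pipe's
      # buffer fills up and blocks the child.
      out_reader = Thread.new { stdout.each_line { |line| process_line(:out, line) } }
      err_reader = Thread.new { stderr.each_line { |line| process_line(:err, line) } }

      out_reader.join
      err_reader.join

      # Thread#value joins the wait thread and returns its Process::Status.
      # This is the call the "No live threads left. Deadlock?" stacktrace
      # points at.
      status = wait_thr.value
      raise "child failed with status #{status.exitstatus}" unless status.success?
    end

And the workaround, replacing the bare call to Thread#value:

    # Join the wait thread with a timeout instead of blocking indefinitely.
    # Thread#join returns nil if the timeout elapses, in which case we run
    # some business logic and retry.
    if wait_thr.join(TIMEOUT_SECONDS)
      status = wait_thr.value # wait thread finished; returns immediately
    else
      # Thread#join timed out; handle it and retry.
      handle_timeout_and_retry
    end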

- Tucker


--
Tucker Barbour
tucker.barbour@gmail.com