DRb Mysterious Stops

I'm running a fairly complicated build and test system with DRb over
Ruby 1.8.6. It involves 12 Linux machines running several different
distro versions and one Windows machine.

Lately I've been having problems where once in a while the machines
involved in this system just stop communicating, and I can't figure out
why. I've found on occasion I can work around the problem by changing
the order of the operations or the frequency of them. It's more or less
random when it occurs.

The only thing I can think of is that this all started when I added suse
9.3 and 9.4 machines to this system.

The other possibility is that now I have 12 Linux machines and a Windows
machine all more or less arbitrarily talking with each other, so there
might be a slowly increasing probability of a deadlock that I'm suddenly
noticing because it's more likely with more machines.

I'm sitting here thinking of exotic ways TCP could be misconfigured out
of the box on suse 9. But deep in my soul I'm sure it's some stupid code
I wrote.

Anyway, the idea here is that a Windows machine sends messages to
several Linux machines and the Linux machines send back log messages and
occasionally a series of messages that represent the contents of a file.

If anyone has insight, I'd appreciate it. I'm running out of good ideas
here.

···

--
Darrin
--
Posted via http://www.ruby-forum.com/.

It might help to add

Thread.abort_on_exception = true

in case a drb thread is dying silently. (DRb might be smarter than that, though.)
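Setting Thread.abort_on_exception = true makes any thread's exception take the whole process down. A gentler complementary check is to join the threads you own and see whether any of them raised. The sketch below is only an illustration: the worker names and the failure_of helper are made up, and it does not cover threads that DRb creates internally.

```ruby
# Joins a thread and reports how it ended: nil if it finished cleanly,
# otherwise the message of the exception that killed it.
def failure_of(thread)
  thread.join
  nil
rescue StandardError => e
  e.message
end

# Hypothetical workers standing in for the threads of the real system.
workers = {
  "logger"   => Thread.new { :done },
  "transfer" => Thread.new { raise IOError, "socket went away" }
}

workers.each do |name, thread|
  message = failure_of(thread)
  puts(message ? "#{name} died: #{message}" : "#{name} finished cleanly")
end
```

Without the join (or abort_on_exception), the "transfer" thread above would simply disappear and the main program would carry on.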

···

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Joel VanderWerf wrote:

It might help to add

Thread.abort_on_exception = true

in case a drb thread is dying silently. (DRb might be smarter than that,
though.)

Tried that. No processes died or left any traces in my logs.

What I did get was more consistent [bad] behavior, at least for today.

It seems that any time my windows machine calls a method on my hosted
suse9 64 bit machine, _and_ I return a large return value from that
method, the conversation somehow gets "stuck". Large might be an array
of 8000 lines of text from a file.

I watched the conversation with wireshark and I saw that on one of my
failures, right before everything hung, there were 15 tcp dup acks.

That's all I've got so far. Any more help?

···

--
Darrin

Hi,
I had the same problem a while ago: my program simply stopped
communicating with the remote end, and I couldn't figure out why either.
At first I restarted the program periodically via cron, and later I
rewrote it to send just UDP messages. I only needed to signal another
process on another host, so this was a good option for me. The program
was running (and stopping) on FreeBSD.
If you have a lot of data to send, I'd consider using xml-rpc or soap
instead of drb.

Martin

···

On Thu, 2009-08-27 at 00:39 +0900, Darrin Thompson wrote:

That's all I've got so far. Any more help?

That many dup acks are somewhat suspicious but by themselves they
should not cause a lockup. If there is a network problem the connection
should break eventually. However, some dumps of the part of the
conversation that causes excessive packet duplication might be useful.
Can you replicate the packet duplication with something simple like
scp file transfer or the like?

Are some of the earlier machines also 64bit?

I am not sure how 32bit vs 64bit integers work with marshalling. It
should work but perhaps some testing to ensure it really works well
would be a good idea.

Thanks

Michal

···

2009/8/26 Darrin Thompson <darrinth@gmail.com>:

It seems that any time my windows machine calls a method on my hosted
suse9 64 bit machine, _and_ I return a large return value from that
method, the conversation somehow gets "stuck". Large might be an array
of 8000 lines of text from a file.

I watched the conversation with wireshark and I saw that on one of my
failures, right before everything hung, there were 15 tcp dup acks.

Alternatively implement file transfer on top of DRb, which could be
simply remote iterating through the file in chunks. That would avoid
issues with arbitrary large DRb method arguments or return values.
Although I have to say that my expectation would be that arbitrary
large Strings should not cause issues with DRb - that would sound like
a bug to me.
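A minimal sketch of that chunked approach, assuming a hypothetical ChunkedFileReader object on the serving side and a fetch_in_chunks helper on the calling side (neither name comes from DRb). The helper only needs an object that responds to read_chunk, so it works the same with a local instance as with a DRb::DRbObject proxy:

```ruby
# Hypothetical server-side object; in the real setup it would be the front
# object passed to DRb.start_service.
class ChunkedFileReader
  # Returns up to `length` bytes starting at `offset`, or nil at end of file.
  def read_chunk(path, offset, length)
    File.open(path, "rb") do |f|
      f.seek(offset)
      f.read(length)
    end
  end
end

# Calling side: `reader` may be a local object or a DRb proxy for one.
# Each remote call returns at most `chunk_size` bytes.
def fetch_in_chunks(reader, path, chunk_size = 16 * 1024)
  data = String.new
  offset = 0
  while (chunk = reader.read_chunk(path, offset, chunk_size))
    break if chunk.empty?
    data << chunk
    offset += chunk.bytesize
  end
  data
end
```

Over DRb the call would look something like fetch_in_chunks(DRb::DRbObject.new_with_uri(uri), "/path/on/server"), with uri and the path being whatever the real system uses. The chunk size is a guess; something well below the sizes that hang would be the place to start.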

Kind regards

robert

PS: I don't believe in IP misconfiguration either. :-)

···

2009/8/27 Martin Boese <boesemar@gmx.de>:

If you have a lot of data to send, I'd consider using xml-rpc or soap
instead of drb.

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Michal Suchanek wrote:

That many dup acks are somewhat suspicious but by themselves they
should not cause a lockup. If there is network problem the connection
should break eventually. However, some dumps of the part of the
conversation that causes excessive packet duplication might be useful.
Can you replicate the packet duplication with something simple like
scp file transfer or the like?

I can replicate it with a tiny tiny drb pair of programs.

On my SLES9 machine:
# cat test.rb
require 'drb'

class Echo
    def ping(length)
        return 'a' * length
    end
end

echo = Echo.new

DRb.start_service("druby://0.0.0.0:9000", echo)
DRb.thread.join

On my Windows machine:
require 'drb'

echo = DRb::DRbObject.new_with_uri("druby://172.31.192.159:9000")
response = echo.ping(ARGV[0].to_i)
puts response.length

When the program succeeds it prints the number given. When it fails,
it hangs until I kill it. I'm finding that short values always succeed,
like 1024. I get some successful and some failed when I provide 44230 as
the arg.
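Since a failed run hangs until killed, it may be easier to probe a range of sizes with each call wrapped in a timeout. This is a sketch only: attempt is a made-up helper, and fake_ping stands in for echo.ping so the snippet runs without a server. Whether Timeout can interrupt a blocked socket read on Ruby 1.8.6 on Windows is an assumption that would need checking.

```ruby
require 'timeout'

# Runs the block and returns [:ok, value], or [:timed_out, nil] if the block
# takes longer than `seconds`.
def attempt(seconds)
  [:ok, Timeout.timeout(seconds) { yield }]
rescue Timeout::Error
  [:timed_out, nil]
end

# Stand-in for the remote call; with DRb this would be echo.ping(length).
fake_ping = lambda { |length| 'a' * length }

[1024, 44230].each do |length|
  status, response = attempt(5) { fake_ping.call(length) }
  puts "#{length}: #{status} #{response ? response.length : '-'}"
end
```

Replacing fake_ping.call(length) with echo.ping(length) and looping over more sizes would show roughly where the failures start.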

I have saved packet traces of a successful run at 1024 and a failed run
at 1 Mb. The traces are a few K and 70+K respectively. I can provide
them privately.

Are some of the earlier machines also 64bit?

Yes.

I am not sure how 32bit vs 64bit integers work with marshalling. It
should work but perhaps some testing to ensure it really works well
would be a good idea.

A lot of other 32/64 bit conversations with other machines are working
fine, so I'm reluctant to go there.

···

--
Darrin

Robert Klemme wrote:

Although I have to say that my expectation would be that arbitrary
large Strings should not cause issues with DRb - that would sound like
a bug to me.

So trolling through the drb code I came across this:

    def load(soc) # :nodoc:
      begin
        sz = soc.read(4) # sizeof (N)
      rescue
        raise(DRbConnError, $!.message, $!.backtrace)
      end
      raise(DRbConnError, 'connection closed') if sz.nil?
      raise(DRbConnError, 'premature header') if sz.size < 4
      sz = sz.unpack('N')[0]
      raise(DRbConnError, "too large packet #{sz}") if @load_limit < sz
      begin
        str = soc.read(sz)
      rescue
        raise(DRbConnError, $!.message, $!.backtrace)
      end
      raise(DRbConnError, 'connection closed') if str.nil?
      raise(DRbConnError, 'premature marshal format(can\'t read)') if str.size < sz
      Thread.exclusive do
        begin
          save = Thread.current[:drb_untaint]
          Thread.current[:drb_untaint] = []
          Marshal::load(str)
        rescue NameError, ArgumentError
          DRbUnknown.new($!, str)
        ensure
          Thread.current[:drb_untaint].each do |x|
            x.untaint
          end
          Thread.current[:drb_untaint] = save
        end
      end
    end

Is it possible that the Thread.exclusive bit could deadlock on a Windows
machine?
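For reference, the framing that load expects is a 4-byte big-endian length followed by that many bytes of Marshal data. The sketch below imitates it with StringIO. dump_frame and load_frame are illustrative names, and the writing side is only assumed to mirror the reader quoted above. Note that both reads happen before Thread.exclusive is entered, so a TCP stream that stalls mid-message would block in the second read and never reach the exclusive section.

```ruby
require 'stringio'

# Builds a frame: 4-byte big-endian payload size, then the marshalled object.
def dump_frame(obj)
  payload = Marshal.dump(obj)
  [payload.bytesize].pack('N') + payload
end

# Reads one frame back. On a real socket both reads block until enough bytes
# arrive; here a truncated frame shows up as an EOFError instead.
def load_frame(io)
  header = io.read(4)
  raise EOFError, 'connection closed' if header.nil?
  raise EOFError, 'premature header' if header.bytesize < 4
  size = header.unpack('N')[0]
  payload = io.read(size)
  raise EOFError, 'premature payload' if payload.nil? || payload.bytesize < size
  Marshal.load(payload)
end

frame = dump_frame('a' * 44230)
puts frame.bytesize
puts load_frame(StringIO.new(frame)).length
```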

···

--
Darrin


DRb.start_service("druby://0.0.0.0:9000", echo, {:load_limit => 2**31}) ?

···

On 2009/08/29, at 1:55, Darrin Thompson wrote:

I can replicate it with a tiny tiny drb pair of programs.

Darrin Thompson wrote:

I can replicate it with a tiny tiny drb pair of programs.

And I think I've just ruled out ruby and DRb as the culprits here.

I ran a test like this:

ssh root@ipofbadmachine cat /dev/urandom | hexdump -C

I ran that from Windows XP in cygwin and it hangs after a few seconds.
Against other machines it runs for as long as I let it.
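The same kind of check can also be made from Ruby with bare TCP sockets, which takes ssh and DRb out of the picture while still running on both ends. A rough sketch, with serve_bytes and count_received as made-up helper names; the demo at the bottom uses the loopback address only so that it runs on one machine:

```ruby
require 'socket'

# Sends `total` bytes of filler to the first client that connects, then closes.
def serve_bytes(server, total, chunk = 8192)
  client = server.accept
  sent = 0
  while sent < total
    n = [chunk, total - sent].min
    client.write('a' * n)
    sent += n
  end
ensure
  client.close if client
end

# Connects and counts bytes until the peer closes the connection.
def count_received(host, port)
  received = 0
  TCPSocket.open(host, port) do |sock|
    while (data = sock.read(8192))
      received += data.bytesize
    end
  end
  received
end

# Loopback demo; port 0 lets the OS pick a free port.
server = TCPServer.new('127.0.0.1', 0)
sender = Thread.new { serve_bytes(server, 1_000_000) }
puts count_received('127.0.0.1', server.addr[1])
sender.join
server.close
```

For the real test, the serving half would run on the suspect machine and the counting half on the Windows box. If the count never prints, the stall is below Ruby's application protocols.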

Sorry for the noise.

···

--
Darrin

SEKI Masatoshi wrote:

DRb.start_service("druby://0.0.0.0:9000", echo, {:load_limit =>
2**31}) ?

No effect.

···

--
Darrin


SEKI Masatoshi wrote:

DRb.start_service("druby://0.0.0.0:9000", echo, {:load_limit =>
2**31}) ?

I'm breaking this with message sizes of ~50k.

···

--
Darrin


No problem. Btw, netcat is another tool for your network debugging
toolbox which might help for these network transmission tests:

Kind regards

robert

···

2009/8/31 Darrin Thompson <darrinth@gmail.com>:

Darrin Thompson wrote:

I can replicate it with a tiny tiny drb pair of programs.

And I think I've just ruled out ruby and DRb as the culprits here.
