Process.fork weirdness

Hi all,

I've run into some strange problems when forking child processes and
logging to a file. Here's a simplified example:

test = open("output.txt", "w")
1.upto(10) do |i|
  test.print(i.to_s + ",")
  Process.fork {sleep(10)}
  test.puts i
end
Process.waitall

Now I would expect this as output:

1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
9,9
10,10

However, the above script produces this instead:

1,1,1
2,1,1
2,2
3,1,1
2,2
3,3
4,1,1
2,2
3,3
4,4
5,1,1
2,2
3,3
4,4
5,5
6,6
7,1,1
2,2
3,3
4,4
5,5
6,1,1
...

It goes on until both values reach 10, as if there were two nested
loops. Note that 7,1,1 appears before 6,1,1, though. I have absolutely
no idea how this output is produced, but it must be caused by the
forking: if I remove Process.fork, I get the expected output. Likewise,
if I open and close the file around each write, I get the expected
output:

1.upto(10) do |i|
  test = open("output.txt", "a")
  test.print(i.to_s + ",")
  test.close
  Process.fork {sleep(10)}
  test = open("output.txt", "a")
  test.puts i
  test.close
end
Process.waitall

I spent a lot of time figuring out that my algorithm was in fact
working fine and that my way of writing the output was causing the
problems. I'd really like to know the reason for this behaviour. Any
suggestions?

Regards,
janus

···

--
Posted via http://www.ruby-forum.com/.

It goes on until both values reach 10, as if there were two nested
loops. Note that 7,1,1 appears before 6,1,1, though. I have absolutely
no idea how this output is produced, but it must be caused by the
forking: if I remove Process.fork, I get the expected output.

You need to flush the output stream before forking so that the child process doesn't get any buffered data. Otherwise, every time the child process exits, it will flush the duplicate data out to the 'output.txt' file.

···

On Jul 12, 2009, at 3:04 PM, Janus Bor wrote:

test = open("output.txt", "w")
1.upto(10) do |i|
   test.print(i.to_s + ",")
   test.flush # ensures that forked child has clean IO buffers.
   Process.fork {sleep(10)}
   test.puts i
end
Process.waitall

Gary Wright wrote:

You need to flush the output stream before forking so that the child process doesn't get any buffered data. Otherwise, every time the child process exits, it will flush the duplicate data out to the 'output.txt' file.

Huh, I never knew that. That's dangerous if the parent process is multithreaded, isn't it? Because even if the forking thread flushes all open files, other threads can keep filling buffers before the fork happens.

In process.c, rb_f_fork() flushes stdout and stderr before forking, so at least these two buffers are protected from other ruby threads. It seems like there should be an option to flush everything.
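A "flush everything" fork could be sketched with ObjectSpace — the helper name below is invented, and as noted above this still races with any thread that writes between the flush and the fork:

```ruby
# Sketch: flush every live IO object before forking, so the child
# inherits empty write buffers. fork_with_flushed_ios is an invented
# name, not a real Ruby method.
def fork_with_flushed_ios(&block)
  ObjectSpace.each_object(IO) do |io|
    begin
      io.flush unless io.closed?
    rescue IOError
      # read-only or otherwise unflushable IOs can be skipped
    end
  end
  Process.fork(&block)
end
```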

···

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Thanks for the explanation!

···

--
Posted via http://www.ruby-forum.com/.

Joel VanderWerf wrote:

Gary Wright wrote:

You need to flush the output stream before forking so that the child process doesn't get any buffered data. Otherwise, every time the child process exits, it will flush the duplicate data out to the 'output.txt' file.

Huh, I never knew that. That's dangerous if the parent process is multithreaded, isn't it? Because even if the forking thread flushes all open files, other threads can keep filling buffers before the fork happens.

In process.c, rb_f_fork() flushes stdout and stderr before forking, so at least these two buffers are protected from other ruby threads. It seems like there should be an option to flush everything.

Maybe the best policy is to do any one of the following:

1. Avoid buffered output (puts, write), and instead use unbuffered calls (syswrite, send) or set io.sync=true.

2. Don't mix threads and fork, and flush everything before forking.

3. Create children only as

   fork {do_stuff; exit!}
   fork {do_stuff; exec ...}
   system "..."

and so on. (And be careful that the do_stuff doesn't flush any buffered IO that belongs to the parent.)

Most libraries aren't going to follow #1, and #2 seems impractical to enforce.
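For code you do control, though, #1 is a one-liner; a minimal sketch:

```ruby
require 'tempfile'

f = Tempfile.new('sync_demo')
f.sync = true              # writes bypass the userspace buffer entirely

f.print "before,"          # reaches the file immediately
pid = Process.fork { }     # child inherits an empty buffer: nothing to duplicate
Process.wait(pid)
f.print "after"
f.close
# the file now contains "before,after" exactly once
```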

Actually, I guess #3 is why I haven't seen this problem before... it's often a good idea to use exit! in fork to avoid duplicating at_exit() handlers anyway. I wish there were a method that the child could call before anything else to _clear_ (not flush) all buffers and also to clear the at_exit handlers.

···

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

It is probably a bad idea to have multiple threads writing to a shared file descriptor without proper synchronization. With proper synchronization you should be able to safely flush and fork.

···

On Jul 13, 2009, at 1:18 PM, Joel VanderWerf wrote:

Huh, I never knew that. That's dangerous if the parent process is multithreaded, isn't it? Because even if the forking thread flushes all open files, other threads can keep filling buffers before the fork happens.

Gary Wright wrote:

Huh, I never knew that. That's dangerous if the parent process is multithreaded, isn't it? Because even if the forking thread flushes all open files, other threads can keep filling buffers before the fork happens.

It is probably a bad idea to have multiple threads writing to a shared file descriptor without proper synchronization. With proper synchronization you should be able to safely flush and fork.

I was thinking of situations in which one program component runs in a
worker thread and manages its own files, not shared with the main thread
(but the descriptors are shared in the same process of course). If the
files are log files, they would probably have IO#sync=true, so no danger
there. But what about a library using IO#write to talk http, a rest
client, let's say? This would not want IO#sync=true. It might not
really be possible to make the library aware of the need for synchronization around buffered writes. In this case, maybe it would be better to "firewall" the http library in its own process, and communicate over unix sockets.

Anyway, thanks for making me aware of one more thing to worry about!

···

On Jul 13, 2009, at 1:18 PM, Joel VanderWerf wrote:

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Joel VanderWerf wrote:

But what about a library using IO#write to talk http, a rest
client, let's say? This would not want IO#sync=true. It might not
really be possible to make the library aware of the need for
synchronization around buffered writes.

I don't think you'd have stdio-type buffering on a socket. At least, I
don't find that I have to do sock.flush every time I send a message.

···

--
Posted via http://www.ruby-forum.com/.

Brian Candler wrote:

Joel VanderWerf wrote:

But what about a library using IO#write to talk http, a rest
client, let's say? This would not want IO#sync=true. It might not
really be possible to make the library aware of the need for
synchronization around buffered writes.

I don't think you'd have stdio-type buffering on a socket. At least, I don't find that I have to do sock.flush every time I send a message.

Ah, that makes sense. It would be only the libc and ruby buffers (in user space) that would get copied in the fork, not network stack buffers in kernel space.

But wait... Net::HTTP uses BufferedIO#writeline on sockets. That class doesn't itself buffer writes (only reads), but it is defined in terms of IO#write, which is buffered, isn't it?

I guess the reason buffering makes sense with Net::HTTP, as opposed to most other socket code, is that it writes header lines one at a time.

···

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Ah, that makes sense. It would be only the libc and ruby buffers (in user space) that would get copied in the fork, not network stack buffers in kernel space.

That is correct. IO data buffered in the kernel is not duplicated by a fork.

But wait... Net::HTTP uses BufferedIO#writeline on sockets. That class doesn't itself buffer writes (only reads), but it is defined in terms of IO#write, which is buffered, isn't it?

Ruby marks IO objects associated with sockets as 'synchronized' (i.e. unbuffered). Net::HTTP wraps the underlying ruby socket with BufferedIO but my quick inspection of the code seems to indicate that BufferedIO doesn't introduce a write buffer, just some debugging scaffolding and a helper method for writing lines terminated with '\r\n'.
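A quick check (on current Rubies, at least) confirms this difference between sockets and plain files:

```ruby
require 'socket'

a, b = UNIXSocket.pair
# Sockets are created with sync enabled: writes go straight to the
# kernel, with no userspace buffer for a fork to duplicate.
socket_sync = a.sync

# Plain files, by contrast, are buffered by default.
f = File.open("/dev/null", "w")
file_sync = f.sync
```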

···

On Jul 14, 2009, at 12:23 PM, Joel VanderWerf wrote:

Gary Wright wrote:

But wait... Net::HTTP uses BufferedIO#writeline on sockets. That class doesn't itself buffer writes (only reads), but it is defined in terms of IO#write, which is buffered, isn't it?

Ruby marks IO objects associated with sockets as 'synchronized' (i.e. unbuffered).

Good to know, thanks. I always just used send/recv to be sure, but now I know that's not necessary.

There really don't seem to be many cases in which fork is going to cause buffer corruption, then, which explains why the issue doesn't come up much. We can rule out the cases of sockets (sync=true), log files (sync=true), most binary data files (syswrite), and pipes (usually a good idea to set sync=true, AFAIK). That leaves batch filtering of files, which is most often performed on stdin/stdout, and these descriptors are flushed in ruby's #fork wrapper.

It's a real joy to retrace the steps of ruby developers, and realize how many wise decisions they made.

One thing left to watch out for is race conditions in (for example):

File.open(..., "w") do |f|
   YAML.dump(..., f)
   ### before the file is closed, suppose another thread forks and exits
end

This example shows it in action:

   require 'yaml'

   a = [1,2,3]

   th = Thread.new do
     File.open("/tmp/foo.yml", "w") do |f|
       YAML.dump a, f
       sleep 3
     end
   end

   fork do
     sleep 1
   end

   th.join

The result is the foo.yml has two concatenated copies of the correct yaml string.

This doesn't happen with Marshal.dump. Probably, since it is core ruby, this possibility was anticipated, though I don't see where marshal.c calls fflush, and sync is false after the #dump.
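If you control the fork site, the duplication in the YAML example goes away when the child uses exit!, which skips the flush of the inherited buffer:

```ruby
require 'yaml'
require 'tempfile'

a = [1, 2, 3]
tmp = Tempfile.new('foo')
path = tmp.path

th = Thread.new do
  File.open(path, "w") do |f|
    YAML.dump a, f     # small dump: stays in the userspace buffer
    sleep 0.5          # file is still open (and buffered) across the fork
  end
end

sleep 0.2              # let the thread get its data into the buffer first
pid = fork { exit! }   # exit! skips flushing the inherited buffer
Process.wait(pid)
th.join
```

With a plain `fork { }` the child's clean exit would flush its copy of the buffer, and the file would again contain two concatenated copies.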

···

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

That leaves batch filtering of files, which is most often performed on stdin/stdout, and these descriptors are flushed in ruby's #fork wrapper.

Interesting. I didn't realize that Kernel#fork flushes stdout and stderr. That should probably be in the rdoc text for fork.

One thing left to watch out for is race conditions in (for example):

File.open(..., "w") do |f|
   YAML.dump(..., f)
   ### before the file is closed, suppose another thread forks and exits
end

I can think of three ways to avoid these sorts of race conditions:

-- do all forking before opening files
-- do all forking before creating any threads
-- use a synchronization mechanism to control the timing of
    forks in a multi-threaded process

In all cases I think it requires library code to be aware of these constraints. I'm not sure there is a way to protect buffered IO objects against a library that can fork at indeterminate times.
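The third option could be sketched like this (the lock and method names are invented); every buffered writer and the fork site share one lock:

```ruby
FORK_LOCK = Mutex.new   # shared by all threads that write buffered IO

# Writers fill buffers only while holding the lock:
def guarded_write(io, data)
  FORK_LOCK.synchronize { io.write(data) }
end

# The forking code takes the same lock, flushes, then forks, so no
# thread can refill a buffer between the flush and the fork:
def guarded_fork(ios, &block)
  FORK_LOCK.synchronize do
    ios.each(&:flush)
    Process.fork(&block)
  end
end
```

As noted above, this only works if every library touching those IO objects cooperates with the lock.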

Gary Wright

···

On Jul 14, 2009, at 4:29 PM, Joel VanderWerf wrote: