Any solaris gurus out there?
I’m having trouble porting some multi-thread, multi-process code from
linux to solaris. I’ve already dealt with (or tried to deal with) some
differences in flock (solaris flock is based on fcntl locks), like the
fact that closing any descriptor for a file releases the locks that
other threads hold on that file.
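To make the pitfall concrete, here is a hypothetical standalone sketch (file names and structure invented here, not taken from solaris-bug.rb). On Solaris, File#flock is built on fcntl record locks, and POSIX fcntl locks are released when *any* descriptor for the file is closed by the process, even one opened and closed by a different thread:

```ruby
require "tempfile"

tmp = Tempfile.new("lock-demo")
path = tmp.path

f1 = File.open(path, "r+")
got_lock = f1.flock(File::LOCK_EX)  # "thread A" takes the lock via f1

f2 = File.open(path, "r")           # "thread B" opens the same file...
f2.close                            # ...and closes it. With real BSD
                                    # flock (linux) f1 keeps the lock;
                                    # with the fcntl-based emulation
                                    # (solaris) the lock is silently
                                    # dropped at this point.
f1.flock(File::LOCK_UN)
f1.close
```

On linux this runs without incident because flock locks belong to the open file description, not the process; the danger is that the same code is quietly unsafe under the fcntl semantics.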
I’ve managed to isolate the problem in a fairly simple test program. It’s at
http://path.berkeley.edu/~vjoel/ruby/solaris-bug.rb
The program creates /tmp/test-file-lock.dat, which holds a marshalled
fixnum starting at 0. Then it creates Np processes each with Nt threads
which do a random sequence of reads and writes using some locking
methods. The writes just increment the counter.
When a process is done, it writes the number of times it incremented the
counter to the file /tmp/test-file-lock.dat#{pid}. Then the main process
adds these up and compares with the contents of the counter file. The
point of this is to test for colliding writers.
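For readers who don’t want to fetch the script, here is a minimal single-file sketch of the same protocol (counts and names are made up here, not the ones in solaris-bug.rb): each worker increments a marshalled counter under an exclusive flock, and the parent checks the final value against the expected total.

```ruby
require "tempfile"

tmp = Tempfile.new("counter")
counter = tmp.path
File.open(counter, "wb") { |f| f.write(Marshal.dump(0)) }

nprocs, nincr = 3, 50
pids = nprocs.times.map do
  fork do
    nincr.times do
      File.open(counter, "r+b") do |f|
        f.flock(File::LOCK_EX)      # writer takes an exclusive lock
        n = Marshal.load(f.read)
        f.rewind
        f.truncate(0)               # the dangerous window: the file is
        f.write(Marshal.dump(n + 1))#   empty until this write lands
      end                           # closing releases the lock
    end
  end
end
pids.each { |pid| Process.wait(pid) }

final = File.open(counter, "rb") do |f|
  f.flock(File::LOCK_SH)            # reader takes a shared lock
  Marshal.load(f.read)
end
# with correct locking, final == nprocs * nincr
```

If the shared/exclusive locks really exclude each other, a reader can never observe the truncated-but-not-yet-rewritten state; the failure below suggests that on solaris they don’t.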
But the program fails before that final test: it looks as though a
reader and a writer collide in a way that lets the reader see a
corrupt file.
A typical run fails like this. The counter 0…3 is a seconds clock:
$ ruby solaris-bug.rb
0
1
2
3
solaris-bug.rb:128:in `load': marshal data too short (ArgumentError)
It looks like a reader and a writer are accessing the file at the same
time, and the writer has just truncated the file (line 137 of the
script) when the reader tries to read it.
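One workaround I could fall back on (an alternative technique, not something from solaris-bug.rb) is to never truncate in place at all: write the new contents to a temporary file in the same directory and rename() it over the old one. rename(2) is atomic on POSIX filesystems, so a concurrent Marshal.load sees either the old complete file or the new complete file, never a truncated one. A sketch, with invented helper and file names:

```ruby
require "tempfile"

# Replace the contents of path atomically: a reader opening path at any
# moment sees a complete marshalled value, never an empty/partial file.
def replace_atomically(path, value)
  dir = File.dirname(path)
  tmpname = File.join(dir, ".#{File.basename(path)}.#{Process.pid}.tmp")
  File.open(tmpname, "wb") { |f| f.write(Marshal.dump(value)) }
  File.rename(tmpname, path)   # atomic on POSIX filesystems
end

tmp = Tempfile.new("atomic-demo")
path = tmp.path
replace_atomically(path, 41)
replace_atomically(path, 42)
value = Marshal.load(File.binread(path))
```

Note this only prevents torn reads; a read-modify-write increment still needs some lock (or other serialization) to avoid lost updates.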
This happens:
- on solaris, quad cpu: ruby 1.7.3 (2002-10-30) [sparc-solaris2.7]
- not on single-processor linux: ruby 1.7.3 (2002-12-12) [i686-linux]
- not on dual SMP linux: ruby 1.6.7 (2002-03-01) [i686-linux]
Also, the bug requires both of:
- thread_count >= 2
- process_count >= 2
Also, the bug requires both read and write operations (i.e., the
random choice must take each branch often enough, say 50/50).