Quoting Matthew Smillie <M.B.Smillie@sms.ed.ac.uk>:
I think it's a little bit too simple to be put in the standard lib.
a.sort_by{rand}.each {|e| print e , ' '}
But it turns out not to be that simple: this can give a biased shuffle, depending on the underlying sort algorithm.
The linked Haskell implementation also gives a very thorough discussion, including a demonstration of how a stable sorting algorithm will yield biased results.
Note that the discussion there does not apply to the implementation quoted, but rather to this implementation:
a.sort_by{rand(a.size)}.each {|e| print e , ' '}
which I think we can all see the problem with. Absent any collisions in the values chosen by rand each time through the block, the short implementation quoted above is unbiased.
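A rough sketch of how lopsided the collision behaviour is, for an array of 100 elements: integer keys from rand(100) collide in essentially every run (the birthday problem), while float keys from plain rand essentially never do.

```ruby
n = 100
trials = 1_000

# Count the trials in which at least two of the n keys collide.
float_collisions = trials.times.count { Array.new(n) { rand }.uniq.size < n }
int_collisions   = trials.times.count { Array.new(n) { rand(n) }.uniq.size < n }

puts "float keys: #{float_collisions}/#{trials} trials had a collision"
puts "rand(#{n}): #{int_collisions}/#{trials} trials had a collision"
```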
You don't provide anything to back up this assertion. What makes this implementation unbiased? What makes the other implementation biased? What's the crucial difference between the two?
(technical note - #sort_by calculates the key assigned to each element only once. In other words, each element of 'a' is assigned one and only one random key, and is then sorted according to that key. See the docs for more info.)
The only difference between the two implementations is the range of the random numbers - the first implementation uses a float in the range [0, 1.0), the second an integer in the range (0...a.size). For sorting purposes, though, the particular range of the keys is irrelevant; all that matters is that the keys are comparable (you could use strings if you felt like it, too).
This is proven very simply: given a finite set of N keys in the float range [0, 1.0), they can be mapped to a finite set of keys in the integer range (0...N) without changing their sort order: simply sort the float keys, and then enumerate them. The result is a set of integer keys with the identical sort order as the float keys. The reverse mapping is just as simple.
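That mapping can be sketched in a few lines of Ruby: enumerate the sorted float keys to get integer ranks, and sorting by either key set produces the identical order (assuming the float keys are distinct, which collisions make overwhelmingly likely for a handful of floats).

```ruby
float_keys = Array.new(10) { rand }            # one float key per element
rank = float_keys.sort.each_with_index.to_h    # float key => integer rank

elements = (0...10).to_a
by_float = elements.sort_by { |i| float_keys[i] }
by_rank  = elements.sort_by { |i| rank[float_keys[i]] }
p by_float == by_rank    # the two key sets sort identically
```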
In other words: the implementations are identical with respect to the sort. The sort is the only re-ordering performed, and so is responsible for any bias in the resulting shuffle. And therefore, if there's a bias in one implementation, there's a bias with the other one as well.
Regarding whether or not bias actually exists, I didn't say that the sorting implementation is definitely going to be biased, merely that implementations which rely on a sort are potentially biased *depending on the underlying sorting algorithm*.
This was precisely the problem in the Haskell paper - it assumed a stable sort, which introduces bias. I believe Ruby uses quicksort, which isn't stable, and so the method isn't biased. However, Ruby makes no guarantees about the underlying sorting algorithm; were it ever to change, it could introduce bias into the shuffle. One such change might be a sort with locally-stable behaviour, such as a quicksort that optimises by switching to insertion sort on small partitions.
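The stable-sort bias is easy to see by simulation. Since Ruby's own sort_by makes no stability promise, this sketch (on a modern Ruby) forces stability by breaking ties on the original index; with two elements and keys drawn from rand(2), the keys tie half the time, and a stable sort keeps the original order on every tie.

```ruby
counts = Hash.new(0)
trials = 100_000
trials.times do
  # Sorting on [key, original index] forces stable behaviour on tied keys.
  order = [0, 1].sort_by.with_index { |e, i| [rand(2), i] }
  counts[order] += 1
end

# A fair shuffle would give each order 50%; here the tie cases all
# keep [0, 1], so it shows up roughly 75% of the time.
counts.each { |order, c| printf("%s: %.1f%%\n", order.inspect, 100.0 * c / trials) }
```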
If (as we're assuming) an unbiased shuffle is important, then it's dangerous to rely on non-guaranteed implementation details to ensure that.
It's still less efficient than a Fisher-Yates shuffle (at least in big-O notation; as a practical matter, I'd need to see some benchmarks)
Acceptable benchmarks always depend on your intended use, but here are some simple wall-clock ones on some large arrays (large arrays because that's the case where performance might matter). The results are that f_y takes ~75% of the time of sort in the first case, and ~60% of the time in the second case, which is nice, but not nearly what the complexity predicts (is it ever?). Makes me wonder what the difference would be if they were in C.
matthew smillie
# a = (1..100000).map { rand(100000) }
user system total real
in-place f_y 1.230000 0.010000 1.240000 ( 1.320484)
f_y_shuffle 1.370000 0.010000 1.380000 ( 1.471718)
sort_shuffle 1.870000 0.040000 1.910000 ( 2.004320)
# a = (1..1000000).map { rand(1000000) }
user system total real
in-place f_y 12.670000 0.080000 12.750000 ( 13.202975)
f_y_shuffle 13.980000 0.150000 14.130000 ( 15.091745)
sort_shuffle 23.260000 0.410000 23.670000 ( 24.704380)
require 'benchmark'
include Benchmark

class Array
  # In-place Fisher-Yates: swap each element with a randomly
  # chosen element from the not-yet-shuffled remainder.
  def shuffle!
    each_index do |i|
      r = rand(size - i) + i
      self[i], self[r] = self[r], self[i]
    end
  end

  # Non-destructive variant: shuffle the indices, then read the
  # elements out in that order.
  def shuffle
    values_at(*(0...size).to_a.shuffle!)
  end

  # Shuffle by sorting on random float keys.
  def sort_shuffle
    sort_by { rand }
  end
end

# test array
a = (1..100000).map { rand(100000) }
bm(10) do |b|
  b.report("in-place f_y") { a.shuffle! }
  b.report(" f_y_shuffle") { a.shuffle }
  b.report("sort_shuffle") { a.sort_shuffle }
end

a = (1..1000000).map { rand(1000000) }
bm(10) do |b|
  b.report("in-place f_y") { a.shuffle! }
  b.report(" f_y_shuffle") { a.shuffle }
  b.report("sort_shuffle") { a.sort_shuffle }
end
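Separately from the timing question, a quick sanity check on bias: count the six orders of a three-element array under a Fisher-Yates shuffle (a minimal standalone copy, in the same spirit as the shuffle! above) and each order should come up near 1/6.

```ruby
# Minimal non-destructive Fisher-Yates, for a standalone check.
def fy(ary)
  a = ary.dup
  a.each_index do |i|
    r = rand(a.size - i) + i
    a[i], a[r] = a[r], a[i]
  end
  a
end

counts = Hash.new(0)
trials = 60_000
trials.times { counts[fy([1, 2, 3])] += 1 }

# Six possible orders, each expected near 16.7%.
counts.sort.each { |order, c| printf("%s: %.2f%%\n", order.inspect, 100.0 * c / trials) }
```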
···
On Jun 20, 2006, at 4:08, Daniel Martin wrote:
On Jun 19, 2006, at 9:08, Kroeger, Simon (ext) wrote: