Would it be more convenient to have a random_each in the Ruby stdlib?

All of a sudden I'm beginning to get major performance problems when calling #shuffle!. Am I the only one who's experiencing this?

It does so on both 1.8 and 1.9.

Hmm, I'd better investigate further.

Quoting Matthew Smillie <M.B.Smillie@sms.ed.ac.uk>:

Note that the discussion there does not apply to the implementation quoted, but rather to this implementation:

a.sort_by{rand(a.size)}.each {|e| print e , ' '}

which I think we can all see the problem with. Absent any collisions in the values chosen by rand each time through the block, the short implementation quoted above is unbiased.

You don't provide anything to back up this assertion. What makes
this implementation unbiased? What makes the other implementation
biased? What's the crucial difference between the two?

I'm sorry; I thought it would be obvious to anyone who had read the Haskell
paper. In any case, the problem is the likelihood of collisions. When you use
just {rand}, the chance of a collision is very low, on the order of 2**(-b),
where b is the number of bits of precision in the random number implementation.
When you use {rand(a.size)}, the chance of a collision is comparatively high.
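
To put a rough number on that, here's a quick sketch that estimates how often
at least two of n keys collide (collision_rate is a made-up helper, just for
this illustration):

   # collision_rate is a made-up helper for this sketch.
   # It draws n keys per trial and reports how often at least two collide.
   def collision_rate(n, trials = 10_000)
     hits = 0
     trials.times do
       keys = Array.new(n) { yield }
       hits += 1 if keys.uniq.size < n
     end
     hits.to_f / trials
   end

   n = 3
   puts collision_rate(n) { rand }     # ~0.0  -- Float keys essentially never collide
   puts collision_rate(n) { rand(n) }  # ~0.78 -- 1 - 3!/3**3 = 7/9, collisions dominate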

When keys collide, the sort order you get depends on the implementation of the
sorting algorithm; note that it doesn't matter whether or not the sort is a
stable sort - you get whatever order the sort function ends up with when given
identical keys. (It appears that Ruby's sort_by is stable, but that's
irrelevant; you could replace "sort_by" with "reverse.sort_by" and you'd still
get bias, just bias of a different kind.)

When keys don't collide, the sorting algorithm is irrelevant (well, so long as
the algorithm works at all): all sorting algorithms will put the list in the
same order given that collection of keys.

Now, some empirical numbers as evidence. First let's try the algorithm I claim
is unbiased. Let's see what the first number is when shuffling (1..3):

irb(main):024:0> a = Hash.new(0); 99999.times { a[(1..3).sort_by{rand}[0]] += 1 }; a
=> {1=>33348, 2=>33393, 3=>33258}

Note that each value has approximately the same likelihood, as we would expect.
I suppose we could go and do a chi-squared test on the results, but I'm willing
to just eyeball it and say that that's a good result. I'll note that the
average initial value is 1.99909999099991, which is really close to 2.
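
For anyone who'd rather not eyeball it, here's a sketch of that chi-squared
computation on the counts above (just the arithmetic, not part of the run):

   # Chi-squared statistic for the observed counts against a uniform
   # expectation.  With 2 degrees of freedom, anything under ~5.99 is
   # consistent with uniformity at the 5% level.
   observed = { 1 => 33348, 2 => 33393, 3 => 33258 }
   expected = observed.values.inject(:+) / observed.size.to_f
   chi_sq   = observed.values.inject(0.0) { |s, o| s + (o - expected)**2 / expected }
   puts chi_sq  # => roughly 0.28, comfortably consistent with uniform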

Now let's try the algorithm which the Haskell paper was talking about, and which
I said was biased:

irb(main):028:0> a = Hash.new(0); 99999.times { a[(1..3).sort_by{rand(3)}[0]] += 1 }; a
=> {1=>52017, 2=>18490, 3=>29492}

Note how comparatively unlikely it is to get a 2 first. The average initial
value is around 1.77.

Note that this bias lessens as the array (and therefore, the pool of keys) grows
larger. The Haskell paper, when demonstrating bias, used an array of size two.
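
Here's a quick check of that shrinking bias (my own sketch, not from the
paper): the mean first element of (1..n).sort_by{rand(n)} drifts toward the
unbiased (n+1)/2 as n grows.

   # First-element bias of sort_by{rand(n)} for a few sizes.  An unbiased
   # shuffle of (1..n) has a mean first element of (n+1)/2.
   [2, 3, 10, 50].each do |n|
     trials = 20_000
     sum = 0
     trials.times { sum += (1..n).sort_by { rand(n) }.first }
     printf("n=%2d  mean first element: %.3f  (unbiased: %.1f)\n",
            n, sum.to_f / trials, (n + 1) / 2.0)
   end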

        For sorting purposes, though, the particular range of the keys is
        irrelevant; all that matters is that the keys are comparable (you
        could use strings if you felt like it, too).

Except, as I've pointed out, no. It's all about how likely you are to get key
collisions. With a limited integer range to choose from, the chance is
comparatively high.

Regarding whether or not bias actually exists, I didn't say that the
sorting implementation is definitely going to be biased, merely that
implementations which rely on a sort are potentially biased
*depending on the underlying sorting algorithm*.

Actually, the sort_by shuffling algorithm with a high chance of key collision is
biased regardless of the details of the underlying sorting algorithm. Let's
take a two-element list, as the Haskell paper did, and consider two possible
sorting algorithms for sorting two-element lists:

ALGORITHM A: if a[0].key >= a[1].key then a.reverse else a end
ALGORITHM B: if a[0].key >  a[1].key then a.reverse else a end

In fact, every sorting algorithm is going to be equivalent to one of these when
applied to a two-element list (because, given equal keys, there are only two
choices for what to do with the list).

Now, let's look at what happens when we use %w(x y).sort_by{rand(2)}:

1/4 of the time: The keys chosen are [0,1] - both algorithms produce %w(x y)
1/4 of the time: The keys chosen are [1,0] - both algorithms produce %w(y x)
1/2 of the time: The keys chosen are the same.
                  Algorithm A produces %w(y x)
                  Algorithm B produces %w(x y)

That is, a system with an algorithm A-type sort will produce %w(y x)
three-fourths of the time. A system with an algorithm B-type sort will produce
%w(x y) three-fourths of the time. Both systems are biased, and only algorithm
B-type systems have a stable sort.
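
A quick simulation backs this up (two_sort is a made-up name; the
reverse_on_tie flag just selects between the two algorithms above):

   # Simulate shuffling %w(x y) with keys from rand(2) under the two possible
   # tie-breaking behaviours: reverse_on_tie = true is algorithm A, false is B.
   def two_sort(list, reverse_on_tie)
     k0, k1 = rand(2), rand(2)
     if k0 > k1 || (k0 == k1 && reverse_on_tie)
       list.reverse
     else
       list
     end
   end

   counts = { 'A' => Hash.new(0), 'B' => Hash.new(0) }
   10_000.times do
     counts['A'][two_sort(%w(x y), true).join] += 1
     counts['B'][two_sort(%w(x y), false).join] += 1
   end
   p counts  # A lands on "yx" ~75% of the time, B on "xy" ~75%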

Note that if we threw out and re-shuffled all the cases where there's any key
collision, we'd have an unbiased shuffle regardless of sorting algorithms.
We'd also have to reshuffle a two-element list an average of two times each
time we wanted a proper shuffle, a three-element list 4.5 times on average, a
four-element list 10.67 times, and an n-element list n**n / n! times. Not
something you want to do.
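
For illustration only, here's a sketch of what that reject-and-retry shuffle
would look like (rejection_shuffle is a made-up name):

   # Reject-and-retry sort_by shuffle: unbiased, but the expected number of
   # attempts grows as n**n / n!.  rejection_shuffle is a made-up helper.
   def rejection_shuffle(ary)
     attempts = 0
     loop do
       attempts += 1
       keys = ary.map { rand(ary.size) }
       if keys.uniq.size == ary.size  # no collisions, so the order is unambiguous
         return ary.zip(keys).sort_by { |_, k| k }.map { |e, _| e }, attempts
       end
     end
   end

   tries = Array.new(1_000) { rejection_shuffle([1, 2, 3]).last }
   puts tries.inject(:+) / 1000.0  # hovers around 4.5 for a three-element list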

It's still less effecient than a Fisher-Yates
shuffle (at least in big-O notation; as a practical matter, I'd need to see
some benchmarks)

Acceptable benchmarks always depend on your intended use, but here
are some simple wall-clock ones on some large arrays (large arrays
because that's the case where performance might matter). The results
are that f_y takes ~75% of the time of sort in the first case, and
~60% of the time in the second case, which is nice, but not nearly
what the complexity predicts (is it ever?). Makes me wonder what the
difference would be if they were in C.

Ew. I meant "efficient". In any case, it's nice to see results that agree with
the theory. I had suspected that the sort_by method might prove faster, because
a large proportion of its computation is done in C (the actual sort) as
compared to the all-Ruby F-Y shuffle.
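
For anyone who wants to reproduce that kind of comparison, here's a rough
sketch along the same lines (numbers will obviously vary by machine and Ruby
version):

   require 'benchmark'

   ary = (1..100_000).to_a

   Benchmark.bm(12) do |bm|
     bm.report('sort_by')      { ary.sort_by { rand } }
     bm.report('fisher-yates') do
       a = ary.dup
       # Classic in-place Fisher-Yates: swap each slot with a random
       # earlier-or-equal index.
       (a.size - 1).downto(1) { |i| j = rand(i + 1); a[i], a[j] = a[j], a[i] }
     end
   end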

I gotta get a ruby translation of my sig.

···

On Jun 20, 2006, at 4:08, Daniel Martin wrote:

--
@/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/;
map{y/X_/\n /;print}map{pop@$_}@/for@/

Daniel Schierbeck wrote:

All of a sudden I'm beginning to get major performance problems when calling #shuffle!. Am I the only one who's experiencing this?

It does so on both 1.8 and 1.9.

Hmm, I'd better investigate further.

Never mind, it seemed to go away when I moved the code into its own file. Weird...

Though this still relies on Array#shuffle!, which I can't say is a good or bad thing.

Since #shuffle! would alter the receiver, it shouldn't be defined in Enumerable. Other ordered collections could implement their own versions.

I agree with you there, but it's not quite what I meant. More clearly, Enumerable#entries returns an array with no guarantees about the ordering. Enumerable#sort and #shuffle return similar arrays, but include guarantees about the ordering. Enumerable#shuffle uses Array#shuffle! internally to produce the final array, which just doesn't quite click for me. It's just an aesthetic disagreement, and I can't quite put my finger on anything concrete.

I was wondering, is Ruby's internal variable "switching" faster than using a temporary variable? #shuffle! can be written like this, but I don't know if it's faster.

My hunch is that even if we could reliably measure a difference in a benchmark, it would disappear into the static in a practical system. At least, I'd hope that's the case - one being significantly faster than the other would set off some alarm bells for me about Ruby's implementation.
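
If someone does want to measure it anyway, a sketch like this would do; I'd
expect the gap, if any, to be noise:

   require 'benchmark'

   n = 1_000_000
   Benchmark.bm(10) do |bm|
     bm.report('parallel') do
       a, b = 1, 2
       n.times { a, b = b, a }              # Ruby's parallel-assignment "switch"
     end
     bm.report('temporary') do
       a, b = 1, 2
       n.times { tmp = a; a = b; b = tmp }  # explicit temporary variable
     end
   end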

matthew smillie.

Matthew Smillie wrote:

Since #shuffle! would alter the receiver, it shouldn't be defined in Enumerable. Other ordered collections could implement their own versions.

I agree with you there, but it's not quite what I meant. More clearly, Enumerable#entries returns an array with no guarantees about the ordering. Enumerable#sort and #shuffle return similar arrays, but include guarantees about the ordering. Enumerable#shuffle uses Array#shuffle! internally to produce the final array, which just doesn't quite click for me. It's just an aesthetic disagreement, and I can't quite put my finger on anything concrete.

I can see what you mean -- you'd rather have #shuffle! depend on #shuffle than the other way round? I.e.

   class Array
     def shuffle!
       replace(shuffle)
     end
   end

I guess that could be possible.

   module Enumerable
     def shuffle
       ary = entries
       ary.each_index do |i|
         # Fisher-Yates: swap with a random index from the untouched tail
         # (the original rand(ary.size - 1) + i could index past the end)
         r = i + rand(ary.size - i)
         ary[i], ary[r] = ary[r], ary[i]
       end
       ary
     end
   end

Of course, it would not be much different from `entries.shuffle!'

Cheers,
Daniel