Sorting by rand?

Browsing something at a bookstore recently, I saw an example of
an array shuffle allegedly implemented with
  x.sort { rand }
or something to that effect.

This seems sort of unlikely, to me, to be a correct shuffling
algorithm. I'm not even sure it's safe at all. I know that
doing the same thing to C qsort implementations occasionally
causes one to blow up spectacularly, because its assumption
that the comparison between any two items is constant gets
broken.

So, two vaguely related questions:
1. Is there a reasonably sane way to do this, should I ever find
myself wanting one?
2. Am I right in guessing that handing garbage to Array#sort
is probably crazy?

-s

Short answer: array.sort_by { rand }

Longer answer: http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/168963

Gary Wright

···

On Apr 29, 2007, at 10:01 PM, Peter Seebach wrote:

So, two vaguely related questions:
1. Is there a reasonably sane way to do this, should I ever find
myself wanting one?

I believe that the "something to that effect was probably:

     x.sort_by {rand}

sort_by takes a block which normally takes one argument. sort_by
calls this block once for each element of the enumeration and
associates that value with the element, it then sorts those values and
returns an array of the asssociated elements in the order of the
values. So for example:

(1..50).sort_by{|e| -e}
    ==>[50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36,
35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19,
18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

The block without the argument just assignes a random value to each element.

Perfectly safe as far as I can see.

x.sort {rand}

doesn't really work, whether or not it's safe. Rand without an
argument returns a float >= 0.0 and < 1.0, the block for sort is
expected to return -1, 0, or 1. Depending on the implementaiton of the
particular sort method, you might get different results. For arrays,
it looks like you'll get the same results with repeated calls:

irb(main):020:0> (1..10).to_a.sort {rand}
=> [10, 6, 1, 7, 3, 8, 5, 9, 4, 2]
irb(main):021:0> (1..10).to_a.sort {rand}
=> [10, 6, 1, 7, 3, 8, 5, 9, 4, 2]

Whereas sort_by seems to do what's expected.

irb(main):022:0> (1..10).to_a.sort_by {rand}
=> [5, 9, 2, 1, 7, 6, 4, 10, 8, 3]
irb(main):023:0> (1..10).to_a.sort_by {rand}
=> [4, 8, 5, 6, 2, 3, 10, 7, 1, 9]

···

On 4/29/07, Peter Seebach <seebs@seebs.net> wrote:

Browsing something at a bookstore recently, I saw an example of
an array shuffle allegedly implemented with
        x.sort { rand }
or something to that effect.

This seems sort of unlikely, to me, to be a correct shuffling
algorithm. I'm not even sure it's safe at all. I know that
doing the same thing to C qsort implementations occasionally
causes one to blow up spectacularly, because its assumption
that the comparison between any two items is constant gets
broken.

So, two vaguely related questions:
1. Is there a reasonably sane way to do this, should I ever find
myself wanting one?
2. Am I right in guessing that handing garbage to Array#sort
is probably crazy?

--
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

Peter Seebach wrote:

Browsing something at a bookstore recently, I saw an example of
an array shuffle allegedly implemented with
  x.sort { rand }
or something to that effect.

This seems sort of unlikely, to me, to be a correct shuffling
algorithm. I'm not even sure it's safe at all. I know that
doing the same thing to C qsort implementations occasionally
causes one to blow up spectacularly, because its assumption
that the comparison between any two items is constant gets
broken.

So, two vaguely related questions:
1. Is there a reasonably sane way to do this, should I ever find
myself wanting one?
2. Am I right in guessing that handing garbage to Array#sort
is probably crazy?

-s

Welll ... I do this in Excel all the time -- add a column of random
numbers to the right of the (CSV) file I want to shuffle, sort it about
five times, hitting F9 before each sort to change the random numbers,
and then deleting the random column and saving the file. Translate that
to Ruby and I think you've got it.

Well, okay. It works in practice. Is it guaranteed to work, and is
it actually reasonable to expect the output to be randomly shuffled?

Just for comparison, I tried:

l = a.length
a.each_index { |x| a = a[rand(l)] }

I'm pretty sure I got the generic shuffle algorithm right (Knuth FTW),
and if so, it produces quality random results in about 1/4 the time.

-s

···

In message <40D2708E-0D0C-42E4-82B3-E31D93481880@mac.com>, Gary Wright writes:

On Apr 29, 2007, at 10:01 PM, Peter Seebach wrote:

So, two vaguely related questions:
1. Is there a reasonably sane way to do this, should I ever find
myself wanting one?

Short answer: array.sort_by { rand }

Longer answer: http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-
talk/168963

In message <deb2337a0704291937g7384a9bej381a99282460dc07@mail.gmail.com>, "Rick DeNatale" write
s:

I believe that the "something to that effect was probably:

    x.sort_by {rand}

Yes.

Perfectly safe as far as I can see.

Ahh! Yes. Profiling it reveals this; it only calls rand once per
element. Sane, and efficient.

x.sort {rand}

I think I've seen also sort {rand <=> rand}, which is crazy.

-s

It seems expensive.

-s

···

In message <46358854.30705@cesmail.net>, "M. Edward (Ed) Borasky" writes:

Welll ... I do this in Excel all the time -- add a column of random
numbers to the right of the (CSV) file I want to shuffle, sort it about
five times, hitting F9 before each sort to change the random numbers,
and then deleting the random column and saving the file. Translate that
to Ruby and I think you've got it.

You are trashing your original array.

irb(main):001:0> a = [1,2,3,4]
=> [1, 2, 3, 4]
irb(main):002:0> l = a.length
=> 4
irb(main):003:0> a.each_index { |x| a = a[rand(l)] }
=> [3, 3, 3, 4]

···

On Apr 29, 2007, at 10:28 PM, Peter Seebach wrote:

l = a.length
a.each_index { |x| a = a[rand(l)] }

>
>> So, two vaguely related questions:
>> 1. Is there a reasonably sane way to do this, should I ever find
>> myself wanting one?
>
>Short answer: array.sort_by { rand }
>
>Longer answer: http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-
>talk/168963

Well, okay. It works in practice. Is it guaranteed to work, and is
it actually reasonable to expect the output to be randomly shuffled?

See my longer explanation of sort_by which crossed in the e-mail.

Just for comparison, I tried:

l = a.length
a.each_index { |x| a = a[rand(l)] }

I'm pretty sure I got the generic shuffle algorithm right (Knuth FTW),
and if so, it produces quality random results in about 1/4 the time.

Looks suspicious to me, you are replacing each element of x in turn
with a randomly selected element, which means that you are likely to
end up with a different set of elements in the end. One of the
requirements of shuffling an array would seem to be that

   shuffle(a).sort == a.sort

Something like:

   a.each_index {|x| y=rand; a, a[y] = a[y], a}

would seem to be a better approach, but a.sort_by {rand} does the same
thing more concisely.

···

On 4/29/07, Peter Seebach <seebs@seebs.net> wrote:

In message <40D2708E-0D0C-42E4-82B3-E31D93481880@mac.com>, Gary Wright writes:
>On Apr 29, 2007, at 10:01 PM, Peter Seebach wrote:

--
Rick DeNatale

My blog on Ruby
http://talklikeaduck.denhaven2.com/

Peter Seebach wrote:

l = a.length
a.each_index { |x| a = a[rand(l)] }

I'm pretty sure I got the generic shuffle algorithm right (Knuth FTW),
and if so, it produces quality random results in about 1/4 the time.

-s

The above probably has a typo or something.

This one's from Cormen, Leiserson, Rivest, and Stein. It swaps each element with a random element in the rest of the list (allowing it to be swapped with itself sometimes). I don't have the book in front of me, but I think it's something like:

   def randomize!(ary)
     for i in 0...(ary.length)
       j = i+rand(ary.length - i)
       # swap ary[i] and ary[j]
       tmp = ary[i]
       ary[i] = ary[j]
       ary[j] = tmp
     end
     ary
   end

I think I can prove that it's a valid sort, but I don't trust my own logic that much. I tested it with 1000 items and it seemed to have the right number of each item.

>> a = [0, 1, 2, 3, 4]
=> [0, 1, 2, 3, 4]
>> randomized =
=>
>> 1000.times do randomized << randomize!(a.dup)
>> end
=> 1000
>> count = 0
=> 0
>> randomized.each {|ary| count += 1 if ary == [3, 1, 2, 0, 4]}
>> count
=> 8

So for one of the 120 permutations of [0, 1, 2, 3, 4], I counted the occurrences of one of them. There were 8. On average, if there are 120 permutations, there will be 8 of each:

>> 1000/120
=> 8

Good enough for me, unless I was just really lucky. That happens with random numbers, you know.

Dan

Peter Seebach wrote:

···

In message <46358854.30705@cesmail.net>, "M. Edward (Ed) Borasky" writes:

Welll ... I do this in Excel all the time -- add a column of random
numbers to the right of the (CSV) file I want to shuffle, sort it about
five times, hitting F9 before each sort to change the random numbers,
and then deleting the random column and saving the file. Translate that
to Ruby and I think you've got it.

It seems expensive.

That's almost exactly what sort_by{rand} does...

--
Alex

True.

That said, with the apparent speed difference, it's probably worth
copying and then shuffling.

-s

···

In message <55DCC0B4-DF5C-4D70-99E8-78D1572DDD16@sbcglobal.net>, Gary Wright writes:

On Apr 29, 2007, at 10:28 PM, Peter Seebach wrote:

l = a.length
a.each_index { |x| a = a[rand(l)] }

You are trashing your original array.

In message <deb2337a0704291945w6d65facai51ce887daae4a9dc@mail.gmail.com>, "Rick DeNatale" write
s:

Looks suspicious to me, you are replacing each element of x in turn
with a randomly selected element, which means that you are likely to
end up with a different set of elements in the end.

DOH!

  a.each_index {|x| y=rand; a, a[y] = a[y], a}

would seem to be a better approach, but a.sort_by {rand} does the same
thing more concisely.

Yes.

However, the sort is doing a lot more comparisons and swaps. It's
O(n*log(n)), while the straight shuffle is O(n). On an array of
10,000 items, shuffling uses 20,000 assignments. Sorting by rand
used 132,000 comparisons, and that would imply a fair number of
swaps. Total time difference is about a factor of three.

I'm willing to trade a little conciseness for that.

-s

Right.

And so far as I can tell, sort_by{rand} is inefficient compared to
a canonical shuffle algorithm. That'd make sense; it's going to do a lot
of compares, and some fair number of them (I'd guess half, or close to
it?) result in swaps. A shuffle does N swaps, or 2N assignments. A
sort might do more. Sorting a list of 10,000 items, sorted by rand,
performed about 128,000 comparisons. If half of them result in swaps,
that's 64k swaps.

I can't think of any way in which sort_by{rand} is going to be "more random"
than a correct shuffle algorithm (yes, I know the one I posted the first time
is not that correct algorithm; I didn't do swaps, just assignment).

I also can't see an immediate reason to do the random numbers sort more
than once; I guess to reduce the chances of collisions preserving order.

-s

···

In message <4635948C.5050803@blackkettle.org>, Alex Young writes:

Peter Seebach wrote:

In message <46358854.30705@cesmail.net>, "M. Edward (Ed) Borasky" writes:

Welll ... I do this in Excel all the time -- add a column of random
numbers to the right of the (CSV) file I want to shuffle, sort it about
five times, hitting F9 before each sort to change the random numbers,
and then deleting the random column and saving the file. Translate that
to Ruby and I think you've got it.

It seems expensive.

That's almost exactly what sort_by{rand} does...

a.each_index { |x| r= rand(l); a, a[r] = a[r], a }

···

On Monday 30 April 2007 02:55, Dan Zwell wrote:

> l = a.length
> a.each_index { |x| a = a[rand(l)] }
>

--
Marcin Raczkowski
---
Friends teach what you should know
Enemies Teach what you have to know

Peter Seebach wrote:

It seems expensive.

That's almost exactly what sort_by{rand} does...

Right.

And so far as I can tell, sort_by{rand} is inefficient compared to
a canonical shuffle algorithm.

That's true, but keep in mind that sort_by{rand} is code that's built into the interpreter, so it's still gonna be pretty fast. I tested this on my system a few weeks ago, and rerunning the benchmark program, the traditional sort only becomes faster when the arrays in question have over 5,000 elements. After all, it's a better algorithm, but it's written in ruby and not C.

I also can't see an immediate reason to do the random numbers sort more
than once; I guess to reduce the chances of collisions preserving order.

If the numbers that rand() generates are good enough, the chance of collisions preserving order should be the same as the chance of the next iteration of randomizing putting them back in the same order, shouldn't it?

Dan

Yeah. I can't even play the "only been using this language three
days" card, that was just plain dumb. It's examples like that that have
made unit testing the best thing that ever happened to me. I remembered
that the key component was putting a random thing in each spot...

Thinking about it, I think that Knuth proved that it doesn't matter
whether you select "any random element" or "any random element from the rest
of the list" when shuffling. At least, I *think* that was in ACP.
I could be wrong; I probably read it fifteen years ago.

-s

···

In message <200704300827.59213.swistak@mailx.expro.pl>, Marcin Raczkowski writes:

On Monday 30 April 2007 02:55, Dan Zwell wrote:

> l = a.length
> a.each_index { |x| a = a[rand(l)] }

a.each_index { |x| r= rand(l); a, a[r] = a[r], a }

Marcin Raczkowski wrote:

···

On Monday 30 April 2007 02:55, Dan Zwell wrote:

l = a.length
a.each_index { |x| a = a[rand(l)] }

a.each_index { |x| r= rand(l); a, a[r] = a[r], a }

I forgot you could do that. No wonder arrays don't have a swap method!

That's true, but keep in mind that sort_by{rand} is code that's built
into the interpreter, so it's still gonna be pretty fast. I tested this
on my system a few weeks ago, and rerunning the benchmark program, the
traditional sort only becomes faster when the arrays in question have
over 5,000 elements. After all, it's a better algorithm, but it's
written in ruby and not C.

Hmm. On my system, at only 250 elements, the shuffle comes out ahead.
Worth noting that the margin is very thin; .11 vs. .12 seconds user,
although oddly, the system time is .05 vs. .15 seconds. Might be
some underlying quirk there. Still, that's a "large" set for many
programs. On the other hand, by, say, 20k, the factor is 3-1 or more.

If the numbers that rand() generates are good enough, the chance of
collisions preserving order should be the same as the chance of the next
iteration of randomizing putting them back in the same order, shouldn't it?

I think not quite. Imagine a hypothetical pair of entities. They have a
1/bignum chance of a collision, which preserves their relative sorting.

If you do it twice, the probability that the same pair of entities collided
both times is 1/bignum^2. They might otherwise end up in the same order, or
not, but there's only a 1/bignum^2 chance that they ended up in the same
order *due to a collision*.

-s

···

In message <4635A730.20506@gmail.com>, Dan Zwell writes:

Peter Seebach wrote:

If the numbers that rand() generates are good enough, the chance of collisions preserving order should be the same as the chance of the next iteration of randomizing putting them back in the same order, shouldn't it?

I think not quite. Imagine a hypothetical pair of entities. They have a
1/bignum chance of a collision, which preserves their relative sorting.

If you do it twice, the probability that the same pair of entities collided
both times is 1/bignum^2. They might otherwise end up in the same order, or
not, but there's only a 1/bignum^2 chance that they ended up in the same
order *due to a collision*.

-s

You're right. I guess I'm just inclined to discount a such a small difference in orderings. And the difference in our benchmarks may have been due to the fact that I'm using a different algorithm (for each element i, swap i with an element from i...end). Or perhaps because I have more method calls, as I made a method Array#randomize! (that calls Array#swap! n times)

Dan