Would it be more convenient to have a random_each in the Ruby stdlib?

I found it may be a common case that we need to randomly enumerate the elements in an array or some other collection.
Please take a look at the very simple snippet of code below:

class Array
  def random_each
    a = self.dup
    n = len = self.length
    len.times do
    index = rand(n)       # random index among the n remaining elements
    yield a[index]
    a.delete_at(index)    # remove it so it can't be picked again
      n -= 1
    end
  end
end

a = [1,2,3,4,5,6,7]
a.random_each {|e| print e , ' '}

result:
7 1 6 2 4 3 5

I don't know whether it would be proper to put this kind of simple but useful utility in the Ruby standard library.

Thanks.

uncutstone

···


From: uncutstone wu
Sent: Monday, June 19, 2006 9:54 AM
Subject: Would it be more convenient to have a random_each in the Ruby stdlib?

I found it may be a common case that we need to randomly enumerate the elements in an array or some other collection.

<snip>

uncutstone

I think it's a little bit too simple to be put in the standard lib.

a.sort_by{rand}.each {|e| print e , ' '}

cheers

Simon

Have a look at rand.rb:
http://chneukirchen.org/blog/static/projects/rand.html

Paul.

···

On 19/06/06, uncutstone wu <uncutstone@sina.com> wrote:

I found it may be a common case that we need to randomly enumerate the elements in an array or some other collection.

Kroeger, Simon (ext) wrote:

  def random_each

uncutstone

I think it's a little bit too simple to be put in the standard lib.

a.sort_by{rand}.each {|e| print e , ' '}

It is nice to see you make it so simple!

Many thanks for the reply.

uncutstone

···


Kroeger, Simon (ext) wrote:

From: uncutstone wu
Sent: Monday, June 19, 2006 9:54 AM

<snip>

class Array
  def random_each
    a = self.dup

<snip>

I think it's a little bit too simple to be put in the standard lib.

a.sort_by{rand}.each {|e| print e , ' '}

For large collections, it'd be much nicer not to have to either duplicate or sort - isn't this something that quasirandom sequences are good for? Is there some GSL-guru out there who can comment?

···

--
Alex

Paul Battley wrote:

···

On 19/06/06, uncutstone wu <uncutstone@sina.com> wrote:

I found it may be a common case that we need to randomly enumerate the elements in an array or some other collection.

Have a look at rand.rb:
http://chneukirchen.org/blog/static/projects/rand.html

Paul.

Thanks.

uncutstone


Alex Young wrote:

Kroeger, Simon (ext) wrote:
>> From: uncutstone wu
>> Sent: Monday, June 19, 2006 9:54 AM
<snip>
>> class Array
>>   def random_each
>>     a = self.dup
<snip>

> I think it's a little bit too simple to be put in the standard lib.
>
> a.sort_by{rand}.each {|e| print e , ' '}

For large collections, it'd be much nicer not to have to either
duplicate or sort - isn't this something that quasirandom sequences are
good for? Is there some GSL-guru out there who can comment?

Perhaps...

  class Array
     def random_each
          ((0...size).to_a.sort_by{rand}).each { |i|
            yield self[i]
          }
     end
  end

T.

Surely this would break some sort algorithms, since the same element
can return different sort priorities (using rand)... I assume it
doesn't break Ruby's sort algorithm?

···

On 6/19/06, uncutstone wu <uncutstone@sina.com> wrote:

Kroeger, Simon (ext) wrote:
>> def random_each
>>
>> uncutstone
>
> I think it's a little bit too simple to be put in the standard lib.
>
> a.sort_by{rand}.each {|e| print e , ' '}
>
It is nice to see you make it so simple!

transfire@gmail.com wrote:

Perhaps...

  class Array
     def random_each
          ((0...size).to_a.sort_by{rand}).each { |i|
            yield self[i]
          }
     end
  end

On second thought, that still sorts an array of the same size. Hmm...

T.

Leslie Viljoen wrote:

···

On 6/19/06, uncutstone wu <uncutstone@sina.com> wrote:

Kroeger, Simon (ext) wrote:
>> def random_each
>>
>> uncutstone
>
> I think it's a little bit too simple to be put in the standard lib.
>
> a.sort_by{rand}.each {|e| print e , ' '}
>
It is nice to see you make it so simple!

Surely this would break some sort algorithms, since the same element
can return different sort priorities (using rand)... I assume it
doesn't break Ruby's sort algorithm?

sort_by generates the block value once for each collection element, then uses that value for a standard sort, so it's not a problem.
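
A minimal sketch of that decorate-sort-undecorate behaviour (not #sort_by's actual implementation, just the conceptual equivalent):

   a = [1, 2, 3, 4, 5]
   decorated = a.map { |e| [rand, e] }            # one random key per element
   shuffled  = decorated.sort.map { |key, e| e }  # sort on the keys, keep the elements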

--
Alex

I think it's a little bit too simple to be put in the standard lib.

a.sort_by{rand}.each {|e| print e , ' '}

But it turns out not to be that simple. This can give a biased sort depending on the underlying sort algorithm, and it's not as efficient as it could be, as Alex observes:

For large collections, it'd be much nicer not to have to either duplicate or sort - isn't this something that quasirandom sequences are good for? Is there some GSL-guru out there who can comment?

A quasi-random sequence is a theoretically better flavour of random than a pseudo-random sequence in a lot of cases, but there's no direct application to this problem (apart from being used in place of Kernel#rand). It'd be overkill for all but the most specialised uses, I'd say.

Perhaps...

  class Array
     def random_each
          ((0...size).to_a.sort_by{rand}).each { |i|
            yield self[i]
          }
     end
  end

On second thought, that still sorts an array of the same size. Hmm...

What we're looking for is a shuffle; a random selection of elements from an array might not include all the elements from the source, whereas a shuffle is a permutation of the source.

Like a lot of simple-sounding problems, this one's already been solved, see:

perfect shuffle

The linked Haskell implementation also gives a very thorough discussion, including a demonstration of how a stable sorting algorithm will yield biased results. The imperative solution given in that discussion is identical to the Fisher-Yates shuffle on the NIST site, which is what I'll use here:
   - for i in 0...n, exchange element e_i with a randomly chosen element from e_i...e_n.

That's O(n), the best possible, compared to O(n lg n) when using #sort_by{rand}. The algorithm is proven to be unbiased (see the reference to Knuth given on the NIST site).

Intuitively, the best place for this seems like Enumerable, since anything that can be sorted can be shuffled. Is this the sort of thing that needs an RCR? Anyway, I doubt I'd use it myself, but I can see the following arguments for its inclusion:
  - It's easy to do wrong:
    - inefficiently and possibly biased using #sort_by.
    - biased under any other sort of element exchange:
      e.g., exchanging e_i with *any* element in the
      array instead of just the unshuffled elements
      leads to a biased shuffle.
  - It seems fairly generic.
  - People migrating from other languages may expect
    it (e.g., PHP, Java).

On the other hand, people who REALLY care about unbiased shuffles will already know how to do them.

Quick Ruby implementation for Array: In-place shuffling is the simplest case. We can use it to create a more Ruby-esque shuffle by using an approach similar to trans' one given above: shuffle an array of indices and use that as a mapping between the original and shuffled arrays.

class Array
   def shuffle!
     self.each_index { |i|
       r = rand(size - i) + i   # random index in the unshuffled portion i...size
       tmp = self[i]
       self[i] = self[r]
       self[r] = tmp
     }
   end

   # this would work if dup produced a deep copy.
   def dup_shuffle
     self.dup.shuffle!
   end

   # use an in-place shuffle of an array of indices
   def shuffle
     res = []
     indices = (0...size).to_a.shuffle!
     indices.each_with_index { |x, i|
       block_given? ? yield(self[x]) : res[i] = self[x]
     }
     block_given? ? self : res
   end
end
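
For example, a usage sketch (actual output will vary with the RNG):

   a = (1..5).to_a
   a.shuffle                        # => e.g. [3, 1, 5, 2, 4]
   a.shuffle { |e| print e, ' ' }   # prints e.g. "2 5 1 4 3 ", returns a
   a                                # => [1, 2, 3, 4, 5] -- the receiver is untouched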

matthew smillie.

···

On Jun 19, 2006, at 9:08, Kroeger, Simon (ext) wrote:
On Jun 19, 2006, at 10:52, Alex Young wrote:
On Jun 19, 2006, at 11:29, transfire@gmail.com wrote:

Matthew Smillie wrote:

I think it's a little bit too simple to be put in the standard lib.

a.sort_by{rand}.each {|e| print e , ' '}

But it turns out not to be that simple. This can give a biased sort depending on the underlying sort algorithm, and it's not as efficient as it could be, as Alex observes:

For large collections, it'd be much nicer not to have to either duplicate or sort - isn't this something that quasirandom sequences are good for? Is there some GSL-guru out there who can comment?

A quasi-random sequence is a theoretically better flavour of random than a pseudo-random sequence in a lot of cases, but there's no direct application to this problem (apart from being used in place of Kernel#rand). It'd be overkill for all but the most specialised uses, I'd say.

I was thinking of them more in terms of a random walk across a finite set of elements that's guaranteed to hit each element once and only once. You've given a good explanation of the shuffle principle, and that's great for when you want to actually reorder things, but what if you want to walk across the Enumerable without modifying it? Did I really mean "perfect hash functions" when I wrote "quasirandom sequences"?


--
Alex

Matthew Smillie wrote:

On the other hand, people who REALLY care about unbiased shuffles will already know how to do them.

The problem with this argument is that people might not know when they need one, and a shuffle that isn't unbiased when you've implicitly assumed it is sounds like the sort of thing that would generate really subtle problems. Correct me if I'm wrong...

···

--
Alex

Hi Matthew,

I like these implementations, though I have some comments and simplifications.

Matthew Smillie wrote:

  def shuffle
    res = []
    indices = (0...size).to_a.shuffle!
    indices.each_with_index { |x, i|
      block_given? ? yield(self[x]) : res[i] = self[x]
    }
    block_given? ? self : res
  end
end

I'm not too keen on #shuffle having a dual function as both a shuffler and an iterator -- I'd be more comfortable using `ary.shuffle.each'. That would also allow a major simplification of the implementation:

   class Array
     def shuffle
       values_at(*(0..size-1).to_a.shuffle!)
     end
   end
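
For instance (the ordering is random, of course):

   a = [1, 2, 3, 4, 5, 6, 7]
   a.shuffle.each { |e| print e, ' ' }   # e.g. 5 2 7 1 6 3 4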

Cheers,
Daniel

Quoting Matthew Smillie <M.B.Smillie@sms.ed.ac.uk>:

I think it's a little bit too simple to be put in the standard lib.

a.sort_by{rand}.each {|e| print e , ' '}

But it turns out not to be that simple. This can give a biased sort depending on the underlying sort algorithm, and it's not as efficient as it could be, as Alex observes:

<<SNIP>>

Like a lot of simple-sounding problems, this one's already been
solved, see:

perfect shuffle

The linked Haskell implementation also gives a very thorough
discussion, including a demonstration of how a stable sorting
algorithm will yield biased results.

Note that the discussion there does not apply to the implementation quoted, but
rather to this implementation:

a.sort_by{rand(a.size)}.each {|e| print e , ' '}

which I think we can all see the problem with. Absent any collisions in the values chosen by rand each time through the block, the short implementation quoted above is unbiased. It's still less efficient than a Fisher-Yates shuffle (at least in big-O notation; as a practical matter, I'd need to see some benchmarks), but "biased" is not a critique that can fairly be leveled against this implementation. (Unless you're willing to accept that your F-Y shuffle is "biased" too, because, for example, rand(7) doesn't return 0 with /exactly/ one-seventh probability, but rather with a probability that's merely very, very close to one-seventh.)

···

On Jun 19, 2006, at 9:08, Kroeger, Simon (ext) wrote:

--
@/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/;
map{y/X_/\n /;print}map{pop@$_}@/for@/

Matthew Smillie wrote:

On the other hand, people who REALLY care about unbiased shuffles will already know how to do them.

The problem with this argument is that people might not know when they need one, and a shuffle that isn't unbiased when you've implicitly assumed it is sounds like the sort of thing that would generate really subtle problems. Correct me if I'm wrong...

No, you're right. The catch, as usual, comes down to something being "good enough in practice".

It's a big circular argument: if you don't know that you need a perfect shuffle, then you're unlikely to actually need one (a biased one is often good enough). If you do know you *need* a perfect shuffle, you're likely to code your own rather than relying on library code, since it's simple enough to do so.

I don't actually have much preference either way for including any sort of shuffle in Array or Enumerable or wherever, I was just presenting pros/cons I could think of.

You've given a good explanation of the shuffle principle, and that's great for when you want to actually reorder things, but what if you want to walk across the Enumerable without modifying it? Did I really mean "perfect hash functions" when I wrote "quasirandom sequences"?

"Perfect hash functions" makes more sense to me in this context, anyway - I'm not aware that quasi-random sequences make any once-and-only-once guarantee the same way a perfect hash function does, they just guarantee a uniform distribution within a certain tolerance/discrepancy. I might be missing something subtle. Or something obvious, for that matter; I wouldn't know, since I'm the one missing it.

Anyway, that's the approach I used in Array#shuffle. The shuffled array of indices defines a perfect hash function for the original array: each key/index in the original array is mapped to a distinct integer, the key/index in the shuffled array of indices. Illustration:

original array:
    arr = ['a', 'b', 'c', 'd']

shuffled array of indices:
    [2, 0, 1, 3]

implicit perfect hash function:
    0 => 2, 1 => 0, 2 => 1, 3 => 3

application of the function to arr[0..size]:
    ['c', 'a', 'b', 'd']
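
In code, #values_at applies exactly that mapping - the same trick Daniel's simplified version uses:

    arr = ['a', 'b', 'c', 'd']
    arr.values_at(2, 0, 1, 3)   # => ["c", "a", "b", "d"]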

You can see in the implementation that providing a block parameter to Array#shuffle does walk across the entire Array without modifying it. Without the block parameter, it returns a new, shuffled Array.

matthew smillie.

···

On Jun 19, 2006, at 14:46, Alex Young wrote:

I'm not too keen on #shuffle having a dual function as both a shuffler and an iterator -- I'd be more comfortable using `ary.shuffle.each'. That would also allow a major simplification of the implementation:

  class Array
    def shuffle
      values_at(*(0..size-1).to_a.shuffle!)
    end
  end

I agree with you about 'arr.shuffle.each' being better semantically, but I was worried about efficiency: since the thread started out with concerns about shuffling really big arrays, I wanted to avoid creating an intermediate copy of the array. Thinking about it now, though, I realise an intermediate copy isn't all that bad (it's just references), so I'm not sure the savings would matter much.

Anyway, Array's just a special case: I'd say that it belongs in Enumerable along with sort, for which I'd grab your implementation.

module Enumerable
   def shuffle
     entries.values_at(*(0...entries.size).entries.shuffle!)
   end
end

Though this still relies on Array#shuffle!, which I can't say is a good or bad thing. Bit of a moot question unless someone does decide to put it into core or stdlib, though.
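
A quick usage sketch (assuming the Array#shuffle! given earlier; any Enumerable will do):

   (1..10).shuffle                  # => e.g. [4, 9, 1, 7, 2, 10, 5, 3, 8, 6]
   {'a' => 1, 'b' => 2}.shuffle     # => e.g. [["b", 2], ["a", 1]]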

matthew smillie.

Quoting Matthew Smillie <M.B.Smillie@sms.ed.ac.uk>:

I think it's a little bit too simple to be put in the standard lib.

a.sort_by{rand}.each {|e| print e , ' '}

But it turns out not to be that simple. This can give a biased sort depending on the underlying sort algorithm,

The linked Haskell implementation also gives a very thorough discussion, including a demonstration of how a stable sorting algorithm will yield biased results.

Note that the discussion there does not apply to the implementation quoted, but rather to this implementation:

a.sort_by{rand(a.size)}.each {|e| print e , ' '}

which I think we can all see the problem with. Absent any collisions in the values chosen by rand each time through the block, the short implementation quoted above is unbiased.

You don't provide anything to back up this assertion. What makes this implementation unbiased? What makes the other implementation biased? What's the crucial difference between the two?

(technical note - #sort_by calculates the key assigned to each element only once. In other words, each element of 'a' is assigned one and only one random key, and is then sorted according to that key. See the docs for more info.)

The only difference between the two implementations is the range of the random numbers - the first implementation uses a float in the range [0, 1.0), the second an integer in the range (0...a.size). For sorting purposes, though, the particular range of the keys is irrelevant; all that matters is that the keys are comparable (you could use strings if you felt like it, too).

This is proven very simply: given a finite set of N keys in the float range [0, 1.0), they can be mapped to a finite set of keys in the integer range (0...N) without changing their sort order: simply sort the float keys, and then enumerate them. The result is a set of integer keys with the identical sort order as the float keys. The reverse mapping is just as simple.
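
A concrete sketch of that mapping:

   floats = [0.72, 0.11, 0.45]                   # float keys in [0, 1.0)
   sorted = floats.sort                          # [0.11, 0.45, 0.72]
   ints   = floats.map { |f| sorted.index(f) }   # => [2, 0, 1]
   # sorting by ints orders the elements exactly as sorting by floats does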

In other words: the implementations are identical with respect to the sort. The sort is the only re-ordering performed, and so is responsible for any bias in the resulting shuffle. And therefore, if there's a bias in one implementation, there's a bias with the other one as well.

Regarding whether or not bias actually exists, I didn't say that the sorting implementation is definitely going to be biased, merely that implementations which rely on a sort are potentially biased *depending on the underlying sorting algorithm*.

This was precisely the problem in the Haskell paper - it assumed a stable sort, which introduces bias. I believe Ruby uses quicksort, which isn't stable, and so the method isn't biased. However, Ruby makes no guarantees about the underlying sorting algorithm; were it ever to change, it could potentially introduce bias into the shuffle. One such change might be the introduction of a sort which has locally-stable conditions, such as a quicksort that optimises by using insertion sort on small partitions.

If (as we're assuming) an unbiased shuffle is important, then it's dangerous to rely on non-guaranteed implementation details to ensure that.
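
To see the danger concretely, here's a rough empirical sketch: give each element a colliding integer key, emulate a stable sort by letting the element itself (i.e. the original position) break ties, and count the permutations that come out:

   n = 3
   counts = Hash.new(0)
   100_000.times do
     keyed = (0...n).map { |i| [rand(n), i] }   # [colliding key, element]
     counts[keyed.sort.map { |k, e| e }] += 1   # equal keys fall back to original order
   end
   counts.keys.sort.each { |perm| puts "#{perm.inspect}: #{counts[perm]}" }
   # [0, 1, 2] shows up roughly 37% of the time instead of the unbiased 1/6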

It's still less efficient than a Fisher-Yates shuffle (at least in big-O notation; as a practical matter, I'd need to see some benchmarks)

Acceptable benchmarks always depend on your intended use, but here are some simple wall-clock ones on some large arrays (large arrays because that's the case where performance might matter). The results are that f_y takes ~75% of the time of sort in the first case, and ~60% in the second, which is nice, but not nearly what the complexity predicts (is it ever?). Makes me wonder what the difference would be if they were in C.

matthew smillie

# a = (1..100000).map { rand(100000) }
                 user system total real
in-place f_y 1.230000 0.010000 1.240000 ( 1.320484)
  f_y_shuffle 1.370000 0.010000 1.380000 ( 1.471718)
sort_shuffle 1.870000 0.040000 1.910000 ( 2.004320)

# a = (1..1000000).map { rand(1000000) }
                 user system total real
in-place f_y 12.670000 0.080000 12.750000 ( 13.202975)
  f_y_shuffle 13.980000 0.150000 14.130000 ( 15.091745)
sort_shuffle 23.260000 0.410000 23.670000 ( 24.704380)

require 'benchmark'
include Benchmark

class Array
   def shuffle!
     self.each_index { |i|
       r = rand(size - i) + i
       tmp = self[i]
       self[i] = self[r]
       self[r] = tmp
     }
   end
   def shuffle
     values_at(*(0..size-1).to_a.shuffle!)
   end
   def sort_shuffle
     sort_by{rand}
   end
end

# test array
a = (1..100000).map { rand(100000) }
bm(10) { |b|
   b.report("in-place f_y") { a.shuffle! }
   b.report(" f_y_shuffle") { a.shuffle }
   b.report("sort_shuffle") { a.sort_shuffle }
}
a = (1..1000000).map { rand(1000000) }
bm(10) { |b|
   b.report("in-place f_y") { a.shuffle! }
   b.report(" f_y_shuffle") { a.shuffle }
   b.report("sort_shuffle") { a.sort_shuffle }
}

···

On Jun 20, 2006, at 4:08, Daniel Martin wrote:

On Jun 19, 2006, at 9:08, Kroeger, Simon (ext) wrote:

Matthew Smillie wrote:

Anyway, Array's just a special case: I'd say that it belongs in Enumerable along with sort, for which I'd grab your implementation.

module Enumerable
  def shuffle
    entries.values_at(*(0...entries.size).entries.shuffle!)
  end
end

Yup, Enumerable's the place it should be.

Though this still relies on Array#shuffle!, which I can't say is a good or bad thing.

Since #shuffle! would alter the receiver, it shouldn't be defined in Enumerable. Other ordered collections could implement their own versions.

I was wondering: is Ruby's parallel-assignment "switching" faster than using a temporary variable? #shuffle! can be written like this, but I don't know if it's faster.

   def shuffle!
     each_index do |i|
       r = rand(size - i) + i
       self[i], self[r] = self[r], self[i]
     end
   end
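
Benchmark can settle it; a rough sketch (the swap targets are arbitrary - the point is just to exercise both forms):

   require 'benchmark'

   a = (1..1_000_000).to_a
   Benchmark.bm(12) do |b|
     b.report('tmp swap') { a.each_index { |i| tmp = a[i]; a[i] = a[-1]; a[-1] = tmp } }
     b.report('parallel') { a.each_index { |i| a[i], a[-1] = a[-1], a[i] } }
   end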

Cheers,
Daniel

Matthew Smillie wrote:

I agree with you about 'arr.shuffle.each' being better semantically, but I was worried about efficiency: since the thread started out with concerns about shuffling really big arrays, I wanted to avoid creating an intermediate copy of the array.

<snip>

Yes - since an array just stores object references, (0...size).to_a is as large as the array's dup; they cost the same in time and space.
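
A one-line way to see the sharing:

   a = ['x', 'y']
   b = a.dup
   b[0].equal?(a[0])   # => true - dup copied the references, not the strings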

Best Regards.

uncutstone

···
