Need some Ruby magic

Ok, I think I've beat sort_by{ rand } on large Arrays , by about 30%. I ended up with this:

     h = Hash.new
     self.each { |v| h[rand(1000000000)] = v }
     h.keys.sort.collect { |k| h[k] }

Explanation..

* Hash works better than an Array of Arrays [[rand,x[0]],[rand,x[1]],...] because Array#<=> seems to be slow.

* Got some speedup by avoiding Float#<=>, hence the Integer version of rand.

* On rand(1000000000); chose 1billion rather arbitrarily to keep things out of the range of BigNum. Oddly, using 2^31-1 caused BigNums to show up. Ideally we'd use the largest FixNum, but I don't know how to retrieve that.

* Bias should be minimal for most applications, even at rand(1Billion).

* Can it be improved further???

Find my benchmark results, and script below.

--Steve

                       user system total real
shuffle (10) 0.016000 0.000000 0.016000 ( 0.016000)
shuffle2 (10) 0.000000 0.000000 0.000000 ( 0.000000)
shuffle3 (10) 0.000000 0.000000 0.000000 ( 0.000000)
sort (10) 0.000000 0.000000 0.000000 ( 0.000000)
sort_by (10) 0.000000 0.000000 0.000000 ( 0.000000)

                       user system total real
shuffle (100) 0.156000 0.000000 0.156000 ( 0.156000)
shuffle2 (100) 0.047000 0.000000 0.047000 ( 0.046000)
shuffle3 (100) 0.031000 0.000000 0.031000 ( 0.032000)
sort (100) 0.125000 0.000000 0.125000 ( 0.125000)
sort_by (100) 0.031000 0.000000 0.031000 ( 0.031000)

                       user system total real
shuffle (1000) 1.625000 0.000000 1.625000 ( 1.625000)
shuffle2 (1000) 0.656000 0.000000 0.656000 ( 0.672000)
shuffle3 (1000) 0.266000 0.000000 0.266000 ( 0.281000)
sort (1000) 2.094000 0.000000 2.094000 ( 2.093000)
sort_by (1000) 0.453000 0.000000 0.453000 ( 0.453000)

                       user system total real
shuffle (10000) 16.469000 0.032000 16.501000 ( 16.578000)
shuffle2 (10000) 8.156000 0.000000 8.156000 ( 8.203000)
shuffle3 (10000) 3.344000 0.047000 3.391000 ( 3.391000)
sort (10000) 25.453000 0.000000 25.453000 ( 25.593000)
sort_by (10000) 4.765000 0.000000 4.765000 ( 4.796000)

#!/usr/bin/env ruby

require 'benchmark'

class Array

   def shuffle
     newarr = self.dup
     2.times do
       self.length.times do |x|
         y = rand(self.length)
         newarr[x], newarr[y] = newarr[y], newarr[x]
       end
     end
     newarr
   end

   def shuffle2
     tmparr = self.collect { |x| [rand(1000000000),x] }
     tmparr.sort.collect { |x| x[1] }
   end

   def shuffle3
     h = Hash.new
     self.each { |v| h[rand(1000000000)] = v }
     h.keys.sort.collect { |k| h[k] }
   end

end

[10, 100, 1000, 10000].each do |n|

   # populate an Array size n
   a = Array.new(n) { |i| i }

   # run our benchmarks
   Benchmark.bm(16) do |bb|

     # my method
     bb.report("shuffle ("+n.to_s+")") do
       100.times do
         a.shuffle
       end
     end

     # decorate, sort, undecorate
     bb.report("shuffle2 ("+n.to_s+")") do
       100.times do
         a.shuffle2
       end
     end

     # decorate, sort, undecorate with a hash
     bb.report("shuffle3 ("+n.to_s+")") do
       100.times do
         a.shuffle3
       end
     end

     # Array's sort
     bb.report("sort ("+n.to_s+")") do
       100.times do
         a.sort { rand(10000) <=> 5000 }
       end
     end

     # Enumerable's sort_by
     bb.report("sort_by ("+n.to_s+")") do
       100.times do
         a.sort_by { rand }
       end
     end

     printf("\n")

   end

end

> If you use random as sorting key, the result of a<=>b
> will differ each time you evaluate a<=>b.
> In my case I attach a random number to each item,
> then I sort my list according to this number and remove
> this 'decoration' afterwards.

run

  ri Enum#sort_by

and see what it do.

Thanks. As I told you before I'm new to ruby :wink:

Greetings, Uwe

Stephen Waits wrote:

* On rand(1000000000); chose 1billion rather arbitrarily to keep things out of the range of BigNum. Oddly, using 2^31-1 caused BigNums to show up. Ideally we'd use the largest FixNum, but I don't know how to retrieve that.

2**30 - 1 is reliably a fixnum (though I guess it might not be the biggest one on some platforms).

···

--
        vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Stephen Waits wrote:

    h = Hash.new
    self.each { |v| h[rand(1000000000)] = v }
    h.keys.sort.collect { |k| h[k] }

OOPS!

I forgot to deal with collisions in the hash! :frowning:

--Steve

Stephen Waits wrote:

I forgot to deal with collisions in the hash! :frowning:

Here's another shot.. still faster than sort_by{rand} on my system, and now should be unbiased:

   def shuffle3
     h = Hash.new
     self.each do |v|
       k = rand(1000000000) until !h.has_key?(k)
       h[k] = v
     end
   end

shuffle3 (10000) 3.625000 0.016000 3.641000 ( 3.656000)
sort_by (10000) 5.250000 0.000000 5.250000 ( 5.312000)

--Steve

Stephen Waits wrote:
>I forgot to deal with collisions in the hash! :frowning:

Here's another shot.. still faster than sort_by{rand} on my system, and
now should be unbiased:

  def shuffle3
    h = Hash.new
    self.each do |v|
      k = rand(1000000000) until !h.has_key?(k)

This is not equivalent to
    begin
      k = rand(1000000000)
    end until !h.has_key?

Compare
    k = 1 until true
    k # => nil
to
    begin; k = 1 end until true
    k # => 1

      h[k] = v

You forgot
     h.keys.sort.collect { |k| h[k] }
    

    end
  end

shuffle3 (10000) 3.625000 0.016000 3.641000 ( 3.656000)
sort_by (10000) 5.250000 0.000000 5.250000 ( 5.312000)

After fixing shuffle3, I get:

module Enumerable
  def shuffle3
    h = Hash.new
    k = rand(1000000000)
    self.each do |v|
      k = rand(1000000000) until !h.has_key?(k)
      h[k] = v
    end
    h.keys.sort.collect { |k| h[k] }
  end

  def shuffle4
    h = {}
    k = rand(1000000000)
    self.sort_by{ k = rand(1000000000) while h.has_key?(k); h[k] = k }
  end
end

GC.disable
(1..10).shuffle4 # => [8, 3, 10, 2, 4, 6, 7, 5, 9, 1]
a = 1..100000
t0 = Time.new # => Tue Dec 06 22:13:56 CET 2005
a.shuffle4
(t1 = Time.new) - t0 # => 0.766411
a.shuffle3
(t2 = Time.new) - t1 # => 0.575967
a.sort_by{ rand }
Time.new - t2 # => 0.637288

···

On Wed, Dec 07, 2005 at 05:04:13AM +0900, Stephen Waits wrote:

--
Mauricio Fernandez

Mauricio Fernández wrote:

Compare
    k = 1 until true
    k # => nil
to
    begin; k = 1 end until true
    k # => 1

Oops.. I thought too much like C. Now I understand that the loop is called 0 or more times, except in the case of "begin; end while" where it's 1 or more. Thank you for teaching me!

After fixing shuffle3, I get:

I'm getting slightly different results, still better than sort_by{rand} and better than your new shuffle4, but now only on larger Arrays. I wonder too how performance might vary when Enumerable is mixed into ADTs other than Array.

--Steve

                              user system total real
shuffle3 (1000) 0.375000 0.000000 0.375000 ( 0.375000)
shuffle4 (1000) 0.484000 0.000000 0.484000 ( 0.485000)
sort_by{rand} (1000) 0.328000 0.016000 0.344000 ( 0.344000)

                              user system total real
shuffle3 (10000) 3.906000 0.031000 3.937000 ( 3.953000)
shuffle4 (10000) 5.531000 0.047000 5.578000 ( 5.594000)
sort_by{rand} (10000) 4.672000 0.109000 4.781000 ( 4.812000)

                              user system total real
shuffle3 (100000) 49.703000 0.328000 50.031000 ( 50.280000)
shuffle4 (100000) 71.938000 0.453000 72.391000 ( 73.061000)
sort_by{rand} (100000) 62.000000 0.782000 62.782000 ( 63.451000)

module Enumerable

   def shuffle3
     h = {}
     self.each do |v|
       begin
         k = rand(1000000000)
       end while h.has_key?(k)
       h[k] = v
     end
     h.keys.sort.collect { |k| h[k] }
   end

   def shuffle4
     h = {}
     k = rand(1000000000)
     self.sort_by{ k = rand(1000000000) while h.has_key?(k); h[k] = k }
   end

end