Fast "set difference" on arrays

Martin_Pirker1 · 6 July 2002 14:18

Hi…

I have a Ruby speed problem, maybe you have some suggestions:

given: text files with values, one per line, sorted, e.g.
10100
10234
10292
…

so:
arr1 = IO.readlines(file1)
arr2 = IO.readlines(file2)

arr1 consists of ~40000 lines/elements
arr2 size is ~10000

when I want to take the “set difference”, arr3 = arr1-arr2, meaning “take
all elements from arr1 which dont appear in arr2” this takes forever - I
don’t even know how long because I stopped early

So the built-in operator is too slow - taking advantage of my knowledge
of the sorting I handcoded a "loop { compare arr1[0] with arr2[0], either
puts or remove }"
This gets it down to 5-10s, much better, but still too slow - 1s would be
a nice target

As I’m still quite novice in Ruby (always having Hal’s book on my lap …)
how would you code a speedier solution?

(as for shell tools, I first tried the grep way, but it chokes on the size
with a “reg.exp. too large”)

Martin

Dave_Thomas · 6 July 2002 14:43

Martin Pirker crf@sbox.tu-graz.ac.dfgdfhjhzjgfdfsddadshrhdrhdfdsasaff.at writes:

given: text files with values, one per line, sorted, e.g.
10100
10234
10292
…

so:
arr1 = IO.readlines(file1)
arr2 = IO.readlines(file2)

arr1 consists of ~40000 lines/elements
arr2 size is ~10000

when I want to take the “set difference”, arr3 = arr1-arr2, meaning “take
all elements from arr1 which dont appear in arr2” this takes forever - I
don’t even know how long because I stopped early

The following runs in about .5s on my pokey old box:

s1 = {}
File.foreach(ARGV[1]) {|line| s1[line] = 1}
File.foreach(ARGV[0]) {|line| puts(line) unless s1[line]}

Note that it’s doing String comparisons, not integer ones, but if both
of your files are generated the same way that won’t be a problem.

Cheers

Dave

Evan_Martin · 6 July 2002 17:11

You could also do it without hashes (though that solution is clearly the
simplest). This algorithm feels to me like the merge stage of a merge
sort, sorta (if you’re not familiar with merge sort, google for it):

a1 = [1, 3, 4, 10, 13, 31, 53, 54, 55]
a2 = [3, 10, 10, 11, 13, 54, 55]

i = 0
a3 =
a1.each { |a|
i = i + 1 while a2[i] < a

a3 << a if a2[i] > a
}

a3.each {|a| puts a}

···

On Sat, Jul 06, 2002 at 11:18:47PM +0900, Martin Pirker wrote:

I have a Ruby speed problem, maybe you have some suggestions:

given: text files with values, one per line, sorted, e.g.
10100
10234
10292
…

so:
arr1 = IO.readlines(file1)
arr2 = IO.readlines(file2)

arr1 consists of ~40000 lines/elements
arr2 size is ~10000

when I want to take the “set difference”, arr3 = arr1-arr2, meaning “take
all elements from arr1 which dont appear in arr2” this takes forever - I
don’t even know how long because I stopped early

So the built-in operator is too slow - taking advantage of my knowledge
of the sorting I handcoded a “loop { compare arr1[0] with arr2[0], either
puts or remove }”
This gets it down to 5-10s, much better, but still too slow - 1s would be
a nice target

As I’m still quite novice in Ruby (always having Hal’s book on my lap …)
how would you code a speedier solution?

–
Evan Martin
martine@cs.washington.edu
http://neugierig.org

Ned_Konz · 6 July 2002 17:23

How fast does this go:

h2 = Hash.new(false)
IO.readlines(ARGV[1]).each { |line| h2[line] = true }
puts “h2 has #{h2.size} elements”

diff =
IO.readlines(ARGV[0]).each { |line| diff << line unless h2[line] }

puts “diff has #{diff.size} elements”

···

On Saturday 06 July 2002 07:18 am, Martin Pirker wrote:

Hi…

I have a Ruby speed problem, maybe you have some suggestions:

given: text files with values, one per line, sorted, e.g.
10100
10234
10292
…

so:
arr1 = IO.readlines(file1)
arr2 = IO.readlines(file2)

arr1 consists of ~40000 lines/elements
arr2 size is ~10000

when I want to take the “set difference”, arr3 = arr1-arr2, meaning
“take all elements from arr1 which dont appear in arr2” this takes
forever - I don’t even know how long because I stopped early

–
Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE

Joseph_McDonald · 6 July 2002 15:02

when I want to take the “set difference”, arr3 = arr1-arr2, meaning “take
all elements from arr1 which dont appear in arr2” this takes forever - I
don’t even know how long because I stopped early

The following runs in about .5s on my pokey old box:

s1 = {}
File.foreach(ARGV[1]) {|line| s1[line] = 1}
File.foreach(ARGV[0]) {|line| puts(line) unless s1[line]}

Note that it’s doing String comparisons, not integer ones, but if both
of your files are generated the same way that won’t be a problem.

Do you think Array#- should do the hash trick internally instead of a
complete scan x*y times ? I do.

thanks,
-joe

ts1 · 6 July 2002 15:30

Do you think Array#- should do the hash trick internally instead of a
complete scan x*y times ? I do.

It can't : the "hash trick" use #eql? and #hash
Array#- use #==

pigeon% ruby -e 'a={{1=>1}=>1}; b = {1=>1}; p a[b]; p (a.keys - [b])'
nil

pigeon%

Unrelated but

(as for shell tools, I first tried the grep way, but it chokes on the size
with a "reg.exp. too large")

For shell tools

comm -23 file1 file2

Guy Decoux

Nobuyoshi_Nakada · 6 July 2002 16:15

Hi,

···

At Sun, 7 Jul 2002 00:02:58 +0900, Joseph McDonald wrote:

Do you think Array#- should do the hash trick internally instead of a
complete scan x*y times ? I do.

Already Array#& and #| use the trick, so it would be nice.

–
Nobu Nakada

Topic		Replies	Views
IO.readlines is slow? ruby-talk	17	105	7 August 2002
Problem with arrays ruby-talk	1	60	8 November 2007
Advice requested on array diff ruby-talk	1	135	19 December 2004
Speeding up ruby-talk	5	58	5 December 2002
Array question ruby-talk	13	89	17 September 2008

Fast "set difference" on arrays

Related Topics