Fast "set difference" on arrays


(Martin Pirker) #1

Hi…

I have a Ruby speed problem, maybe you have some suggestions:

given: text files with values, one per line, sorted, e.g.
10100
10234
10292

so:
arr1 = IO.readlines(file1)
arr2 = IO.readlines(file2)

arr1 consists of ~40000 lines/elements
arr2 size is ~10000

when I want to take the “set difference”, arr3 = arr1-arr2, meaning “take
all elements from arr1 which dont appear in arr2” this takes forever - I
don’t even know how long because I stopped early :wink:

So the built-in operator is too slow - taking advantage of my knowledge
of the sorting I handcoded a "loop { compare arr1[0] with arr2[0], either
puts or remove }"
This gets it down to 5-10s, much better, but still too slow - 1s would be
a nice target

As I’m still quite novice in Ruby (always having Hal’s book on my lap :wink: …)
how would you code a speedier solution?

(as for shell tools, I first tried the grep way, but it chokes on the size
with a “reg.exp. too large”)

Martin


(Dave Thomas) #2

Martin Pirker crf@sbox.tu-graz.ac.dfgdfhjhzjgfdfsddadshrhdrhdfdsasaff.at writes:

given: text files with values, one per line, sorted, e.g.
10100
10234
10292

so:
arr1 = IO.readlines(file1)
arr2 = IO.readlines(file2)

arr1 consists of ~40000 lines/elements
arr2 size is ~10000

when I want to take the “set difference”, arr3 = arr1-arr2, meaning “take
all elements from arr1 which dont appear in arr2” this takes forever - I
don’t even know how long because I stopped early :wink:

The following runs in about .5s on my pokey old box:

s1 = {}
File.foreach(ARGV[1]) {|line| s1[line] = 1}
File.foreach(ARGV[0]) {|line| puts(line) unless s1[line]}

Note that it’s doing String comparisons, not integer ones, but if both
of your files are generated the same way that won’t be a problem.

Cheers

Dave


(Evan Martin) #3

You could also do it without hashes (though that solution is clearly the
simplest). This algorithm feels to me like the merge stage of a merge
sort, sorta (if you’re not familiar with merge sort, google for it):

a1 = [1, 3, 4, 10, 13, 31, 53, 54, 55]
a2 = [3, 10, 10, 11, 13, 54, 55]

i = 0
a3 = []
a1.each { |a|
i = i + 1 while a2[i] < a

a3 << a if a2[i] > a
}

a3.each {|a| puts a}

···

On Sat, Jul 06, 2002 at 11:18:47PM +0900, Martin Pirker wrote:

I have a Ruby speed problem, maybe you have some suggestions:

given: text files with values, one per line, sorted, e.g.
10100
10234
10292

so:
arr1 = IO.readlines(file1)
arr2 = IO.readlines(file2)

arr1 consists of ~40000 lines/elements
arr2 size is ~10000

when I want to take the “set difference”, arr3 = arr1-arr2, meaning “take
all elements from arr1 which dont appear in arr2” this takes forever - I
don’t even know how long because I stopped early :wink:

So the built-in operator is too slow - taking advantage of my knowledge
of the sorting I handcoded a "loop { compare arr1[0] with arr2[0], either
puts or remove }"
This gets it down to 5-10s, much better, but still too slow - 1s would be
a nice target

As I’m still quite novice in Ruby (always having Hal’s book on my lap :wink: …)
how would you code a speedier solution?


Evan Martin
martine@cs.washington.edu
http://neugierig.org


(Ned Konz) #4

How fast does this go:

h2 = Hash.new(false)
IO.readlines(ARGV[1]).each { |line| h2[line] = true }
puts “h2 has #{h2.size} elements”

diff = []
IO.readlines(ARGV[0]).each { |line| diff << line unless h2[line] }

puts “diff has #{diff.size} elements”

···

On Saturday 06 July 2002 07:18 am, Martin Pirker wrote:

Hi…

I have a Ruby speed problem, maybe you have some suggestions:

given: text files with values, one per line, sorted, e.g.
10100
10234
10292

so:
arr1 = IO.readlines(file1)
arr2 = IO.readlines(file2)

arr1 consists of ~40000 lines/elements
arr2 size is ~10000

when I want to take the “set difference”, arr3 = arr1-arr2, meaning
"take all elements from arr1 which dont appear in arr2" this takes
forever - I don’t even know how long because I stopped early :wink:


Ned Konz
http://bike-nomad.com
GPG key ID: BEEA7EFE


(Joseph McDonald) #5

when I want to take the “set difference”, arr3 = arr1-arr2, meaning “take
all elements from arr1 which dont appear in arr2” this takes forever - I
don’t even know how long because I stopped early :wink:

The following runs in about .5s on my pokey old box:

s1 = {}
File.foreach(ARGV[1]) {|line| s1[line] = 1}
File.foreach(ARGV[0]) {|line| puts(line) unless s1[line]}

Note that it’s doing String comparisons, not integer ones, but if both
of your files are generated the same way that won’t be a problem.

Do you think Array#- should do the hash trick internally instead of a
complete scan x*y times ? I do.

thanks,
-joe


(ts) #6

Do you think Array#- should do the hash trick internally instead of a
complete scan x*y times ? I do.

It can't : the "hash trick" use #eql? and #hash
            Array#- use #==

pigeon% ruby -e 'a={{1=>1}=>1}; b = {1=>1}; p a[b]; p (a.keys - [b])'
nil
[]
pigeon%

Unrelated but

(as for shell tools, I first tried the grep way, but it chokes on the size
with a "reg.exp. too large")

For shell tools

  comm -23 file1 file2

Guy Decoux


(Nobuyoshi Nakada) #7

Hi,

···

At Sun, 7 Jul 2002 00:02:58 +0900, Joseph McDonald wrote:

Do you think Array#- should do the hash trick internally instead of a
complete scan x*y times ? I do.

Already Array#& and #| use the trick, so it would be nice.


Nobu Nakada