Hash-ish and Arrays and Duplicates

Hi,

How would you do this the Ruby way?

I want to write a little tool in Ruby which finds duplicate files
on filesystems.

Currently I have a hash whose keys are the fully qualified
pathnames and whose values are the md5sums of those files (by the
way: is there an md5sum generator inside Ruby?).

It is possible that a certain md5sum is stored multiple times as a
hash value.

To find out whether there are duplicated files at all I did (fphash is
the hash mentioned above):

ifphash = fphash.invert
if ifphash.length != fphash.length
  p "Duplicates indicated!"
end

This is quite simple, a bit brute-force, and fast.
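For illustration, that check sketched with toy paths and made-up digest values (the file names and digests below are mine, not from a real run):

```ruby
# Toy path => digest hash; "d1" appears twice, i.e. two files share content.
fphash = {
  "/home/a.txt"        => "d1",
  "/home/b.txt"        => "d2",
  "/tmp/copy_of_a.txt" => "d1"
}

# invert collapses duplicate values into a single key, so a length
# mismatch signals at least one duplicated file.
ifphash = fphash.invert
puts "Duplicates indicated!" if ifphash.length != fphash.length
```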

But now the tricky part, for which I have no solution beyond the
step-by-step method that works in nearly every programming
language:

WHICH files are duplicated?

The invert discards the information about all but one of the files,
since the duplicated values collapse into a single key.

fparr = fphash.values
fparr.uniq!

does much the same thing.

Methods like has_value? and has_key? only return a boolean, not
the corresponding item(s).
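One built-in way around that limitation is Hash#select, which returns the matching pairs rather than a boolean (a minimal sketch; the toy data is mine):

```ruby
# Toy path => digest data.
fphash = { "/a" => "d1", "/b" => "d2", "/c" => "d1" }

# All paths whose digest equals the one we are probing for,
# instead of the mere true/false that has_value? would give.
target = "d1"
dups = fphash.select { |path, sum| sum == target }.map { |path, _| path }
p dups
```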

Is there a method or Ruby "trick" to solve the problem?

Thank you very much in advance for any help and/or hint! :-)

Have a nice weekend!
Meino Cramer

   Don't worry, be Ruby!

   I like rubymental programming !

Meino Christian Cramer Meino.Cramer@gmx.de skrev den Sat, 6 Sep 2003
16:59:21 +0900:

Is there a method or Ruby "trick" to solve the problem?
Thank you very much in advance for any help and/or hint! :-)

There are shorter solutions but here is one pretty clear one:

class Hash
  def invert_with_duplicates
    nh = Hash.new { Array.new }
    self.each { |k, v| nh[v] = (nh[v].push k) }
    nh
  end
end
h = {:a => 1, :b => 1, :c => 2}
p h.invert_with_duplicates # => {1=>[:a, :b], 2=>[:c]}
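Applied back to the original problem, reporting the duplicated files becomes a short loop over the inverted hash (toy paths and digests are mine; the method is repeated so the snippet runs on its own):

```ruby
# invert_with_duplicates as defined above, repeated for a standalone snippet.
class Hash
  def invert_with_duplicates
    nh = Hash.new { Array.new }
    self.each { |k, v| nh[v] = (nh[v].push k) }
    nh
  end
end

# Toy path => digest data; "d1" is shared by two files.
fphash = { "/a" => "d1", "/b" => "d2", "/c" => "d1" }
fphash.invert_with_duplicates.each do |sum, files|
  puts "#{sum}: #{files.join(', ')}" if files.size > 1
end
```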

Regards,

/Robert Feldt

Meino Christian Cramer Meino.Cramer@gmx.de writes:

I want to write a little tool in Ruby which finds duplicate files
on filesystems.

How about this?

$ find some_dir -type f -print0 | xargs -0 md5sum | sort | uniq -D -w32

Quite short, but not Ruby as you requested ;-)

/Johan Holmberg

Currently I have a hash whose keys are the fully qualified
pathnames and whose values are the md5sums of those files (by the
way: is there an md5sum generator inside Ruby?).

Key = md5sum, value = array of filenames
(feed it with "hash[md5(filecontents)] << filename").
This way you know the duplicates' names and can check whether anything
is duplicated by doing something like

hash.each { |sum, files| puts "dups" if files.size > 1 }

It may be not quite as fast esp. if there are no duplicates, but should work.
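A minimal sketch of that digest-keyed layout over a real directory tree (the helper name duplicates_by_digest is mine, not from the thread):

```ruby
require 'digest/md5'
require 'find'

# Build digest => array-of-paths for every regular file under dir.
# The Hash.new block stores a fresh array the first time a digest is seen.
def duplicates_by_digest(dir)
  hash = Hash.new { |h, k| h[k] = [] }
  Find.find(dir) do |path|
    hash[Digest::MD5.hexdigest(File.read(path))] << path if File.file?(path)
  end
  hash
end

# Any entry with more than one path is a set of duplicate files:
# duplicates_by_digest("some_dir").each { |sum, files| p files if files.size > 1 }
```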



keep in touch. berkus. – http://lye.upnet.ru/

Robert Feldt feldt@ce.chalmers.se skrev den Sat, 6 Sep 2003 17:26:38
+0900:

self.each {|k,v| nh[v] = (nh[v].push k)}

even clearer:

self.each {|k,v| nh[v] <<= k }

/RF

Hi Robert !

Thank you for your reply !

This stuff is, what I as a ruby newbie need ! Great ! Thanks ! :O)

Have a nice weekend !
Meino


Meino Christian Cramer Meino.Cramer@gmx.de skrev den Sat, 6 Sep 2003
19:58:20 +0900:

This stuff is, what I as a ruby newbie need ! Great ! Thanks ! :O)

Np.

BTW, you can do MD5 in Ruby with:

require 'digest/md5'
digester = Digest::MD5.new
digester.update(aString)
p digester.hexdigest # gives human-readable digest
p digester.digest # the raw (shorter) digest

or the shorter

digest = Digest::MD5.new(aString).digest

It will be somewhat slower than md5sum, though. Not
that you’ll notice unless you have really large files…
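Digest::MD5 also offers one-shot class methods that skip the explicit digester object; a quick sketch (the sample string is mine):

```ruby
require 'digest/md5'

# One-shot class methods; no explicit digester object needed.
hex = Digest::MD5.hexdigest("hello")  # 32-character hex string
raw = Digest::MD5.digest("hello")     # raw 16-byte string
p hex
p raw.length
```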

Regards,

Robert