Basically I want to generate an md5 hash from considerably large files
to determine if they are exactly the same. Is there a better way to do
this besides comparing md5 hashes?
Thanks for your help.
--
Posted via http://www.ruby-forum.com/.
Ben Johnson wrote:
Basically I want to generate an md5 hash from considerably large files
to determine if they are exactly the same. Is there a better way to do
this besides comparing md5 hashes?

Thanks for your help.
I neglected to include some necessary details, sorry about that.
Basically the reason I want to do this is so I can store the md5 in
the database and determine if I have come across this file before. So
when I receive the file again I can md5 it, query my db, and if it's
in my db I know I've come across this file before.
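Roughly, the flow I have in mind is something like this sketch (the
Set just stands in for my database, and the file name is a
placeholder):

require 'digest/md5'
require 'set'

seen = Set.new  # stands in for the database table of digests

# For a really large file I'd read in chunks instead of File.read.
digest = Digest::MD5.hexdigest(File.read("incoming.dat"))
if seen.include?(digest)
  puts "seen this file before"
else
  seen << digest
end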
Thanks for your help.
--
Posted via http://www.ruby-forum.com/.
Ben Johnson wrote:
Basically I want to generate an md5 hash from considerably large files
to determine if they are exactly the same. Is there a better way to do
this besides comparing md5 hashes?

Thanks for your help.
--
Posted via http://www.ruby-forum.com/.
I conducted a few tests to compare the performance of different
comparison methods. I tested using string comparison, the zlib
library's crc32 checksum, and the Digest::MD5 hash. The file is
iterated over in chunks, and the 1K, 10K, etc. refer to the size of
the chunks. There is also a whole-file measure for each of them.
The test files were identical Ogg Vorbis audio files just below 8MB in
size (identical files should give worst-case performance). Times are
for 100 repetitions.
Rehearsal -------------------------------------------------------
...... removed for brevity
-------------------------------------------- total: 214.900000sec
user system total real
String 1K 13.400000 4.250000 17.650000 ( 10.612437)
String 10K 7.633333 4.716667 12.350000 ( 7.420777)
String 100K 7.616667 4.166667 11.783333 ( 7.071255)
String Whole 7.300000 6.433333 13.733333 ( 8.260925)
CRC32 1K 16.700000 4.466667 21.166667 ( 12.774677)
CRC32 10K 9.833333 4.600000 14.433333 ( 8.769574)
CRC32 100K 9.383333 4.166667 13.550000 ( 8.129907)
CRC32 Whole 9.016667 6.333333 15.350000 ( 9.221654)
MD5 1K 26.833333 4.833333 31.666667 ( 19.087961)
MD5 10K 16.133333 4.333333 20.466667 ( 12.327322)
MD5 100K 15.216667 4.083333 19.300000 ( 11.703880)
MD5 Whole 14.633333 6.333333 20.966667 ( 12.634441)
Notice that using MD5 is significantly slower than normal string
comparison. This also shows that there is little performance gain
between 10KB buffers and 100KB buffers, indicating that somewhere in
the 10K range is a good buffer size for the memory/performance
tradeoff.
Of course, if you really need speed you may want to code this in C
and improve these times further, but a comparison rate of almost
100MB per second isn't too shabby.
Here's the test code for those interested:
require 'zlib'
require 'digest/md5'
require 'benchmark'

# Walk both files in lockstep, yielding one chunk from each.
# A block_size of nil makes read() slurp the whole file at once.
def step_blocks(file_a, file_b, block_size)
  until file_a.eof?
    a = file_a.read(block_size)
    b = file_b.read(block_size)
    yield a, b
  end
end

def test_string_equality(file_a, file_b, block_size)
  step_blocks(file_a, file_b, block_size) do |a, b|
    return false unless a == b
  end
  true
end

def test_crc32_equality(file_a, file_b, block_size)
  step_blocks(file_a, file_b, block_size) do |a, b|
    return false unless Zlib.crc32(a) == Zlib.crc32(b)
  end
  true
end

def test_md5_equality(file_a, file_b, block_size)
  step_blocks(file_a, file_b, block_size) do |a, b|
    return false unless Digest::MD5.digest(a) == Digest::MD5.digest(b)
  end
  true
end

def test_files(filename_a, filename_b, test_method, block_size)
  raise ArgumentError unless File.exist?(filename_a) &&
                             File.exist?(filename_b)
  # Files of different sizes can never be identical, so skip reading.
  return false unless File.size(filename_a) == File.size(filename_b)
  file_a = File.new(filename_a, 'rb')
  file_b = File.new(filename_b, 'rb')
  result = send(test_method, file_a, file_b, block_size)
  file_a.close
  file_b.close
  result
end

FILE1 = "a.ogg"
FILE2 = "b.ogg"
REPEATS = 100

if $0 == __FILE__
  Benchmark.bmbm(20) do |x|
    x.report("String 1K")    { REPEATS.times { test_files(FILE1, FILE2, :test_string_equality, 1024) } }
    x.report("String 10K")   { REPEATS.times { test_files(FILE1, FILE2, :test_string_equality, 10240) } }
    x.report("String 100K")  { REPEATS.times { test_files(FILE1, FILE2, :test_string_equality, 102400) } }
    x.report("String Whole") { REPEATS.times { test_files(FILE1, FILE2, :test_string_equality, nil) } }
    x.report("CRC32 1K")     { REPEATS.times { test_files(FILE1, FILE2, :test_crc32_equality, 1024) } }
    x.report("CRC32 10K")    { REPEATS.times { test_files(FILE1, FILE2, :test_crc32_equality, 10240) } }
    x.report("CRC32 100K")   { REPEATS.times { test_files(FILE1, FILE2, :test_crc32_equality, 102400) } }
    x.report("CRC32 Whole")  { REPEATS.times { test_files(FILE1, FILE2, :test_crc32_equality, nil) } }
    x.report("MD5 1K")       { REPEATS.times { test_files(FILE1, FILE2, :test_md5_equality, 1024) } }
    x.report("MD5 10K")      { REPEATS.times { test_files(FILE1, FILE2, :test_md5_equality, 10240) } }
    x.report("MD5 100K")     { REPEATS.times { test_files(FILE1, FILE2, :test_md5_equality, 102400) } }
    x.report("MD5 Whole")    { REPEATS.times { test_files(FILE1, FILE2, :test_md5_equality, nil) } }
  end
end
There are basically two options.
1. Read in the whole file, generate hash:
require 'digest/md5'
Digest::MD5.digest(File.read("data"))     # => string with binary hash
Digest::MD5.hexdigest(File.read("data"))  # => string with hexadecimal digits
2. Read block-wise, to save memory:
require 'digest/md5'
md5 = Digest::MD5.new
md5.update("chunk of data")
md5.update("another chunk of data")
md5.digest # => string with binary hash
md5.hexdigest # => string with hexadecimal digits
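Putting option 2 together for a real file might look like this
sketch (the buffer size is an arbitrary choice):

require 'digest/md5'

# Block-wise digest of a file: memory use stays constant no matter
# how large the file is.
def md5_of_file(path, buffer_size = 8192)
  md5 = Digest::MD5.new
  File.open(path, 'rb') do |f|
    while (chunk = f.read(buffer_size))
      md5.update(chunk)
    end
  end
  md5.hexdigest
end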
Hope that helps,
Stefan
On Saturday 29 July 2006 10:31, Ben Johnson wrote:
> Ben Johnson wrote:
> > Basically I want to generate an md5 hash from considerably large
> > files to determine if they are exactly the same. Is there a
> > better way to do this besides comparing md5 hashes?
> >
> > Thanks for your help.
>
> I neglected to include some necessary details, sorry about that.
> Basically the reason I want to do this is so I can store the md5 in
> the database and determine if I have come across this file before.
> So when I receive the file again I can md5 it, query my db, and if
> it's in my db I know I've come across this file before.
I use the SHA1 digest for this since it's "more unique", so that gives
me a warm fuzzy. The usage is the same though. Grab a digest, iterate
over the file in chunks, and update the digest. That way it doesn't
matter how large the file is. I do this on multi-gigabyte files and it
still takes less than a minute.
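For example, a sketch (the file name and chunk size are placeholders;
only the digest class differs from the MD5 version above):

require 'digest/sha1'

digest = Digest::SHA1.new
File.open('huge.iso', 'rb') do |f|           # placeholder file name
  digest << f.read(64 * 1024) until f.eof?   # << is an alias for update
end
puts digest.hexdigest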
On 31/07/06, Timothy Goddard <interfecus@gmail.com> wrote:
> def step_blocks(file_a, file_b, block_size)
>   until file_a.eof?
>     a = file_a.read(block_size)
>     b = file_b.read(block_size)
>     yield a, b
>   end
> end

Won't this return true for cases where the files are of different
sizes but not necessarily identical (e.g. file_b = file_a with
trailing stuff)?
--
Rasputin :: Jack of All Trades - Master of Nuns
http://number9.hellooperator.net/
Hello!
I conducted a few tests to compare the performance of different
comparison methods. I tested using string comparison, the zlib
library’s crc32 checksum, and the Digest::MD5 hash. The file is
iterated over in chunks and the 1K, 10K, etc refer to the size of the
chunks. There is also a whole file measure for each of them.

The test files were identical Ogg Vorbis audio files just below 8MB in
size (identical files should give worst-case performance). Times are
for 100 repetitions.
I'm sorry, but I don't quite understand the point of the tests: if
you have both files at hand, it is obviously faster to compare them
byte by byte, since you need to read every single byte anyway to
compute the MD5 or CRC32 of a file.
The latter are much handier when you have only one file at hand, say
the file you downloaded and a file on a remote server whose md5 is
published. Then you don't need to download the file again to make
sure nothing happened during the download!
Or am I completely missing your point?
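Something like this sketch, say (the published digest and the file
name are placeholders):

require 'digest/md5'

expected = "5d41402abc4b2a76b9719d911017c592"   # md5 published by the server
actual = Digest::MD5.hexdigest(File.read("pkg.tar.gz"))
puts(actual == expected ? "download intact" : "download corrupted")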
Cheers!
Vince
Timothy Goddard wrote:
Notice that using MD5 is significantly slower than normal string
comparison. This also demonstrates that there are few performance gains
between 10KB buffers and 100KB buffers, indicating that somewhere in
the 10K range would be a good buffer size for the memory/performance
tradeoff.
I notice that MD5 generation is not twice as time-consuming as
string comparison. In fact, it's only a little more time-consuming,
which was an interesting surprise until I checked the source code
and realized that Ruby uses the C reference implementation to
compute MD5.
Comparing strings is obviously the better choice for doing one-off
comparisons that won't be repeated. But for applications like
cache-management or public email systems, where you're going to be
comparing many times against the same chunk of bits, it makes more sense
to store an MD5. That way, subsequent trials only have to compute one
hash, not two.
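As a sketch of that idea (the Hash here just stands in for a real
cache or database):

require 'digest/md5'

# Store only the digest of each chunk of bits seen so far; every
# later trial then computes exactly one hash, not two.
seen = {}

def seen_before?(seen, data)
  key = Digest::MD5.digest(data)
  hit = seen.key?(key)
  seen[key] = true
  hit
end

seen_before?(seen, "some chunk of bits")  # => false
seen_before?(seen, "some chunk of bits")  # => true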
Someone upthread suggested using SHA1 instead of MD5 for this purpose. I
haven't done the comparison in Ruby, but in C implementations, SHA1 is
just slightly slower than MD5, not enough to matter. And Ruby's SHA1
implementation is also in C.
--
Posted via http://www.ruby-forum.com/.
Dick Davies wrote:
Won't this return true for cases where the files are of different
sizes but not necessarily identical (e.g. file_b = file_a with
trailing stuff)?
There's a separate check for file size. It checks first that both files
exist, then that they have the same size, then reads them.
The choice of CRC32/md5/sha1 is a time/space vs. false-positive
probability trade-off.
For normal uses, CRC (32 bits) + size should be enough. It has the
nice feature that it fits into a doubleword.
The advantage of md5 and sha1 is that they are one-way functions and
that collisions are hard to find.
So,
if you need that 30% speed gain or those 12 bytes per hash, and you
don't need attack resistance, and a probability of 2^-32 is low
enough, then use crc32.
If you do need attack resistance, I would choose sha1.
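A sketch of the crc32 + size key (the helper name is mine; the crc
is fed in chunks so big files don't need to fit in memory):

require 'zlib'

def quick_key(path)
  crc = 0
  File.open(path, 'rb') do |f|
    crc = Zlib.crc32(f.read(64 * 1024), crc) until f.eof?
  end
  [File.size(path), crc]  # compare these pairs for a cheap equality test
end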