Safe way to calc md5 on very large files

Brad5 · 18 March 2006 23:43

I'm calculating md5 checksums on very large files (2 GB). This is a safe way to do so, right? Also... is the file closed when the block exits? I'm using 'rb' as this is used on Windows and Linux computers.

md5 = Digest::MD5.new()
File.open(file, 'rb').each {|line| md5.update(line)}

Stephen_Waits1 · 19 March 2006 04:04

rtilley wrote:

I'm calculating md5 checksums on very large files (2 GB). This is a safe way to do so, right? Also... is the file closed when the block exits? I'm using 'rb' as this is used on Windows and Linux computers.

md5 = Digest::MD5.new()
File.open(file, 'rb').each {|line| md5.update(line)}

Close.. try this..

require 'md5'
File.open(filename,'rb') { |f| MD5.hexdigest(f.read) }

And yes, the file is closed with the block form of open.

--Steve
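
To make the point about closing concrete: the block form of File.open closes the file when the block returns, while the OP's File.open(...).each chain leaves the file open until it is closed by hand or garbage-collected. A minimal sketch, assuming a throwaway temp file (all variable names here are illustrative, not from the thread):

```ruby
require 'tempfile'

# Throwaway file so the sketch is self-contained.
tmp = Tempfile.new('md5-demo')
tmp.write("line one\nline two\n")
tmp.close

# Non-block form: IO#each returns the File object itself, still open.
chained = File.open(tmp.path, 'rb').each { |line| line }
chained_closed = chained.closed?
chained.close # must be closed by hand (or left to the garbage collector)

# Block form: the File is closed as soon as the block returns.
scoped = nil
File.open(tmp.path, 'rb') { |f| scoped = f }
scoped_closed = scoped.closed?

puts "chained form closed? #{chained_closed}"
puts "block form closed?   #{scoped_closed}"
```

So the OP's one-liner does read the file safely, but it relies on the garbage collector to release the file handle.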

Bill_Kelly · 19 March 2006 05:45

I'm calculating md5 checksums on very large files (2 GB). This is a safe way to do so, right? Also... is the file closed when the block exits? I'm using 'rb' as this is used on Windows and Linux computers.

md5 = Digest::MD5.new()
File.open(file, 'rb').each {|line| md5.update(line)}

Hi - does the file really contain text lines, or is it a file
full of binary data? If it's a binary file, there is no
guarantee the whole thing isn't one very long "line". In that
case I'd recommend reading it in chunks.

Untested:

md5 = Digest::MD5.new()
File.open(file, 'rb') do |io|
  while (buf = io.read(4096)) && buf.length > 0
    md5.update(buf)
  end
end

Regards,

Bill

···

From: "rtilley" <rtilley@vt.edu>

Ara.T.Howard6 · 19 March 2006 04:49

i think the OP has the right approach - note that an 'f.read' will consume
2GB. but the OP's code

   harp:~ > cat a.rb
   require 'digest/md5'
   md5 = Digest::MD5.new() and open(ARGV.shift, 'rb').each{|line| md5 << line}
   p md5.hexdigest

will not.

regards.

-a

···

On Sun, 19 Mar 2006, Stephen Waits wrote:

rtilley wrote:

I'm calculating md5 checksums on very large files (2 GB). This is a safe way to do so, right? Also... is the file closed when the block exits? I'm using 'rb' as this is used on Windows and Linux computers.

md5 = Digest::MD5.new()
File.open(file, 'rb').each {|line| md5.update(line)}

Close.. try this..

require 'md5'
File.open(filename,'rb') { |f| MD5.hexdigest(f.read) }

And yes, the file is closed with the block form of open.

--Steve

--
share your knowledge. it's a way to achieve immortality.
- h.h. the 14th dali lama

Robert · 19 March 2006 12:48

io.read with a length argument returns nil at EOF, so your test for positive length is redundant. Also, for error handling I'd create the digest inside the block, because then the digest is never created if the file cannot be opened:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  while (buf = io.read(4096))
    dig.update(buf)
  end
  dig
end

If you want to increase efficiency, you can do this, which avoids creating a new string for every buffer:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  while io.read(4096, buf)
    dig.update(buf)
  end
  dig
end

Here's another nice variant:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  dig.update(buf) while io.read(4096, buf)
  dig
end

Kind regards

robert
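
The buffer-reuse variant can be checked with a short sketch: IO#read(length, buffer) fills the same String object on every call and returns nil at EOF. The temp file, the 10,000-byte sample and the variable names are assumptions made for illustration only:

```ruby
require 'digest/md5'
require 'tempfile'

data = 'x' * 10_000            # sample content, larger than one chunk
tmp = Tempfile.new('md5-buf')
tmp.binmode
tmp.write(data)
tmp.close

dig = Digest::MD5.new
buf = String.new               # mutable buffer, reused for every read
buffer_ids = []

File.open(tmp.path, 'rb') do |io|
  # read(length, buf) fills buf in place and returns nil at EOF
  while io.read(4096, buf)
    buffer_ids << buf.object_id
    dig.update(buf)
  end
end

same_buffer = buffer_ids.uniq.size == 1
matches_whole = dig.hexdigest == Digest::MD5.hexdigest(data)

puts "one buffer object reused: #{same_buffer}"
puts "matches whole-string digest: #{matches_whole}"
```

String.new is used instead of a "" literal so the sketch also works where string literals are frozen.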

···

Bill Kelly <billk@cts.com> wrote:

From: "rtilley" <rtilley@vt.edu>

I'm calculating md5 checksums on very large files (2 GB). This is a
safe way to do so, right? Also... is the file closed when the block
exits? I'm using 'rb' as this is used on Windows and Linux computers.

md5 = Digest::MD5.new()
File.open(file, 'rb').each {|line| md5.update(line)}

Hi - does the file really contain text lines, or is it a file
full of binary data? If it's a binary file, there is no
guarantee the whole thing isn't one very long "line". In that
case I'd recommend reading it in chunks.

Untested:

md5 = Digest::MD5.new()
File.open(file, 'rb') do |io|
  while (buf = io.read(4096)) && buf.length > 0
    md5.update(buf)
  end
end

Andrew_Johnson · 19 March 2006 05:08

In my reading of the OP, both the block-open and iteration are actually
desired:

  md5 = Digest::MD5.new
  File.open(file,'rb') do |ios|
    ios.each {|line| md5 << line }
  end

cheers,
andrew

···

On Sun, 19 Mar 2006 13:49:51 +0900, ara.t.howard@noaa.gov <ara.t.howard@noaa.gov> wrote:

On Sun, 19 Mar 2006, Stephen Waits wrote:

Close.. try this..

   require 'md5'
   File.open(filename,'rb') { |f| MD5.hexdigest(f.read) }

And yes, the file is closed with the block form of open.

--Steve

i think the OP has the right approach - note that an 'f.read' will consume
2GB. but the OP's code

   harp:~ > cat a.rb
   require 'digest/md5'
   md5 = Digest::MD5.new() and open(ARGV.shift, 'rb').each{|line| md5 << line}
   p md5.hexdigest

will not.

--
Andrew L. Johnson http://www.siaris.net/
What have you done to the cat? It looks half-dead.
-- Schroedinger's wife

Brad5 · 19 March 2006 14:48

Robert Klemme wrote:

io.read with a length argument returns nil at EOF, so your test for positive length is redundant. Also, for error handling I'd create the digest inside the block, because then the digest is never created if the file cannot be opened:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  while (buf = io.read(4096))
    dig.update(buf)
  end
  dig
end

Thank you Robert, Bill and others! Your suggestions have helped me solve the problem.

Tanaka_Akira1 · 19 March 2006 15:21

In article <48526dFif9i5U1@individual.net>,
"Robert Klemme" <bob.news@gmx.net> writes:

md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  while io.read(4096, buf)
    dig.update(buf)
  end
  dig
end

Why do we have no such method in the digest library?

I think it is useful enough to have in the library.

···

--
Tanaka Akira
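
For readers on a current Ruby: the digest library now offers a file method on digest classes, for example Digest::MD5.file(path), which reads the file in chunks rather than all at once. Whether it was available in the Ruby of March 2006 is not something this thread establishes. A minimal sketch, assuming a small temp file as sample input:

```ruby
require 'digest/md5'
require 'tempfile'

data = "example payload\n" * 1_000
tmp = Tempfile.new('md5-file')
tmp.binmode
tmp.write(data)
tmp.close

# Digest::MD5.file feeds the file's contents to a new digest object
# and returns that object.
from_file = Digest::MD5.file(tmp.path).hexdigest
from_string = Digest::MD5.hexdigest(data)

puts from_file
puts "same as digest of the in-memory string: #{from_file == from_string}"
```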

Robert · 19 March 2006 12:38

IMHO it's a bad idea to use line-oriented reading on a binary file because "lines" can be arbitrarily long (i.e. the whole file in the worst case). Using IO#read is much better.

Kind regards

robert
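
The worst case is easy to reproduce. In this sketch the sample data (100,000 bytes with no newline byte) and all names are assumptions for illustration: line-oriented reading returns the entire file as a single "line", while IO#read with a length never holds more than one chunk.

```ruby
require 'tempfile'

# 100,000 bytes of binary data containing no newline byte (0x0A).
data = "\x00\xFF".b * 50_000
tmp = Tempfile.new('binary-lines')
tmp.binmode
tmp.write(data)
tmp.close

# Line-oriented reading: count the "lines" and track the longest one.
line_count = 0
longest_line = 0
File.open(tmp.path, 'rb') do |io|
  io.each do |line|
    line_count += 1
    longest_line = [longest_line, line.bytesize].max
  end
end

# Chunked reading: each read returns at most 4096 bytes.
chunk_count = 0
File.open(tmp.path, 'rb') do |io|
  chunk_count += 1 while io.read(4096)
end

puts "lines: #{line_count}, longest: #{longest_line} bytes"
puts "chunks of up to 4096 bytes: #{chunk_count}"
```

With a 2 GB file of this kind, the line-oriented loop would need the whole 2 GB in memory at once.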

···

Andrew Johnson <ajohnson@cpan.org> wrote:

On Sun, 19 Mar 2006 13:49:51 +0900, ara.t.howard@noaa.gov <ara.t.howard@noaa.gov> wrote:

On Sun, 19 Mar 2006, Stephen Waits wrote:

Close.. try this..

   require 'md5'
   File.open(filename,'rb') { |f| MD5.hexdigest(f.read) }

And yes, the file is closed with the block form of open.

--Steve

i think the OP has the right approach - note that an 'f.read' will
consume 2GB. but the OP's code

   harp:~ > cat a.rb
   require 'digest/md5'
   md5 = Digest::MD5.new() and open(ARGV.shift, 'rb').each{|line| md5 << line}
   p md5.hexdigest

will not.

In my reading of the OP, both the block-open and iteration are
actually desired:

md5 = Digest::MD5.new
File.open(file,'rb') do |ios|
   ios.each {|line| md5 << line }
end

Ara.T.Howard6 · 19 March 2006 16:54

On Mon, 20 Mar 2006, Tanaka Akira wrote:

Why do we have no such method in the digest library?

I think it is useful enough to have in the library.

indeed. in fact this seems a good candidate to add a method to a base class:

     harp:~ > cat a.rb
     require 'digest/md5'
     require 'digest/rmd160'
     require 'digest/sha1'
     require 'digest/sha2'

     #
     # this in digest.rb or something equiv
     #
       digests = %w( MD5 RMD160 SHA1 SHA256 SHA384 SHA512 )

       digests.each do |d|
         digest_method = d.downcase

         IO.module_eval do
           define_method(digest_method) do |*argv|
             bufsize = argv.shift || 8192
             digest = ::Digest.const_get(d).new
             buf = ''
             off = pos rescue nil
             begin
               digest.update buf while read bufsize, buf
             ensure
               seek off rescue nil
             end
             digest
           end
         end

         File.module_eval do
           singleton_class = class << self; self; end
           singleton_class.module_eval do
             define_method(digest_method) do |path, *argv|
               mode = argv.shift || 'rb'  # binary by default so digests match on Windows too
               open(path, mode){|f| f.send digest_method}
             end
           end
         end
       end

     #
     # demo
     #
       report = {}
       digests.each do |d|
         digest_method = d.downcase
         report.update "File##{ digest_method}" => open(__FILE__){|f| f.send(digest_method).hexdigest}
         report.update "File.#{ digest_method}" => File.send(digest_method, __FILE__).hexdigest
       end
       require 'yaml' and y report

     harp:~ > ruby a.rb
     ---
     File.md5: 2e6c1e1c3d81a871f2c6b5099ba208f3
     File#md5: 2e6c1e1c3d81a871f2c6b5099ba208f3
     File.rmd160: 22ad54cb48f6d00ef325f1c7ff2150cf46fd250f
     File#rmd160: 22ad54cb48f6d00ef325f1c7ff2150cf46fd250f
     File.sha1: 1600889b027ced6bf95dedc9803cb7c65f5aa396
     File#sha1: 1600889b027ced6bf95dedc9803cb7c65f5aa396
     File.sha256: 38ac0f761f16a13d2f4f51a8a8c9668656d84c29b383840579a7517b69d219a9
     File#sha256: 38ac0f761f16a13d2f4f51a8a8c9668656d84c29b383840579a7517b69d219a9
     File.sha384: 5882c884ea618539da50a36bfbbd0fa0cd41bfa2ee18bce5acf45965e5582e33a1a3edd269f0e3551a9c9e5cd6e77cd1
     File#sha384: 5882c884ea618539da50a36bfbbd0fa0cd41bfa2ee18bce5acf45965e5582e33a1a3edd269f0e3551a9c9e5cd6e77cd1
     File.sha512: 3fba99ff4d98feaf760b814e9a8f245e05881da9aa19378510172d4e7cb0a10aa98b6c1d9b22d4331f3552a5899bb5545c604dfc4620665a5b6fb0d4dc2b0b78
     File#sha512: 3fba99ff4d98feaf760b814e9a8f245e05881da9aa19378510172d4e7cb0a10aa98b6c1d9b22d4331f3552a5899bb5545c604dfc4620665a5b6fb0d4dc2b0b78

comments?

-a
--
share your knowledge. it's a way to achieve immortality.
- h.h. the 14th dali lama

Erik_Veenstra2 · 19 March 2006 19:23

Why do we have no such method in the digest library?

I extended the MD5 class with a class method to build an MD5
object directly from the contents of a given file.

Use it like this:

md5 = MD5.file("foo.bar")

gegroet,
Erik V. - http://www.erikveen.dds.nl/

···

----------------------------------------------------------------

require "md5"

class MD5
   def self.file(file)
     File.open(file, "rb") do |f|
       res = self.new
       while (data = f.read(4096))
         res << data
       end
       res
     end
   end
end

----------------------------------------------------------------

Brad5 · 19 March 2006 19:33

Erik Veenstra wrote:

Why do we have no such method in the digest library?

I extended the MD5 class with a class method to build an MD5
object directly from the contents of a given file.

Should this be done to sha1, sha2, etc?

···

Use it like this:

md5 = MD5.file("foo.bar")

gegroet,
Erik V. - http://www.erikveen.dds.nl/

----------------------------------------------------------------

require "md5"

class MD5
   def self.file(file)
     File.open(file, "rb") do |f|
       res = self.new
       while (data = f.read(4096))
         res << data
       end
       res
     end
   end
end

----------------------------------------------------------------
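
On the question of SHA1, SHA2 and the rest: the same chunked loop works for any digest class, so one option is to pass the class in as a parameter instead of reopening each class. The sketch below is only an illustration; file_digest is a hypothetical helper name, not part of the digest library, and the 4096-byte chunk size is simply carried over from the code above.

```ruby
require 'digest/md5'
require 'digest/sha1'
require 'digest/sha2'
require 'tempfile'

# Hypothetical helper (not part of the digest library): feeds a file to any
# digest class in fixed-size chunks and returns the digest object.
def file_digest(path, digest_class = Digest::MD5, chunk_size = 4096)
  File.open(path, 'rb') do |f|
    res = digest_class.new
    while (data = f.read(chunk_size))
      res << data
    end
    res
  end
end

# Sample input, assumed for the demonstration.
data = 'abc' * 5_000
tmp = Tempfile.new('any-digest')
tmp.binmode
tmp.write(data)
tmp.close

md5_hex    = file_digest(tmp.path).hexdigest
sha1_hex   = file_digest(tmp.path, Digest::SHA1).hexdigest
sha256_hex = file_digest(tmp.path, Digest::SHA256).hexdigest

puts md5_hex, sha1_hex, sha256_hex
```

Passing the class as an argument keeps the core classes unmodified, at the cost of a slightly less compact call than MD5.file("foo.bar").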
