I'm calculating md5 checksums on very large files (2 GB). This is a safe way to do so, right? Also... is the file closed when the block exits? I'm using 'rb' as this is used on Windows and Linux computers.
Hi - does the file really contain text lines, or is it a file
full of binary data? If it's a binary file, there may be no
guarantee the whole thing isn't one very long "line". In that
case I'd recommend reading it in chunks.
Untested:
require 'digest/md5'

md5 = Digest::MD5.new
File.open(file, 'rb') do |io|
  while (buf = io.read(4096)) && buf.length > 0
    md5.update(buf)
  end
end
i think the OP has the right approach - note that an 'f.read' will read
all 2GB into memory at once, but the OP's code
harp:~ > cat a.rb
require 'digest/md5'
md5 = Digest::MD5.new() and open(ARGV.shift, 'rb').each{|line| md5 << line}
p md5.hexdigest
will not.
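On the "is the file closed when the block exits?" part of the question, here is a minimal sketch, assuming the OP's code looks like the a.rb quoted above. The block there belongs to each, not to open, so the handle stays open until it is closed explicitly or garbage collected. The block form of File.open closes the file itself. The temp file is a hypothetical stand-in for the real 2 GB input.

```ruby
require 'digest/md5'
require 'tempfile'

# Hypothetical sample file standing in for the large input.
sample = Tempfile.new('md5-demo')
sample.binmode
sample.write("some\nbinary\0data\n" * 1000)
sample.close

# No block given to open: the File is returned and stays open after each finishes.
md5 = Digest::MD5.new
f = File.open(sample.path, 'rb')
f.each { |line| md5 << line }
leaked_open = !f.closed?   # true: nothing has closed the handle yet
f.close

# Block given to File.open: the file is closed when the block exits,
# including when the block raises.
handle = nil
md5_block = Digest::MD5.new
File.open(sample.path, 'rb') do |io|
  handle = io
  io.each { |line| md5_block << line }
end
closed_by_block = handle.closed?

puts "no block, still open afterwards: #{leaked_open}"
puts "block form, closed afterwards:   #{closed_by_block}"
puts "same digest: #{md5.hexdigest == md5_block.hexdigest}"
```

For a one-off script the leaked handle hardly matters, but in a loop over many files the block form is the safer habit.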
regards.
-a
···
On Sun, 19 Mar 2006, Stephen Waits wrote:
rtilley wrote:
I'm calculating md5 checksums on very large files (2 GB). This is a safe way to do so, right? Also... is the file closed when the block exits? I'm using 'rb' as this is used on Windows and Linux computers.
io.read(4096) returns nil at EOF, so your extra test for positive length is redundant. Also, for error handling I'd create the digest inside the block, because then the digest is never created if the file cannot be opened:
md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  while (buf = io.read(4096))
    dig.update(buf)
  end
  dig
end
If you want to increase efficiency, you can do this, which avoids creating a new string as the buffer on every iteration:
md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  while io.read(4096, buf)
    dig.update(buf)
  end
  dig
end
Here's another nice variant:
md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  buf = ""
  dig.update(buf) while io.read(4096, buf)
  dig
end
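To check the buffer-reuse claim, here is a rough sketch that runs both styles over a small temp file (a hypothetical stand-in for the real input) and records the object_id of the buffer on each pass. String.new is used instead of "" only so that the buffer stays mutable if frozen string literals are enabled. That is my assumption, not part of the code above.

```ruby
require 'digest/md5'
require 'tempfile'

# Hypothetical 10,000-byte sample, which is three reads at 4096 bytes each.
sample = Tempfile.new('md5-buf')
sample.binmode
sample.write('x' * 10_000)
sample.close

# Style 1: io.read(4096) hands back a fresh String for every chunk.
fresh = File.open(sample.path, 'rb') do |io|
  dig = Digest::MD5.new
  while (buf = io.read(4096))
    dig.update(buf)
  end
  dig
end

# Style 2: io.read(4096, buf) refills the same String on every call.
buffer_ids = []
reused = File.open(sample.path, 'rb') do |io|
  dig = Digest::MD5.new
  buf = String.new
  while io.read(4096, buf)
    buffer_ids << buf.object_id
    dig.update(buf)
  end
  dig
end

puts "digests match: #{fresh.hexdigest == reused.hexdigest}"
puts "reads: #{buffer_ids.size}, distinct buffer objects: #{buffer_ids.uniq.size}"
```

Both styles give the same digest. The second just produces less garbage for the collector, which may or may not be measurable next to the disk I/O.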
Kind regards
robert
···
Bill Kelly <billk@cts.com> wrote:
From: "rtilley" <rtilley@vt.edu>
I'm calculating md5 checksums on very large files (2 GB). This is a
safe way to do so, right? Also... is the file closed when the block
exits? I'm using 'rb' as this is used on Windows and Linux computers.
Hi - does the file really contain text lines, or is it a file
full of binary data? If it's a binary file, there may be no
guarantee the whole thing isn't one very long "line". In that
case I'd recommend reading it in chunks.
Untested:
require 'digest/md5'

md5 = Digest::MD5.new
File.open(file, 'rb') do |io|
  while (buf = io.read(4096)) && buf.length > 0
    md5.update(buf)
  end
end
io.read(4096) returns nil at EOF, so your extra test for positive length is redundant. Also, for error handling I'd create the digest inside the block, because then the digest is never created if the file cannot be opened:
md5 = File.open(file, 'rb') do |io|
  dig = Digest::MD5.new
  while (buf = io.read(4096))
    dig.update(buf)
  end
  dig
end
Thank you Robert, Billy and others! Your suggestions have helped me to solve the problem.
IMHO it's a bad idea to use line-oriented reading on a binary file because "lines" can be arbitrarily long (i.e. the whole file in the worst case). Using IO#read with a fixed chunk size is much better, since memory use stays bounded regardless of the file's contents.
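A small illustration of the worst case, under the assumption of a file that contains no newline byte at all (the 1 MiB temp file here is hypothetical). It compares the largest piece each approach ever holds in memory.

```ruby
require 'tempfile'

# Hypothetical binary-like file: 1 MiB of NUL bytes, no "\n" anywhere.
sample = Tempfile.new('no-newlines')
sample.binmode
sample.write("\x00" * (1024 * 1024))
sample.close

# Line-oriented reading: the only "line" is the entire file.
line_sizes = File.open(sample.path, 'rb') { |io| io.each_line.map(&:bytesize) }

# Chunked reading: no piece is larger than the requested 4096 bytes.
chunk_sizes = []
File.open(sample.path, 'rb') do |io|
  while (buf = io.read(4096))
    chunk_sizes << buf.bytesize
  end
end

puts "lines read: #{line_sizes.size}, largest: #{line_sizes.max} bytes"
puts "chunks read: #{chunk_sizes.size}, largest: #{chunk_sizes.max} bytes"
```

Scale the same file up to 2 GB and the line-oriented version has to build a single 2 GB string, which is exactly what reading in chunks avoids.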
Kind regards
robert
···
Andrew Johnson <ajohnson@cpan.org> wrote:
On Sun, 19 Mar 2006 13:49:51 +0900, ara.t.howard@noaa.gov <ara.t.howard@noaa.gov> wrote:
I think it is useful enough to have in the library.
#
# this in digest.rb or something equiv
#
require 'digest'

digests = %w( MD5 RMD160 SHA1 SHA256 SHA384 SHA512 )

digests.each do |d|
  digest_method = d.downcase

  # IO#md5, IO#sha1, ...: digest the stream from the current position to EOF
  IO.module_eval do
    define_method(digest_method) do |*argv|
      bufsize = argv.shift || 8192
      digest = ::Digest.const_get(d).new
      buf = ''
      off = pos rescue nil      # remember the position; nil if not seekable
      begin
        digest.update buf while read bufsize, buf
      ensure
        seek off rescue nil     # restore the position where possible
      end
      digest
    end
  end

  # File.md5(path), File.sha1(path), ...: open the file, digest it, close it
  File.module_eval do
    singleton_class = class << self; self; end
    singleton_class.module_eval do
      define_method(digest_method) do |path, *argv|
        mode = argv.shift || 'r'
        open(path, mode){|f| f.send digest_method}
      end
    end
  end
end

#
# demo
#
report = {}
digests.each do |d|
  digest_method = d.downcase
  report.update "File##{ digest_method}" => open(__FILE__){|f| f.send(digest_method).hexdigest}
  report.update "File.#{ digest_method}" => File.send(digest_method, __FILE__).hexdigest
end

# 'y report' relied on the YAML library of the time; to_yaml is the portable spelling
require 'yaml'
puts report.to_yaml
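For comparison, the standard digest library in the Ruby versions I know of already has a helper along these lines: Digest::MD5.file(path), and the same for the other digest classes. It reads the file in chunks internally and returns the digest object. A minimal sketch, checked against the manual loop from earlier in the thread, with a small hypothetical temp file as input:

```ruby
require 'digest'
require 'tempfile'

# Hypothetical sample file standing in for the real input.
sample = Tempfile.new('digest-file')
sample.binmode
sample.write('hello world')
sample.close

# Library helper: opens the file, feeds it to the digest in chunks, closes it.
from_file = Digest::MD5.file(sample.path).hexdigest

# Manual chunked loop, as in the thread, for comparison.
manual_digest = File.open(sample.path, 'rb') do |io|
  dig = Digest::MD5.new
  buf = String.new
  dig.update(buf) while io.read(8192, buf)
  dig
end
manual = manual_digest.hexdigest

puts from_file
puts "matches manual loop: #{from_file == manual}"

# The same helper is available on the other digest classes.
puts Digest::SHA256.file(sample.path).hexdigest
```

If that helper is available on the target machines, it is probably the simplest answer to the original question, since the chunking and the file closing are both handled for you.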