Best/better way of md5summing a really large file in Ruby?

I've got a script that goes through data and, in some cases,
generates md5s of the files. Normally this isn't a problem, but I've
got a few largish (~2 GB) files in there, and my script is dying on
them. I ran it in a screen session, so I'm not sure of the exact error
it threw, but I'm re-running just that part now to find out. In the
meantime, any suggestions?

This is how I'm generating the md5sum right now:
require 'digest/md5'
Digest::MD5.hexdigest(File.read(fn))

--Kyle
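
For context, File.read(fn) pulls the whole file into memory before hashing, which is a likely (though unconfirmed) reason a ~2 GB file kills the script. A minimal sketch of hashing in fixed-size chunks follows; the helper name md5_of_file and the 1 MiB default chunk size are assumptions for illustration, not something from the thread.

```ruby
require 'digest/md5'

# Hypothetical helper: hash a file chunk by chunk so that only
# chunk_size bytes are held in memory at any one time.
def md5_of_file(path, chunk_size = 1024 * 1024)
  digest = Digest::MD5.new
  File.open(path, 'rb') do |io|
    # IO#read(n) returns nil at end of file, which ends the loop.
    while (chunk = io.read(chunk_size))
      digest.update(chunk)
    end
  end
  digest.hexdigest
end
```

Usage would be md5_of_file(fn) in place of the one-liner above. Ruby's digest library also offers Digest::MD5.file(fn).hexdigest, which streams the file in a similar way and may be enough on its own.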

Kyle Schmitt wrote:
> I've got a script that goes through data and, in some cases,
> generates md5s of the files. Normally this isn't a problem, but I've
> got a few largish (~2 GB) files in there, and my script is dying on
> them. I ran it in a screen session, so I'm not sure of the exact error
> it threw, but I'm re-running just that part now to find out. In the
> meantime, any suggestions?

I googled for 'md5 large files' and ended up here:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/184834

yun

--
Yun Huang Yong
yun@nomitor.com ...nom nom nom
--

rthompso@raker /cpartition/hold $ ls -rlt dummyfile
-rw-r--r-- 1 rthompso staff 2147483648 2009-04-22 10:27 dummyfile
rthompso@raker /cpartition/hold $ irb
irb(main):001:0> result = %x[md5sum dummyfile]
=> "a981130cf2b7e09f4686dc273cf7187e dummyfile\n"
irb(main):002:0> p result
"a981130cf2b7e09f4686dc273cf7187e dummyfile\n"
=> nil
irb(main):003:0> def timeit
irb(main):004:1> tstart = Time.now
irb(main):005:1> result = %x[md5sum dummyfile]
irb(main):006:1> tend = Time.now
irb(main):007:1> elapsed = tend - tstart
irb(main):008:1> puts elapsed.to_s
irb(main):009:1> end
=> nil
irb(main):011:0> timeit
10.633416
=> nil
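
The %x[md5sum dummyfile] call above returns the whole output line, digest and file name together. A hedged sketch of pulling out just the digest, and of invoking md5sum without going through a shell, could look like this; both helper names are hypothetical, and it assumes a GNU-style md5sum is on the PATH.

```ruby
require 'open3'

# Hypothetical helper: extract the 32-character hex digest from one line
# of md5sum output, e.g. "a981130cf2b7e09f4686dc273cf7187e dummyfile\n".
# Returns nil if the line does not start with a digest.
def parse_md5sum_line(line)
  line[/\A\h{32}(?=\s)/]
end

# Hypothetical helper: run md5sum on a path and return the hex digest.
# Passing the arguments separately avoids shell interpolation of the path.
def md5_via_md5sum(path)
  out, status = Open3.capture2('md5sum', '--', path)
  raise "md5sum failed for #{path}" unless status.success?
  parse_md5sum_line(out) or raise "unexpected md5sum output: #{out.inspect}"
end
```

Unlike %x[], this raises if md5sum exits with an error instead of silently returning partial output.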

On Wed, 2009-04-22 at 23:18 +0900, Yun Huang Yong wrote:

> Kyle Schmitt wrote:
> > I've got a script that goes through data and, in some cases,
> > generates md5s of the files. Normally this isn't a problem, but I've
> > got a few largish (~2 GB) files in there, and my script is dying on
> > them. I ran it in a screen session, so I'm not sure of the exact error
> > it threw, but I'm re-running just that part now to find out. In the
> > meantime, any suggestions?
>
> I googled for 'md5 large files' and ended up here:
> http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/184834
>
> yun

more realistic...
rthompso@raker /cpartition/hold $ dd if=/dev/urandom of=dummyfile count=4M
4194304+0 records in
4194304+0 records out
2147483648 bytes (2.1 GB) copied, 529.518 s, 4.1 MB/s
rthompso@raker /cpartition/hold $ irb
irb(main):001:0> def timeit
irb(main):002:1> tstart = Time.now
irb(main):003:1> result = %x[md5sum dummyfile]
irb(main):004:1> tend = Time.now
irb(main):005:1> elapsed = tend - tstart
irb(main):006:1> puts elapsed.to_s
irb(main):007:1> end
=> nil
irb(main):008:0> timeit
49.366641
=> nil
irb(main):009:0> timeit
48.416673
=> nil
irb(main):010:0>

On Wed, 2009-04-22 at 23:34 +0900, Reid Thompson wrote:

> rthompso@raker /cpartition/hold $ ls -rlt dummyfile
> -rw-r--r-- 1 rthompso staff 2147483648 2009-04-22 10:27 dummyfile
> rthompso@raker /cpartition/hold $ irb
> irb(main):001:0> result = %x[md5sum dummyfile]
> => "a981130cf2b7e09f4686dc273cf7187e dummyfile\n"
> irb(main):002:0> p result
> "a981130cf2b7e09f4686dc273cf7187e dummyfile\n"
> => nil
> irb(main):003:0> def timeit
> irb(main):004:1> tstart = Time.now
> irb(main):005:1> result = %x[md5sum dummyfile]
> irb(main):006:1> tend = Time.now
> irb(main):007:1> elapsed = tend - tstart
> irb(main):008:1> puts elapsed.to_s
> irb(main):009:1> end
> => nil
> irb(main):011:0> timeit
> 10.633416
> => nil

Thanks, both of you. I'd rather not shell out using %x[], but I may end
up doing that. I tried the modified MD5, and it actually ran in close
to the same time on my work machine; I'll have to see how it does on
my home one.

--Kyle
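
To put an in-process digest next to the %x[md5sum ...] timings earlier in the thread, something like the following could be used. It is only a sketch: the helper name time_md5 is hypothetical, and it times Ruby's built-in Digest::MD5.file rather than the modified MD5 code mentioned above, which is not shown in this thread.

```ruby
require 'benchmark'
require 'digest/md5'

# Hypothetical helper: hash a file in-process and report how long it took.
# Returns [hex_digest, elapsed_seconds].
def time_md5(path)
  hex = nil
  seconds = Benchmark.realtime { hex = Digest::MD5.file(path).hexdigest }
  [hex, seconds]
end

# Example use (file name taken from the thread):
#   hex, seconds = time_md5('dummyfile')
#   puts "#{hex} in #{seconds.round(2)}s"
```

As with the timeit runs above, repeated runs on the same file can differ noticeably depending on whether the file is already in the OS page cache.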