A few questions of function and style from a newbie

Sven_Johansson · 31 December 2005 11:04

Hi, good people of clr,

I just dipped into the goodness that is ruby for the first time
yesterday, and while this group and the online docs proved useful, I'm
left somewhat bewildered by a few things. Environment: Win XP SP2,
one-click-install 1.8.2 ruby.

1) Current working directories:
I currently use

f = __FILE__
len = -f.length
my_dir = File::expand_path(f)[0...len]

To find the script's current working directory. Snappier alternatives
such as

my_dir = File.dirname(__FILE__)

just report back with ".", which, while true, isn't exactly helpful.

Problem: this only works if the script is invoked from the command line
as "ruby this.rb". Trying to invoke it by double-clicking on the script
in the windows explorer makes the above function return an empty
string. Is there any way, short of embedding the call to ruby in a bat
file, to make ruby read its current working directory even if invoked
by double-clicking?

2) MD5 hashes and file handles:
I currently use something like

Dir['*'].each {|f| print Digest::MD5.hexdigest(open(f, 'rb').read), ' ', f, "\n"}

I tried stuff like

Dir['*'].each {|f| print f, " "; puts Digest::MD5.hexdigest(File.read(f))}
or
dig=Digest::MD5.new
dig.update(file)

and they both seem to suffer from some sort of buffer on the directory
reading; that is, they'll produce the same hash for several files when
scanning a large directory. The first line above bypasses this, I
suppose by the 'rb' reading mode on the file handle. Is there any way
to unbuffer the directory file handle stream (akin to Perl's $|=1)?

3) Finally, I submit my very first ruby script for merciless
criticism. What here could have been done otherwise? What screams for a
better ruby solution? I'm aware that I should probably look into
split instead of relying so much on regexps for splitting and I was
trying to set up a structure like hash[key]=[a,b], but I found I could
not access hash.each_pair { |key,value] puts key, value(0), value (1)
}.

------------------------------------------------------------------
require 'digest/md5'
require 'fileutils'

# Variables to set manually
global_digest_index='C:/srfctrl/indexfile/globalindex.txt'
global_temp_directory='C:/srfctrl/tempstore/'
global_collide_directory='C:/srfctrl/collide/'

# Begin program
f = __FILE__
len = -f.length
my_dir = File::expand_path(f)[0...len]
my_dirname = my_dir.sub(/^.+\/(\w+?)\/$/,'\1')

puts my_dir
puts my_dirname

digest_map_name={}
digest_map_directory={}

IO.foreach(global_digest_index) { |line|
  th_dige=line.sub(/^.+?\:(.+?)\:.+?$/,'\1').chomp
  th_fnam=line.sub(/^.+?\:.+?\:(.+?)$/,'\1').chomp
  th_dir=line.sub(/^(.+?)\:.+?\:.+?$/,'\1').chomp
  digest_map_name[th_dige] = th_fnam
  digest_map_directory[th_dige] = th_dir
}

filecnt = filesuc = 0
outfile = File.new(global_digest_index, "a")
Dir['*'].each do |file_name|
  next unless (file_name =~ /\.mp3$|\.ogg$/i)
  filecnt += 1
  hex = Digest::MD5.hexdigest(open(file_name, 'rb').read)
  if digest_map_name.has_key?(hex) then
    collfilestrip = digest_map_name[hex].sub(/\.mp3$|\.ogg$/i,'')
    id_name = global_collide_directory + digest_map_directory[hex].to_s +
      '_' + collfilestrip + '_' + file_name
    FileUtils.cp(file_name,id_name)
  else
    filesuc +=1
    digest_map_name[hex] = file_name
    digest_map_directory[hex] = my_dirname
    outfile.puts my_dirname + ':' + hex + ':' + file_name
    id_name = global_temp_directory + file_name
    FileUtils.cp(digest_map_name[hex],id_name)
  end
end
outfile.close

puts "Processed " + filecnt.to_s + " files, out of which " +
filesuc.to_s + " were not duplicates."
----------------------------------------------

Robert · 31 December 2005 11:06

Hi, good people of clr,

I just dipped into the goodness that is ruby for the first time
yesterday, and while this group and the online docs proved useful, I'm
left somewhat bewildered by a few things. Environment: Win XP SP2,
one-click-install 1.8.2 ruby.

1) Current working directories:
I currently use

f = __FILE__
len = -f.length
my_dir = File::expand_path(f)[0...len]

To find the script's current working directory.

No, you get the script's path - although this will incidentally match the working directory when run in Windows (because the working directory defaults to the script directory).

Snappier alternatives
such as

my_dir = File.dirname(__FILE__)

just report back with ".", which, while true, isn't exactly helpful.

You want File.expand_path like in

File.expand_path('.')

=> "/home/Robert"

Now:

working_dir = File.expand_path( Dir.getwd )
script_dir = File.expand_path( File.dirname(__FILE__) )
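
To make the distinction concrete, here is a minimal sketch that prints both directories and resolves a file relative to the script rather than to wherever the process was started. The helper name beside_script and the file name globalindex.txt are only illustrative assumptions, not something from the original script's layout.

```ruby
# Sketch: tell the working directory and the script directory apart.
working_dir = File.expand_path(Dir.pwd)                 # where the process was started
script_dir  = File.expand_path(File.dirname(__FILE__))  # where this file lives

# Hypothetical helper: build a path to a file that sits next to a script.
def beside_script(name, script = __FILE__)
  File.join(File.expand_path(File.dirname(script)), name)
end

puts "working dir: #{working_dir}"
puts "script dir:  #{script_dir}"
puts "index file:  #{beside_script('globalindex.txt')}"
```

Anchoring paths on the script directory keeps the script independent of how it was launched (command line, double-click, or a shortcut with a different start directory).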

Problem: this only works if the script is invoked from the command
line as "ruby this.rb". Trying to invoke it by double-clicking on the
script in the windows explorer makes the above function return an
empty string. Is there any way, short of embedding the call to ruby
in a bat file, to make ruby read its current working directory even
if invoked by double-clicking?

See above.

2) MD5 hashes and file handles:
I currently use something like

Dir['*'].each {|f| print Digest::MD5.hexdigest(open(f, 'rb').read), ' ', f, "\n"}

I tried stuff like

Dir['*'].each {|f| print f, " "; puts Digest::MD5.hexdigest(File.read(f))}
or
dig=Digest::MD5.new
dig.update(file)

and they both seem to suffer from some sort of buffer on the directory
reading; that is, they'll produce the same hash for several files when
scanning a large directory. The first line above bypasses this, I
suppose by the 'rb' reading mode on the file handle. Is there any way
to unbuffer the directory file handle stream (akin to Perl's $|=1)?

Your code in the first line has at least these problems:

1) You don't check for directories, i.e., you'll try to create MD5 of directories as well.

2) You don't close files properly. You should use the block form of File.open - that way file handles are always closed properly and timely.

Alternatives

Dir['*'].each {|f| File.open(f,'rb') {|io| print f, " ", Digest::MD5.hexdigest(io.read), "\n" } if File.file? f}

Dir['*'].each {|f| print f, " ", Digest::MD5.hexdigest(File.open(f,'rb') {|io| io.read}), "\n" if File.file? f}

I can't reproduce the problem you state (identical digests) with the other lines of code. I tried

Dir['*'].each {|f|print f, " "; puts Digest::MD5.hexdigest(File.read(f)) if File.file? f}

But the problem here is that the file is not opened in binary mode which is a must for this to work.

3) Finally, I submit my very first ruby script for merciless
criticism. What here could have been done otherwise? What screams for
a better ruby solution? I'm aware that I should probably look into
split instead of relying so much on regexps for splitting and I was
trying to set up a structure like hash[key]=[a,b], but I found I could
not access hash.each_pair { |key,value] puts key, value(0), value (1)
}.
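
On the hash[key]=[a,b] question: the snippet quoted above fails because the block parameters are closed with "]" instead of "|", and because value(0) is a method call, while array elements are read with square brackets (value[0]). A minimal sketch of both that structure and the split-based parsing, assuming the "directory:digest:filename" line layout of the index file; the helper name parse_index and the sample lines are made up for illustration:

```ruby
# Hypothetical helper: turn "directory:digest:filename" lines into
# a hash of digest => [directory, filename].
def parse_index(lines)
  index = {}
  lines.each do |line|
    # limit of 3 keeps any further colons inside the file name
    dir, digest, name = line.chomp.split(':', 3)
    index[digest] = [dir, name]
  end
  index
end

# Made-up sample data in the same layout as the index file.
sample = [
  "albums:6ce4ad47bfa79b6c0e48636040c1dfb9:001.mp3\n",
  "live:4cac5ea5e666942920aff937aa9b3ee5:0022-042.ogg\n"
]

index = parse_index(sample)

# Either index into the array value...
index.each_pair { |digest, value| puts "#{digest} #{value[0]} #{value[1]}" }

# ...or destructure it directly in the block parameters.
index.each_pair { |digest, (dir, name)| puts "#{digest} #{dir} #{name}" }
```

One split per line replaces the three regexp substitutions, and one hash replaces the two parallel ones.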

------------------------------------------------------------------
require 'digest/md5'
require 'fileutils'

# Variables to set manually
global_digest_index='C:/srfctrl/indexfile/globalindex.txt'
global_temp_directory='C:/srfctrl/tempstore/'
global_collide_directory='C:/srfctrl/collide/'

# Begin program
f = __FILE__
len = -f.length
my_dir = File::expand_path(f)[0...len]
my_dirname = my_dir.sub(/^.+\/(\w+?)\/$/,'\1')

puts my_dir
puts my_dirname

digest_map_name={}
digest_map_directory={}

IO.foreach(global_digest_index) { |line|
th_dige=line.sub(/^.+?\:(.+?)\:.+?$/,'\1').chomp
th_fnam=line.sub(/^.+?\:.+?\:(.+?)$/,'\1').chomp
th_dir=line.sub(/^(.+?)\:.+?\:.+?$/,'\1').chomp
digest_map_name[th_dige] = th_fnam
digest_map_directory[th_dige] = th_dir
}

filecnt = filesuc = 0
outfile = File.new(global_digest_index, "a")
Dir['*'].each do |file_name|
next unless (file_name =~ /\.mp3$|\.ogg$/i)
filecnt += 1
hex = Digest::MD5.hexdigest(open(file_name, 'rb').read)
if digest_map_name.has_key?(hex) then
   collfilestrip = digest_map_name[hex].sub(/\.mp3$|\.ogg$/i,'')
   id_name = global_collide_directory + digest_map_directory[hex].to_s +
     '_' + collfilestrip + '_' + file_name
   FileUtils.cp(file_name,id_name)
else
   filesuc +=1
   digest_map_name[hex] = file_name
   digest_map_directory[hex] = my_dirname
   outfile.puts my_dirname + ':' + hex + ':' + file_name
   id_name = global_temp_directory + file_name
   FileUtils.cp(digest_map_name[hex],id_name)
end
end
outfile.close

puts "Processed " + filecnt.to_s + " files, out of which " +
filesuc.to_s + " were not duplicates."
----------------------------------------------

It's not completely clear to me what you want to do here. Apparently you check a number of audio files and shove them somewhere else based on some criterion. What's the aim of doing this?

Kind regards

robert

···

Sven Johansson <sven_u_johansson@spray.se> wrote:

Sven_Johansson · 31 December 2005 11:06

Robert Klemme wrote:

Thank you for your response! Quick and clarifying at the same time.

No, you get the script's path - although this will incidentally match with
the working directory when run in Windows (because the working directory
defaults to the script directory).

You want File.expand_path like in

>> File.expand_path('.')
=> "/home/Robert"

Now:

working_dir = File.expand_path( Dir.getwd )
script_dir = File.expand_path( File.dirname(__FILE__) )

Yes, indeed. All those work as advertised, even from the explorer
shell. Thanks!

Your code in the first line has at least these problems:

1) You don't check for directories, i.e., you'll try to create MD5 of
directories as well.

2) You don't close files properly. You should use the block form of
File.open - that way file handles are always closed properly and timely.

Alternatives

Dir['*'].each {|f| File.open(f,'rb') {|io| print f, " ",
Digest::MD5.hexdigest(io.read), "\n" } if File.file? f}

Dir['*'].each {|f| print f, " ", Digest::MD5.hexdigest(File.open(f,'rb') {|io| io.read}), "\n" if File.file? f}

I can't reproduce the problem you state (identical digests) with the other
lines of code. I tried

Dir['*'].each {|f|print f, " "; puts Digest::MD5.hexdigest(File.read(f)) if
File.file? f}

Using:
require 'digest/md5'
Dir['*'].each {|f| print f, " "; puts Digest::MD5.hexdigest(File.read(f)) if File.file? f}

gives

001.mp3 6ce4ad47bfa79b6c0e48636040c1dfb9
002.mp3 6ce4ad47bfa79b6c0e48636040c1dfb9
0022-042.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-043.ogg 5947035093bbfa22a9e7cf6e69b82a4e
0022-044.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-045.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-046.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-047.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-048.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-049.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-050.ogg 5947035093bbfa22a9e7cf6e69b82a4e
0022-057.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-058.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-059.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-061.ogg a7d6f03e275d69b363b9771c9d88e681
0022-062.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-069.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-070.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-071.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-072.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-073.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-074.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-077.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-078.ogg 5947035093bbfa22a9e7cf6e69b82a4e
[snip]

which clearly isn't good. However, both your suggested alternatives
above work just fine. It would seem that binary mode really is a must
on Win32 - exchanging 'rb' for 'r' in those suggestions gives me the
hash repeat problem again. Good to know.

But the problem here is that the file is not opened in binary mode which is
a must for this to work.

Yes, so it would seem.

It's not completely clear to me what you want to do here. Apparently you
check a number of audio files and shove them somewhere else based on some
criterion. What's the aim of doing this?

Oh, it works as it's supposed to do, so I'm not really trying to debug
it. It takes the hashes of all the files in a directory, compares them
to a global list of hashes, appends the new unique hashes to that list
and moves the corresponding files someplace, moves files that already
have "their" hashes in the list someplace else. The rest is just
morphing file names.

I was looking for more input along the line of "that's not how we do it
in ruby - this is how we would express this particular sort of
statement".

I realise that the first thing I should do is probably to read the
files by block instead of slurping them in wholesale, and that I would
be far better off maintaining the global list of hashes in a DB
instead of in a text file. I'll try my hand at the first, now that
I've gotten the hash and filehandle issue resolved above... as for the
second, taking a peek at this group reveals that making ruby talk with
mysql on Win32 isn't for the faint of heart, so I'll let that be for
now.
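
The block-wise reading mentioned here could look roughly like the following. This is only a sketch: the helper name md5_of_file and the 64 KB chunk size are assumptions, while Digest::MD5.new, update and hexdigest are the standard incremental digest interface.

```ruby
require 'digest/md5'

# Hypothetical helper: hash a file chunk by chunk instead of slurping it,
# always in binary mode so Windows text-mode translation cannot interfere.
def md5_of_file(path, chunk_size = 64 * 1024)
  digest = Digest::MD5.new
  File.open(path, 'rb') do |io|
    # io.read(n) returns nil once the end of the file is reached
    while (chunk = io.read(chunk_size))
      digest.update(chunk)
    end
  end
  digest.hexdigest
end

# Usage: ruby md5_chunks.rb file1 file2 ...
ARGV.each { |f| puts "#{md5_of_file(f)} #{f}" }
```

Memory use then stays bounded by the chunk size regardless of how large the audio files are, and the block form of File.open closes each handle promptly.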

Thanks again!
/Sven

···

Sven Johansson <sven_u_johansson@spray.se> wrote:

Robert · 31 December 2005 11:10

Robert Klemme wrote:

Thank you for your response! Quick and clarifying at the same time.

You're welcome!

<snip/>

which clearly isn't good. However, both your suggested alternatives
above work just fine. It would seem that binary mode really is a must
on Win32 - exchanging 'rb' for 'r' in those suggestions gives me the
hash repeat problem again. Good to know.

When calculating the hash digest of a file, binary mode is really the only reasonable way to do it. Guess you just found another reason.

It's not completely clear to me what you want to do here.
Apparently you check a number of audio files and shove them
somewhere else based on some criterion. What's the aim of doing
this?

Oh, it works as it's supposed to do, so I'm not really trying to debug
it. It takes the hashes of all the files in a directory, compares them
to a global list of hashes, appends the new unique hashes to that list
and moves the corresponding files someplace, moves files that already
have "their" hashes in the list someplace else. The rest is just
morphing file names.

I was looking for more input along the line of "that's not how we do
it in ruby - this is how we would express this particular sort of
statement".

Yes, I was aware of that. I just wanted to know the purpose of the code so I might be able to make more appropriate statements.

I realise that the first thing I should do is probably to read the
files by block instead of slurping them in wholesale, and that I would
be far better off maintaining the global list of hashes in a DB
instead of in a text file. I'll try my hand at the first, now that
I've gotten the hash and filehandle issue resolved above... as for the
second, taking a peek at this group reveals that making ruby talk with
mysql on Win32 isn't for the faint of heart, so I'll let that be for
now.

The easiest way to store some arbitrary Ruby structure is to use YAML or Marshal. I'd probably do something like this:

require 'digest/md5'
require 'fileutils'

REPO_FILE = "repo.bin".freeze

class Repository
  attr_accessor :main_dir, :duplicate_dir, :extensions

  def initialize(extensions = %w{mp3 ogg})
    @extensions = extensions
    @repository = {}
  end

  def process_dir(dir)
    # find all files with the extensions we support;
    # Dir[] already returns paths that include dir
    Dir[File.join(dir, "*.{#{extensions.join(',')}}")].each do |f|
      process_file(f)
    end
  end

  def process_file(file)
    digest = digest(file)
    name = @repository[digest]

    if name
      target = duplicate_dir
      # ...
    else
      target = main_dir
      # ...
    end

    FileUtils.cp( file, File.join( target, File.basename( file ) ) )
  end

  def digest(file)
    Digest::MD5.hexdigest( File.open(file, 'rb') {|io| io.read})
  end

  def self.load(file)
    File.open(file, 'rb') {|io| Marshal.load(io)}
  end

  def save(file)
    File.open(file, 'wb') {|io| Marshal.dump(self, io)}
  end
end

repo = begin
  Repository.load( REPO_FILE )
rescue Exception => e
  # not there => create
  r = Repository.new
  r.main_dir = "foo"
  r.duplicate_dir = "bar"
  r
end

ARGV.each {|dir| repo.process_dir(dir)}

repo.save( REPO_FILE )

The main point here is to encapsulate certain functionality in methods of their own. This greatly increases readability and reusability.
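
For the YAML route mentioned above, a plain hash of digest => [directory, filename] can be written and read back with very little code. This is a sketch only: save_index and load_index are hypothetical helper names and are not part of the Repository class.

```ruby
require 'yaml'

# Hypothetical helper: write the digest index as human-readable YAML.
def save_index(path, index)
  File.open(path, 'w') { |io| io.write(index.to_yaml) }
end

# Hypothetical helper: read the index back, or start empty if there is none yet.
def load_index(path)
  return {} unless File.exist?(path)
  YAML.load_file(path) || {}   # an empty file yields no document
end
```

Compared with Marshal, the YAML file can be inspected and edited in a text editor, which suits an index that used to be a hand-readable text file; Marshal is the more direct choice when a whole object such as the Repository needs to be stored.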

Kind regards

robert

···

Sven Johansson <sven_u_johansson@spray.se> wrote:

Tim_Hammerquist · 31 December 2005 22:32

[ snip ]

Using:
require 'digest/md5'
Dir['*'].each {|f| print f, " "; puts Digest::MD5.hexdigest(File.read(f)) if File.file? f}

gives

001.mp3 6ce4ad47bfa79b6c0e48636040c1dfb9
002.mp3 6ce4ad47bfa79b6c0e48636040c1dfb9
0022-042.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-043.ogg 5947035093bbfa22a9e7cf6e69b82a4e
0022-044.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-045.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-046.ogg 4cac5ea5e666942920aff937aa9b3ee5
0022-047.ogg 4cac5ea5e666942920aff937aa9b3ee5
[snip]

which clearly isn't good. However, both your suggested
alternatives above work just fine. It would seem that binary
mode really is a must on Win32 - exchanging 'rb' for 'r' in
those suggestions gives me the hash repeat problem again. Good
to know.

Just for our edification, would you run this following code on
those same files?

require 'digest/md5'

files = Dir['*'].select { |f| File.file?(f) }

files.each { |filename|
    fs_size = File.size(filename)       # get size of file from OS

    data = File.read(filename)          # read the file
    data_size = data.length             # get the size of the data read

    hash = Digest::MD5.hexdigest(data)  # calculate hash

    # compare amount of data on filesystem
    # with amount of data read
    puts "#{hash} - #{filename}: #{data_size}/#{fs_size}"
}

If my guess is correct, you should get some strange results.
Please post these results, no matter the outcome, to the group
so we can preserve a real-world example.

Cheers!
Tim Hammerquist

···

Sven Johansson <sven_u_johansson@spray.se> wrote:

Sven_Johansson · 31 December 2005 11:24

Robert Klemme wrote:

The easiest way to store some arbitrary Ruby structure is to use YAML or
Marshal. I'd probably do something like this:

REPO_FILE = "repo.bin".freeze

class Repository
  attr_accessor :main_dir, :duplicate_dir, :extensions

  def initialize(extensions = %w{mp3 ogg})
    @extensions = extensions
    @repository = {}
  end

  def process_dir(dir)
    # find all files with the extensions we support
    Dir[File.join(dir, "*.{#{extensions.join(',')}}")].each do |f|
      process_file(f)
    end
  end

  def process_file(file)
    digest = digest(file)
    name = @repository[digest]

    if name
      target = duplicate_dir
      # ...
    else
      target = main_dir
      # ...
    end

    FileUtils.cp( file, File.join( target, File.basename( file ) ) )
  end

  def digest(file)
    Digest::MD5.hexdigest( File.open(file, 'rb') {|io| io.read})
  end

  def self.load(file)
    File.open(file, 'rb') {|io| Marshal.load(io)}
  end

  def save(file)
    File.open(file, 'wb') {|io| Marshal.dump(self, io)}
  end
end

repo = begin
  Repository.load( REPO_FILE )
rescue Exception => e
  # not there => create
  r = Repository.new
  r.main_dir = "foo"
  r.duplicate_dir = "bar"
  r
end

ARGV.each {|dir| repo.process_dir(dir)}

repo.save( REPO_FILE )

The main point being here to encapsulate certain functionality into methods
of their own. This greatly increases readability and reusability.

Very informative indeed, if perhaps more than a bit humbling! Thank you
again.

One last question, then - while the style above is easily more readable
and quite... enjoyable, for lack of a better word, to read, how does
Ruby measure up when it comes to passing all those variables around to
functions (method calls) all the time? Do I lose significant
performance by having method calls in inner loops? And no, I can hear
it already; "Dude, you traverse big directories, do calculations on a
big number of big files and push the filesystem to its limits copying
them like there was no tomorrow already..." Obviously, it doesn't
matter here. But would it matter if one was writing, say, a port
listener or some other reasonably performance critical application?

Sven_Johansson · 1 January 2006 16:07

Tim Hammerquist wrote:

Just for our edification, would you run this following code on
those same files?

require 'digest/md5'

files = Dir['*'].select { |f| File.file?(f) }

files.each { |filename|
    fs_size = File.size(filename) # get size of file from OS

    data = File.read(filename) # read the file
    data_size = data.length # get the size of the data read

    hash = Digest::MD5.hexdigest(data) # calculate hash

    # compare amount of data on filesystem
    # with amount of data read
    puts "#{hash} - #{filename}: #{data_size}/#{fs_size}"
}

Sure. Here it is:

6ce4ad47bfa79b6c0e48636040c1dfb9 - 001.mp3: 52/50344
6ce4ad47bfa79b6c0e48636040c1dfb9 - 002.mp3: 52/52468
4cac5ea5e666942920aff937aa9b3ee5 - 0022-042.ogg: 335/141226
5947035093bbfa22a9e7cf6e69b82a4e - 0022-043.ogg: 335/118208
4cac5ea5e666942920aff937aa9b3ee5 - 0022-044.ogg: 335/178869
4cac5ea5e666942920aff937aa9b3ee5 - 0022-045.ogg: 335/181622
4cac5ea5e666942920aff937aa9b3ee5 - 0022-046.ogg: 335/154218
4cac5ea5e666942920aff937aa9b3ee5 - 0022-047.ogg: 335/161483
4cac5ea5e666942920aff937aa9b3ee5 - 0022-048.oog: 335/147162
4cac5ea5e666942920aff937aa9b3ee5 - 0022-049.ogg: 335/145142
5947035093bbfa22a9e7cf6e69b82a4e - 0022-050.ogg: 335/149968
4cac5ea5e666942920aff937aa9b3ee5 - 0022-057.ogg: 335/161358
4cac5ea5e666942920aff937aa9b3ee5 - 0022-058.ogg: 335/156026
4cac5ea5e666942920aff937aa9b3ee5 - 0022-059.ogg: 335/176575
a7d6f03e275d69b363b9771c9d88e681 - 0022-061.ogg: 335/148704
4cac5ea5e666942920aff937aa9b3ee5 - 0022-062.ogg: 335/186715
4cac5ea5e666942920aff937aa9b3ee5 - 0022-069.ogg: 335/173036
4cac5ea5e666942920aff937aa9b3ee5 - 0022-070.ogg: 335/173752
4cac5ea5e666942920aff937aa9b3ee5 - 0022-071.ogg: 335/173581
[snip]

Which... hmm... does this mean that File.read(filename) will only read
as far as the first perceived end of line in the binary file? Here I
thought that would slurp up the entire file no matter what, even if it
played havoc with the "lines" of the file. Given that it seems to read
as much per file for each file type, it would seem it just reads and
hashes the file header before it encounters something that it considers
to be an end of line. But then again, shouldn't all the hashes be
identical for the same header - if they are not, you'd think it'd read
somewhat more or less of the file?

Robert · 31 December 2005 12:32

<snip/>

One last question, then - while the style above is easily more
readable and quite... enjoyable, for lack of a better word, to read,
how does Ruby measure up when it comes to passing all those variables
around to functions (method calls) all the time? Do I lose significant
performance by having method calls in inner loops? And no, I can hear
it already; "Dude, you traverse big directories, do calculations on a
big number of big files and push the filesystem to its limits copying
them like there was no tomorrow already..." Obviously, it doesn't
matter here. But would it matter if one was writing, say, a port
listener or some other reasonably performance critical application?

There are two ways to answer this: reasoning and testing. You'll get the definitive answer only by measuring the performance of a real application. On the theoretical side we can state this: first, Ruby uses call by value, but the values are object references (i.e. objects are not copied as they are with call by value in C++; caller and method each hold their own reference, so assignment to a parameter does not affect the calling environment). This has rather low overhead compared to a real call by value where objects must be copied. Second, every method call has a certain overhead attached to it (unless a runtime system such as the Java VM inlines it at run time).
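
A small sketch of what "call by value where the values are object references" means in practice; the method names reassign and mutate are made up for the illustration.

```ruby
# Assigning to a parameter only rebinds the method's own reference.
def reassign(s)
  s = "replaced"
  s
end

# Calling a mutating method changes the one object both references point to.
def mutate(s)
  s << " world"
end

greeting = String.new("hello")

reassign(greeting)
puts greeting   # prints "hello": the caller's reference is untouched

mutate(greeting)
puts greeting   # prints "hello world": the shared object was modified
```

No string is copied in either call; only a reference is handed over, which is why passing large objects to methods is cheap.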

A simple test shows that there is indeed significant overhead attached to method invocations - if methods perform simple tasks. The relative overhead of course depends on the work the method performs. I for my part would always start with a modularized version and only inline methods if this is actually a cure for a performance problem. There is a famous quote about premature optimization...

#! /usr/bin/env ruby

require 'benchmark'

REP = 1_000_000

def foo(n) 0 + n end

Benchmark.bmbm(10) do |bm|
  bm.report("direct") do
    REP.times { x = 0 + 1 }
  end

  bm.report("method") do
    REP.times { x = foo(1) }
  end
end

Rehearsal ---------------------------------------------
direct      1.188000   0.000000   1.188000 (  1.201000)
method      2.156000   0.000000   2.156000 (  2.166000)
------------------------------------ total: 3.344000sec

                user     system      total        real
direct      1.187000   0.000000   1.187000 (  1.217000)
method      2.172000   0.000000   2.172000 (  2.234000)

$ ruby -e 'puts 2.172000 / 1.187000'
1.82982308340354

Happy new year!

robert

···

Sven Johansson <sven_u_johansson@spray.se> wrote:

Tim_Hammerquist · 1 January 2006 16:42

Tim Hammerquist wrote:
> Just for our edification, would you run this following code
> on those same files?
>
> require 'digest/md5'
>
> files = Dir['*'].select { |f| File.file?(f) }
>
> files.each { |filename|
> fs_size = File.size(filename) # get size of file from OS
>
> data = File.read(filename) # read the file
> data_size = data.length # get the size of the data read
>
> hash = Digest::MD5.hexdigest(data) # calculate hash
>
> # compare amount of data on filesystem
> # with amount of data read
> puts "#{hash} - #{filename}: #{data_size}/#{fs_size}"
> }
>

Sure. Here it is:

6ce4ad47bfa79b6c0e48636040c1dfb9 - 001.mp3: 52/50344
6ce4ad47bfa79b6c0e48636040c1dfb9 - 002.mp3: 52/52468
4cac5ea5e666942920aff937aa9b3ee5 - 0022-042.ogg: 335/141226
5947035093bbfa22a9e7cf6e69b82a4e - 0022-043.ogg: 335/118208
4cac5ea5e666942920aff937aa9b3ee5 - 0022-044.ogg: 335/178869
4cac5ea5e666942920aff937aa9b3ee5 - 0022-045.ogg: 335/181622
4cac5ea5e666942920aff937aa9b3ee5 - 0022-046.ogg: 335/154218
4cac5ea5e666942920aff937aa9b3ee5 - 0022-047.ogg: 335/161483
4cac5ea5e666942920aff937aa9b3ee5 - 0022-048.oog: 335/147162
4cac5ea5e666942920aff937aa9b3ee5 - 0022-049.ogg: 335/145142
5947035093bbfa22a9e7cf6e69b82a4e - 0022-050.ogg: 335/149968
4cac5ea5e666942920aff937aa9b3ee5 - 0022-057.ogg: 335/161358
4cac5ea5e666942920aff937aa9b3ee5 - 0022-058.ogg: 335/156026
4cac5ea5e666942920aff937aa9b3ee5 - 0022-059.ogg: 335/176575
a7d6f03e275d69b363b9771c9d88e681 - 0022-061.ogg: 335/148704
4cac5ea5e666942920aff937aa9b3ee5 - 0022-062.ogg: 335/186715
4cac5ea5e666942920aff937aa9b3ee5 - 0022-069.ogg: 335/173036
4cac5ea5e666942920aff937aa9b3ee5 - 0022-070.ogg: 335/173752
4cac5ea5e666942920aff937aa9b3ee5 - 0022-071.ogg: 335/173581
[snip]

Which... hmm... does this mean that File.read(filename) will
only read as far as the first perceived end of line in the
binary file? Here I thought that would slurp up the entire
file no matter what, even if it played havoc with the "lines"
of the file.

You were right. It read the whole file, right up until the EOF.
But in DOS/Windows text mode, the ASCII 26 character (^Z) is the
EOF marker.

Given that it seems to read as much per file for each file
type, it would seem it just reads and hashes the file header
before it encounters something that it considers to be an end
of line.

I'm not an mp3/ogg file format specialist, but it
looks like both your mp3 and ogg files contain that EOF marker
in their headers, and that the first several hundred bytes of
many of these ogg files are the same, hence the identical
hashes.

This is a prime example of why binary read mode is necessary on
a DOS/Win platform. If you add the 'b' flag to that File read
operation and re-run the script, you should see matching file
sizes and differing hashes. (I don't have a Windows box at the
moment.)
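
A sketch of that re-run with the 'b' flag added: the same size comparison as the diagnostic script, but reading in binary mode. The helper names read_binary and report are assumptions made for this example.

```ruby
require 'digest/md5'

# Read a whole file in binary mode: no newline translation and no
# special treatment of ASCII 26 (^Z) on DOS/Windows.
def read_binary(path)
  File.open(path, 'rb') { |io| io.read }
end

# Hypothetical helper: one report line in the same layout as the
# diagnostic script, "hash - name: bytes read/bytes on disk".
def report(path)
  data = read_binary(path)
  "#{Digest::MD5.hexdigest(data)} - #{path}: #{data.length}/#{File.size(path)}"
end

# Usage: ruby sizes.rb 001.mp3 002.mp3 ...
ARGV.each { |f| puts report(f) if File.file?(f) }
```

With binary mode the two numbers on each line should agree, since every byte on disk reaches the digest.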

Cheers!
Tim Hammerquist

···

Sven Johansson <sven_u_johansson@spray.se> wrote:

Sven_Johansson · 1 January 2006 16:12

Robert Klemme wrote:

A simple test shows that there is indeed significant overhead attached to
method invocations - if methods perform simple tasks. The relative overhead
of course depends on the work the method performs. I for my part would
always start with a modularized version and only inline methods if this is
actually a cure for a performance problem. There is a famous quote about
premature optimization...

Heh. But how can we know if it is premature unless we have an idea about
how inefficient method calls are? Nevertheless, your point is well
taken.

#! /usr/bin/env ruby

require 'benchmark'

REP = 1_000_000

def foo(n) 0 + n end

Benchmark.bmbm(10) do |bm|
  bm.report("direct") do
    REP.times { x = 0 + 1 }
  end

  bm.report("method") do
    REP.times { x = foo(1) }
  end
end

Rehearsal ---------------------------------------------
direct      1.188000   0.000000   1.188000 (  1.201000)
method      2.156000   0.000000   2.156000 (  2.166000)
------------------------------------ total: 3.344000sec

                user     system      total        real
direct      1.187000   0.000000   1.187000 (  1.217000)
method      2.172000   0.000000   2.172000 (  2.234000)

$ ruby -e 'puts 2.172000 / 1.187000'
1.82982308340354

Ah, interesting. And a fine example of how to use the internal
benchmarking support for us Ruby newbies. More to the point, it shows
that it isn't so bad - I was thinking of order-of-magnitude losses, and
here is merely a factor two or so, and that's for essentially empty
method bodies... it will do.

Happy new year!

To you as well!
/Sven

Ryan_Davis2 · 2 January 2006 09:12

No, you discovered the difference between 1 method invocation (Fixnum.+) and 2 (Kernel.foo and Fixnum.+). If you are worried about times, I'd look at using a good profiler instead of the benchmarks so you can get insight on where your time is actually being spent (it sure isn't on Fixnum.+). Don't use the standard profiler, use zenprofiler or shugo's profiler.

···

On Jan 1, 2006, at 8:12 AM, Sven Johansson wrote:

require 'benchmark'

REP = 1_000_000

def foo(n) 0 + n end

Benchmark.bmbm(10) do |bm|
  bm.report("direct") do
    REP.times { x = 0 + 1 }
  end

  bm.report("method") do
    REP.times { x = foo(1) }
  end
end

Ah, interesting. And a fine example of how to use the internal
benchmarking support for us Ruby newbies. More to the point, it shows
that it isn't so bad - I was thinking of order-of-magnitude losses, and
here is merely a factor two or so, and that's for essentially empty
method bodies... it will do.

--
ryand-ruby@zenspider.com - http://blog.zenspider.com/
http://rubyforge.org/projects/ruby2c/
http://rubyforge.org/projects/rubyinline/

Ryan_Davis1 · 2 January 2006 09:14

No, you discovered the difference between 1 method invocation (Fixnum.+) and 2 (Kernel.foo and Fixnum.+). "0 + 1" is a method invocation just like any other. You can see that by using ParseTree's parse_tree_show utility:

[:call, [:lit, 0], :+, [:array, [:lit, 1]]]

If you are worried about times, I'd look at using a good profiler instead of the benchmarks so you can get insight on where your time is actually being spent (it sure isn't on Fixnum.+). Don't use the standard profiler, use zenprofiler or shugo's profiler.
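
The point that "0 + 1" is an ordinary method invocation can also be checked without ParseTree, using nothing but core Ruby. A minimal sketch:

```ruby
# The operator form is syntactic sugar for sending :+ to the receiver.
with_operator = 0 + 1
with_dot_call = 0.+(1)
with_send     = 0.send(:+, 1)

puts [with_operator, with_dot_call, with_send].inspect   # prints [1, 1, 1]

# Integers answer to :+ like any other method.
puts 0.respond_to?(:+)   # prints true
```

So the "direct" case in the benchmark already pays for one method dispatch, and the "method" case pays for two.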

···

On Jan 1, 2006, at 8:12 AM, Sven Johansson wrote:

require 'benchmark'

REP = 1_000_000

def foo(n) 0 + n end

Benchmark.bmbm(10) do |bm|
  bm.report("direct") do
    REP.times { x = 0 + 1 }
  end

  bm.report("method") do
    REP.times { x = foo(1) }
  end
end

Ah, interesting. And a fine example of how to use the internal
benchmarking support for us Ruby newbies. More to the point, it shows
that it isn't so bad - I was thinking of order-of-magnitude losses, and
here is merely a factor two or so, and that's for essentially empty
method bodies... it will do.

--
ryand-ruby@zenspider.com - http://blog.zenspider.com/
http://rubyforge.org/projects/ruby2c/
http://rubyforge.org/projects/rubyinline/
