Optimizing Hash deletions

I have a recursive file-compare program that I run on 2 directories that
each contain about 10,000 files, 99% of which don't change. I read all the
filenames into hashes and then (first) delete all the matches, using this
code:

def RemoveIdenticalDirs ( old_fe_list , new_fe_list )
    # this function is necessary because we have to do a case-insensitive
    # match for directories, but not for files
    $stderr.puts "Detecting identical dirs..."
    old_fe_list.keys.each do | oldfile |
        # can't use .has_key? because we have to do a case-insensitive match
        new_fe_list.keys.each do | newfile |
            if old_fe_list[oldfile].is_dir and new_fe_list[newfile].is_dir and
                    oldfile.downcase == newfile.downcase
                old_fe_list.delete ( oldfile )
                new_fe_list.delete ( newfile )
                break
            end
        end
    end
end

def RemoveIdenticalFiles ( old_fe_list , new_fe_list , comparetype )
    $stderr.puts "Detecting identical files..."
    old_fe_list.keys.each do | file |
        if new_fe_list.has_key? ( file )
            if !(old_fe_list[file].is_dir) and !(new_fe_list[file].is_dir) and
                    FilesAreIdentical ( comparetype , old_fe_list[file] , new_fe_list[file] )
                old_fe_list.delete ( file )
                new_fe_list.delete ( file )
            end
        end
    end
end

Note that I've added an attribute called .is_dir to each hash entry, which
just holds a Boolean value.

When I run the program (under WinXP using Ruby 1.8.5, installer build 185-21,
if it matters), it
takes about 5-10 seconds to execute the above 2 functions. But that's not
the worst part - it chews up memory big time. The machine I'm running it on
has 768MB of RAM (with no swap file), and Windows gives me a warning that
it's out of memory as the above code runs. However, neither Windows nor the
program crashes. I'm guessing that all the .delete calls are causing a lot
of memory churn and the garbage collector starts running.

So my question is - is there a less memory-intensive way of doing the above?
Would setting elements to nil instead of deleting them make any difference?
Or maybe copying entries to a new hash somehow?

Mike Steiner


You've got an O(n**2) search in RemoveIdenticalDirs. How long
does RemoveIdenticalDirs take to execute, as compared to RemoveIdenticalFiles?

Also, what does FilesAreIdentical() do?

Regards,

Bill
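
One way to get rid of that O(n**2) scan is to index one side on the
downcased name first, so each directory costs a single hash lookup instead
of a pass over the whole other list. A rough sketch of that idea (the name
RemoveIdenticalDirsFast is made up; it assumes the same old_fe_list /
new_fe_list hashes and .is_dir attribute as in Mike's code, and it hasn't
been run against his data):

def RemoveIdenticalDirsFast(old_fe_list, new_fe_list)
  $stderr.puts "Detecting identical dirs..."
  # index the new list once: downcased directory name => original key
  new_dirs = {}
  new_fe_list.each do |name, entry|
    new_dirs[name.downcase] = name if entry.is_dir
  end
  old_fe_list.keys.each do |oldfile|
    next unless old_fe_list[oldfile].is_dir
    newfile = new_dirs[oldfile.downcase]
    if newfile
      old_fe_list.delete(oldfile)
      new_fe_list.delete(newfile)
      new_dirs.delete(oldfile.downcase)  # don't match the same new dir twice
    end
  end
end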

···

From: "Mike Steiner" <mikejaysteiner@gmail.com>

Dear Mike,

Why don't you first create an Array of the files that you want
to delete, and do the deletion after that?
You can also force a garbage-collection run ... maybe
like this:

class String
  def compare_dir(other_dir)
    list1 = Dir.entries(self)
    list2 = Dir.entries(other_dir)
    in_first_but_not_in_second_dir = list1 - list2
    # remove directories from the list; Dir.entries returns bare names,
    # so check against the full path under self
    in_first_but_not_in_second_dir.delete_if { |x| FileTest.directory?(File.join(self, x)) }
    return in_first_but_not_in_second_dir
  end
end

class Array
  def delete_files
    # note: names are deleted relative to the current working directory
    self.each { |x|
      File.delete(x)
      GC.start
    }
  end
end

I compared just the names of roughly 3500 files in two
directories in a fraction of a second like this,
but I haven't tried to erase them...
Best regards,

Axel

t=Time.now
p "/home/axel".compare_dir("/home/axel/ruby").length
p Time.now-t
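
Applied to Mike's hashes rather than to files on disk, that "collect
first, delete afterwards" idea might look like the sketch below. The
method name and structure are made up here; it assumes the same .is_dir
attribute and FilesAreIdentical helper from the original post:

def RemoveIdenticalFilesCollectFirst(old_fe_list, new_fe_list, comparetype)
  $stderr.puts "Detecting identical files..."
  doomed = []   # keys whose entries are identical files in both hashes
  old_fe_list.each do |file, entry|
    other = new_fe_list[file]
    next if other.nil? or entry.is_dir or other.is_dir
    doomed << file if FilesAreIdentical(comparetype, entry, other)
  end
  # do all the deletions after the scan, never while walking the hashes
  doomed.each do |file|
    old_fe_list.delete(file)
    new_fe_list.delete(file)
  end
end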

···

If the files mostly never change, there may be faster ways.

If your code does the changing, then you can log whenever that gets
done. This makes the list much shorter and no pruning is needed.

Alternatively, you could touch the files so that they all have the same
timestamp. Then, you could just assemble the list of files that have a
timestamp that is different and you would not need to do all that
pruning.
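
A sketch of that timestamp idea in Ruby (hypothetical helper names; it
uses File.utime rather than shelling out to touch, and note that it really
does rewrite the files' modification times):

require 'find'

# stamp every file under dir with the same reference time
def stamp_baseline(dir, reference_time)
  Find.find(dir) do |path|
    File.utime(reference_time, reference_time, path) if File.file?(path)
  end
end

# later: anything whose mtime no longer matches the baseline has changed
def changed_since_baseline(dir, reference_time)
  changed = []
  Find.find(dir) do |path|
    changed << path if File.file?(path) and File.mtime(path) != reference_time
  end
  changed
end

The obvious caveat is that it overwrites whatever timestamps the other
programs set, which may or may not be acceptable.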

···


FilesAreIdentical() either compares the files based on size/date, or on
contents (depending on the comparetype parameter). I'm currently running it
with size/date (and it's still slow and hogs memory).

Mike Steiner
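
Mike's actual FilesAreIdentical isn't shown anywhere in the thread. A
size/date variant along the lines he describes might look roughly like
the guess below; the comparetype value and the .path attribute on each
entry are assumptions, not something from his code:

# a guess at the shape of such a helper -- not Mike's code
def FilesAreIdentical(comparetype, old_entry, new_entry)
  if comparetype == :size_and_date
    File.size(old_entry.path) == File.size(new_entry.path) &&
      File.mtime(old_entry.path) == File.mtime(new_entry.path)
  else
    # full content comparison
    File.read(old_entry.path) == File.read(new_entry.path)
  end
end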

···


RemoveIdenticalDirs takes about 2-3 times as long to run as
RemoveIdenticalFiles.

FilesAreIdentical compares 2 files based on size/date and returns
true/false.

Mike Steiner
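
For anyone who wants to reproduce those timings, the standard Benchmark
library can time the two passes separately. A sketch, assuming the hashes
and comparetype are already loaded as in the original program; the calls
mutate the hashes, so the order below matches the real run:

require 'benchmark'

Benchmark.bm(22) do |bm|
  bm.report("RemoveIdenticalDirs")  { RemoveIdenticalDirs(old_fe_list, new_fe_list) }
  bm.report("RemoveIdenticalFiles") { RemoveIdenticalFiles(old_fe_list, new_fe_list, comparetype) }
end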

···


The files are changed by other programs, so I can't log those.

The speed really isn't that much of an issue, since this program isn't run
too often. It's mostly the memory usage (I estimate about 600MB peak memory
used to compare 10,000 files).

Mike Steiner
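
Coming back to the "copying entries to a new hash" idea from the original
post: instead of issuing roughly 20,000 delete calls, the surviving
entries can be collected into fresh hashes with Hash#reject. A sketch
(hypothetical method name, same assumptions about .is_dir and
FilesAreIdentical as above); whether it actually lowers the peak memory
is something only a profiling run on the real data can confirm:

def SelectDifferingFiles(old_fe_list, new_fe_list, comparetype)
  identical = {}   # keys that name identical files in both trees
  old_fe_list.each do |file, entry|
    other = new_fe_list[file]
    next if other.nil? or entry.is_dir or other.is_dir
    identical[file] = true if FilesAreIdentical(comparetype, entry, other)
  end
  # build new hashes containing only the entries worth keeping
  trimmed_old = old_fe_list.reject { |file, entry| identical[file] }
  trimmed_new = new_fe_list.reject { |file, entry| identical[file] }
  return trimmed_old, trimmed_new
end

old_fe_list, new_fe_list = SelectDifferingFiles(old_fe_list, new_fe_list, comparetype)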
