Counting the files in a directory

I'm writing some scripts to help manage a mail scanner used at my
work. Being a mail scanner, it's got huuuuUUUge quarantine
directories.

Now, I know I can do something along the lines of:

Dir.open("/foo").collect.length-2 # if you're wondering, the -2 is to ignore . and ..

to get a count of what's in a directory, but the problem there is,
it's rather slow when you run that in a directory with a few thousand
files on a server under a severe (4.5>average_load>2) load.

After perusing the Dir, Find and Stat classes, I haven't seen a better way.
I thought that perhaps there was some sort of system call, at least in
Real OSes™ (Linux, *BSD, Unix, etc), that would return the number of
files inside of a directory. Something that would hopefully return in
a 1/4th or 1/8th a second, rather than in 4 or 8 (or 20...) seconds.

Any clues?

Thanks,
         Kyle

I'm writing some scripts to help manage a mail scanner used at my
work. Being a mail scanner, it's got huuuuUUUge quarantine
directories.

Now, I know I can do something along the lines of:

Dir.open("/foo").collect.length-2 # if you're wondering, the -2 is to ignore . and ..

You could as well do

count = Dir.entries("/foo").size - 2

to get a count of what's in a directory, but the problem there is,
it's rather slow when you run that in a directory with a few thousand
files on a server under a severe (4.5>average_load>2) load.

After perusing the Dir, Find and Stat classes, I haven't seen a better way.
I thought that perhaps there was some sort of system call, at least in
Real OSes™ (Linux, *BSD, Unix, etc), that would return the number of
files inside of a directory. Something that would hopefully return in
a 1/4th or 1/8th a second, rather than in 4 or 8 (or 20...) seconds.

Any clues?

Most of the time will be spent on IO, and I guess that cannot be changed. You could, however, do some form of caching: read the size and the last modification date of each directory you are interested in and store them in a Hash (and write that to disk via Marshal between invocations if your process terminates in between). Then you only need to check whether the modification date has changed, and read the directory only if it has. The disadvantage is that you need one more IO operation, albeit one that pulls just a single block, so it might pay off.
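A minimal sketch of that caching idea, in case it helps. The class name `DirCountCache` and the cache file path are made up for illustration. It assumes the directory's mtime changes whenever an entry is added or removed, which holds on typical Unix filesystems but is limited by the filesystem's timestamp granularity.

```ruby
# Hypothetical helper (not from the thread): caches entry counts per
# directory, keyed by the directory's modification time.
class DirCountCache
  def initialize(cache_file)
    @cache_file = cache_file
    # Load the cache left by a previous invocation, if there is one.
    @cache = File.exist?(cache_file) ? Marshal.load(File.binread(cache_file)) : {}
  end

  # Returns the number of entries in dir, excluding "." and "..".
  # The directory is re-read only when its mtime differs from the cached one.
  def count(dir)
    mtime = File.stat(dir).mtime
    cached = @cache[dir]
    return cached[:count] if cached && cached[:mtime] == mtime

    n = Dir.entries(dir).size - 2
    @cache[dir] = { :mtime => mtime, :count => n }
    n
  end

  # Writes the cache to disk so the next invocation can reuse it.
  def save
    File.binwrite(@cache_file, Marshal.dump(@cache))
  end
end
```

Usage would be something like `cache = DirCountCache.new("/tmp/dircounts.bin")`, then `cache.count("/foo")` and `cache.save` before exiting. The first call for each directory still pays the full read.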

Kind regards

  robert

On 11.01.2008 16:19, Kyle Schmitt wrote:

Kyle Schmitt wrote:
(...)

I thought that perhaps there was some sort of system call, at least in
Real OSes™ (Linux, *BSD, Unix, etc), that would return the number of
files inside of a directory. Something that would hopefully return in
a 1/4th or 1/8th a second, rather than in 4 or 8 (or 20...) seconds.

Any clues?

Thanks,
         Kyle

On Windows there is such a call. With Ruby you have to take a bit of a
(known) detour to get there:

require 'win32ole'
fso = WIN32OLE.new("Scripting.FileSystemObject")
folder = fso.GetFolder("C:/WINDOWS/system32")
puts folder.Files.count

regards,

Siep

--
Posted via http://www.ruby-forum.com/\.

Entries seems to be fairly identical to collect, and it does look nicer...
but yeah, still slow.

The problem with caching is that we only keep quarantine directories
around for 10 days, due to their size and the relative rarity of us
needing to pull something out of it. One reason for writing this as a
script is that we recover rarely enough that whoever is doing it
forgot how to recover. Still, it's often enough that we want to be
able to do it easily.

In other cases than a mail system, caching would be a very good idea though.

I'll try to read more of the C stuff for handling files/directories
in unix. I can hold out hope for a while.

Thanks,
          Kyle

regards,

Siep

Should be read as:

require 'win32ole'
fso = WIN32OLE.new("Scripting.FileSystemObject")
folder = fso.GetFolder("C:/WINDOWS/system32")
puts folder.Files.count

Oh, that's quite an interesting way to work with files and folders :)

-Thufir

On Sat, 12 Jan 2008 08:54:42 +0900, Siep Korteling wrote:

require 'win32ole'
fso = WIN32OLE.new("Scripting.FileSystemObject")
folder = fso.GetFolder("C:/WINDOWS/system32")
puts folder.Files.count

Kyle Schmitt wrote:

Entries seems to be fairly identical to collect, and it does look
nicer...
but yea still slow.

The problem with caching is that we only keep quarantine directories
around for 10 days, due to their size and the relative rarity of us
needing to pull something out of it. One reason for writing this as a
script is that we recover rarely enough that whoever is doing it
forgot how to recover. Still, it's often enough that we want to be
able to do it easily.

If there's a large number of files in these directories that's probably
the source of the slowness, not the method used to get the list of
entries.

Many filesystems (some less than others) don't behave as well when you
get a "large" number of files in one directory. I think the rule of
thumb I've used for ext2 filesystems is you'll start to notice a delay
when you get a few hundred entries, and you'll start to feel it when you
have thousands.

One way around this (short of installing / upgrading to a new underlying
filesystem that handles these cases better (xfs, for example)) is to
split files out into a directory tree based either on the filename
directly or a hash made from the real filename (say an MD5 hex string of
the filename and you make two levels based on the first 4 hex digits,
00/00, 00/01, ..., ff/fe, ff/ff; 00/00 contains all files for which the
hashed filename begins "0000...", etc.). The downside of this is that
you either have to walk the entire tree to see the contents, or keep an
external index of the contents (which would eliminate your needing to do
what you're trying to do and the justification for splitting things up,
but . . . :).
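A rough sketch of the hashed layout described above, using the MD5 example with two levels taken from the first 4 hex digits. The helper name `bucketed_path` and the root directory are made up for illustration.

```ruby
require 'digest/md5'

# Hypothetical helper: maps a filename to a two-level bucket path,
# root/xx/yy/filename, where xxyy are the first 4 hex digits of the
# MD5 of the filename.
def bucketed_path(root, filename)
  hex = Digest::MD5.hexdigest(filename)
  File.join(root, hex[0, 2], hex[2, 2], filename)
end
```

With this, `bucketed_path("/quarantine", name)` always yields the same location for the same name, so a lookup by name needs no directory scan. Listing or counting everything still means walking up to 65536 small directories.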

Entries seems to be fairly identical to collect, and it does look nicer...
but yea still slow.

As I said: it's the IO for crowded directories (see also Mike's reply).

The problem with caching is that we only keep quarantine directories
around for 10 days, due to their size and the relative rarity of us
needing to pull something out of it. One reason for writing this as a
script is that we recover rarely enough that whoever is doing it
forgot how to recover. Still, it's often enough that we want to be
able to do it easily.

In other cases than a mail system, caching would be a very good idea though.

I am not sure I understand why you think it is a bad idea. If you only cache the number of files per directory, where is the issue? Or is this script not invoked regularly? I am probably missing a bit of your use case.

I'll try and read more of the C stuff for handling files/directories
in unix. I can hold out hope for awhile.

That won't help. It's really the size of the directory. Maybe you could give a little more detail about your script and when it's used, so we can come up with better suggestions.

Cheers

  robert

On 11.01.2008 19:14, Kyle Schmitt wrote:

Did you verify that this is faster? I am skeptical because this call
does basically the same: it gets the list of files in the directory
and counts them. I would expect a speedup only if there was an API
function that would directly return the number of files.

Kind regards

robert

2008/1/12, Siep Korteling <s.korteling@gmail.com>:

> regards,
>
> Siep

Should be read as:

> require 'win32ole'
> fso = WIN32OLE.new("Scripting.FileSystemObject")
> folder = fso.GetFolder("C:/WINDOWS/system32")
> puts folder.Files.count

--
use.inject do |as, often| as.you_can - without end

Kyle Schmitt wrote:

I'll try and read more of the C stuff for handling files/directories
in unix. I can hold out hope for awhile.

Thanks,
          Kyle

You may have already gotten here....
What kind of times does this give? (The first run will include the initial
compilation time.)
You can modify it to meet your needs (if you have questions, just post
back); see man scandir.
You can set up a filter function to return counts for specific file
matches.
As is, it returns a count of all entries, visible and hidden (including
. and ..).

for rubyinline see:

https://rubyforge.org/projects/rubyinline

-----------snip dircount.rb--------------------------------
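require 'inline'

class DirCount
    inline do | builder |
        builder.include '<dirent.h>'
        builder.include '<stdio.h>'
        builder.include '<stdlib.h>'  # declares free()
        # count() returns the number of entries in the current directory
        # (including . and ..), or -1 if scandir fails.
        builder.c "
            int count() {
                struct dirent **namelist;
                int n;
                int count;

                /* no filter and no compare function: every entry is returned, unsorted */
                count = n = scandir(\".\", &namelist, 0, 0);
                if (n < 0)
                    perror(\"scandir\");
                else {
                    /* scandir allocates each entry and the list itself */
                    while(n--) {
                        /* printf(\"%s\\n\", namelist[n]->d_name); */
                        free(namelist[n]);
                    }
                    free(namelist);
                }

                return (count);
            }"
    end
end

dc = DirCount.new()
puts dc.count()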
-----------snip--------------------
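For comparison, the filter idea can also be sketched in plain Ruby with Dir.foreach, which yields names one at a time instead of building an array first. The helper name `count_entries` is made up for illustration, and I have not measured it. The directory still has to be read in full, so it may not be noticeably faster.

```ruby
# Hypothetical helper: counts the entries of dir without building an
# array of names. "." and ".." are skipped. An optional block acts as
# a filter, playing the role of scandir's filter function.
def count_entries(dir)
  n = 0
  Dir.foreach(dir) do |name|
    next if name == "." || name == ".."
    n += 1 if !block_given? || yield(name)
  end
  n
end
```

Called as `count_entries("/foo")` it counts everything, hidden files included. With a block, such as `count_entries("/foo") { |name| name.end_with?(".txt") }`, it counts only the matching names.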

--
View this message in context: http://www.nabble.com/Counting-the-files-in-a-directory…-tp14758608p14841411.html
Sent from the ruby-talk mailing list archive at Nabble.com.

Thufir wrote:

On Sat, 12 Jan 2008 08:54:42 +0900, Siep Korteling wrote:

require 'win32ole'
fso = WIN32OLE.new("Scripting.FileSystemObject")
folder = fso.GetFolder("C:/WINDOWS/system32")
puts folder.Files.count

Oh, that's quite an interesting way to work with files and folders :)

-Thufir

Yes, indeed, in fact it's quite efficient according to the benchmark
tests above!
By the way, is there a link or documentation listing the
classes/methods that can be used, like GetFolder, GetFile, etc.? So far I
only know about these 2 methods.

Robert Klemme wrote:

> puts folder.Files.count

Did you verify that this is faster? I am skeptical because this call
does basically the same: it gets the list of files in the directory
and counts them. I would expect a speedup only if there was an API
function that would directly return the number of files.

Kind regards

robert

require 'win32ole'
require 'benchmark'

ldirname = "C:/WINDOWS/system32"               # 2500+ files
mdirname = "C:/ruby/lib/ruby/1.8"              # 800+ files
sdirname = "C:/ruby/lib/ruby/1.8/i386-mswin32" # 40+ files
@fso = WIN32OLE.new("Scripting.FileSystemObject")
n = 500

# Count via the FileSystemObject; Files holds files only, not subfolders.
def fso(dirname)
  folder = @fso.GetFolder(dirname)
  folder.Files.count
end

# Count via Dir.entries, ignoring "." and "..".
def dir(dirname)
  Dir.entries(dirname).size - 2
end

Benchmark.bmbm do |x|
  x.report("fso_mixed |"){n.times{fso(ldirname);fso(mdirname);fso(sdirname)}}
  x.report("Dir_mixed |"){n.times{dir(ldirname);dir(mdirname);dir(sdirname)}}
  x.report("fso_2500 |"){n.times{fso(ldirname)}}
  x.report("Dir_2500 |"){n.times{dir(ldirname)}}
  x.report("fso_800 |"){n.times{fso(mdirname)}}
  x.report("Dir_800 |"){n.times{dir(mdirname)}}
  x.report("fso_40 |"){n.times{fso(sdirname)}}
  x.report("Dir_40 |"){n.times{dir(sdirname)}}
end

results in:

                 user     system      total        real
fso_mixed |  0.360000   1.222000   1.582000 (  1.673000)
Dir_mixed |  3.635000   1.382000   5.017000 (  5.157000)
fso_2500 |   0.271000   1.071000   1.342000 (  1.362000)
Dir_2500 |   3.305000   1.282000   4.587000 (  4.697000)
fso_800 |    0.040000   0.120000   0.160000 (  0.160000)
Dir_800 |    0.170000   0.100000   0.270000 (  0.281000)
fso_40 |     0.020000   0.080000   0.100000 (  0.110000)
Dir_40 |     0.050000   0.071000   0.121000 (  0.120000)

Apparently, for small directories it doesn't make much difference. For
large directories it does. (filesystem ntfs).

Siep

2008/1/12, Siep Korteling <s.korteling@gmail.com>:

Robert,
          The script itself won't be run as routinely as the
directories are rotated. The directories have a daily rotation so
there are only the most recent 10 days available at once, but the
script itself may only be invoked once or twice in a month, at most.

I understand that the size of the directory itself is a problem, but I
was hoping that somehow there was a way to get a simple, more
efficient count. I know the b-tree based file systems are somewhat
new in unix & unix-like systems; I was just hoping there was some more
efficient way :)

The script itself (as it stands now, albeit slower than I would have
liked) does the following:
With no arguments, lists the number of quarantined and spam messages
being held, for each day.
With a date, lists the file names of the quarantined messages, as well
as their recipients.
With a date and the file name of a quarantined message, warns the
user, asks them if they want to continue, then moves the message back
into the appropriate queue to be delivered.
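For the no-arguments mode, a per-day count could be sketched like this. The layout (one subdirectory per day under a quarantine root, one file per held message) and the name `counts_per_day` are assumptions for illustration, not the real setup. Dir.children needs Ruby 2.5 or later.

```ruby
# Assumed layout (hypothetical): root/<day>/ holds one file per held
# message, with one subdirectory per day.
# Returns a Hash of day name => number of entries in that day's directory.
def counts_per_day(root)
  counts = {}
  Dir.children(root).sort.each do |day|
    path = File.join(root, day)
    # Skip anything under root that is not a directory.
    counts[day] = Dir.children(path).size if File.directory?(path)
  end
  counts
end
```

Printing it is then a matter of `counts_per_day(root).each { |day, n| puts "#{day}: #{n}" }`.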

Thanks

--Kyle

Mike,
        I've been an advocate of using the right file system for the
job for ages now, but the sad matter is, this is running on a rather
old version of RedHat, which doesn't support anything real other than
ext2 & 3. As for our possible upgrade paths to this box, it would
still be RedHat, or a clone (CentOS). From what I can see, they still
don't support modern file systems by default. Admittedly I'm tempted
to add the support myself (it's not hard), but then it'll bring up the
"it's a production system" argument here.

*sigh*
--Kyle

On Jan 11, 2008 1:06 PM, Mike Fletcher <lemurific+rforum@gmail.com> wrote:

Amazing. So there is probably some room for improvement of the
Windows build of Ruby. :)

Btw, you did not do the subtraction of two - does GetFolder not return
"." and ".."?

Kind regards

robert

2008/1/14, Siep Korteling <s.korteling@gmail.com>:

Robert Klemme wrote:

Btw, you did not do the subtraction of two - does GetFolder not return
"." and ".."?

Kind regards

robert

require 'win32ole'

dir = WIN32OLE.new("Scripting.FileSystemObject").GetFolder("C:/ruby/")
dir.Files.each{|file| puts file.name}
# no dots here
puts

dir.SubFolders.each{|subdir| puts subdir.name}
# still no dots

I don't know where the dots have gone, but I don't miss them in this
context.

regards,

Siep

Siep,
       I've had a bit of experience with the Win32OLE objects in ruby
before (at my last job). You're right, they are a detour, though
sometimes win32ole may feel more like a byway where your car breaks
down and the only place for you to stay that night is the Bates
Motel....

Humm. There must be some way..

--Kyle

On Jan 14, 2008 10:16 AM, Siep Korteling <s.korteling@gmail.com> wrote:

You sure mean the "Gates Motel", don't you? :)

  robert

On 14.01.2008 18:12, Kyle Schmitt wrote:
