Are Unix tools just slow?

A while back I was asking for help with a unixy way to search the mounted
volumes on my OSX machine for executable files. With help from you all, I
came up with a system call that worked but it was taking over two minutes.

I decided that I couldn’t live with that so I rewrote the routine in fairly
high level C (no parameter block stuff – all done with FSIterators).

My first working version, debug code and all, runs in about 0.2 seconds.

On this old box,
sudo find / > /dev/null
Takes 61 seconds so I don’t have any hope of avoiding the C code.
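For concreteness, the straightforward portable version of that search looks
something like this in Ruby (a minimal sketch, not the FSIterator code; it
pays the same stat-per-entry cost that find does, so it is just as slow on a
cold cache):

  # Naive portable walk: visit everything under / and report the regular
  # files the current user can execute.  Every entry costs at least one
  # stat(), which is where the time goes when nothing is cached.
  require 'find'

  Find.find('/') do |path|
    puts path if File.file?(path) && File.executable?(path)
  end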

It got me wondering, though. Are the unix file inquiry tools just really
slow on Mac OSX or are they slow everywhere? Did Unix sacrifice efficiency
for portability?

···


Many individuals have, like uncut diamonds, shining qualities beneath a
rough exterior. - Juvenal, poet (c. 60-140)

On this old box,
sudo find / > /dev/null
Takes 61 seconds so I don’t have any hope of avoiding the C code.

How long would you expect an enumeration of the names of every file and
directory visible on the system to take?

(Note that the fact that you are redirecting stdout to /dev/null in your
shell session is not apparent to find(1) so it can’t do the obvious
optimization.)

It got me wondering, though. Are the unix file inquiry tools just really
slow on Mac OSX or are they slow everywhere?

In my experience (I’m not an OS/X user) the unix command-line tools are
not slow, though that statement presupposes knowing the right tool for
the task at hand [0].

Did Unix sacrifice efficiency for portability?

Not to my knowledge.

[0] and knowing when to craft a specialized tool.

Rick

···


http://www.rickbradley.com                        MUPRN: 248 (92F/97F)
                      | Why don't bunnies make
  random email haiku  | noise when they make love? Because
                      | they have cotton balls.

Chris Gehlker wrote:

A while back I was asking for help with a unixy way to search the mounted
volumes on my OSX machine for executable files. With help from you all, I
came up with a system call that worked but it was taking over two minutes.

To find all the files on the system means to walk the entire directory
tree, starting at ‘root’. There’s a lot going on behind the scene.
It’s just
how UNIX has always been. It’s not DOS, CPM, AmigaDOS, Atari DOS, OS/2,
VMS, RT-11, RSTS, ... :-)

Note that there’s no user access to the raw file-system information.
If a normal user can read the hard-drive, then they can bypass
all file permissions. Can’t have that, now can we?
(Of course, we can all think up twenty ways to fix it! :-)

My first working version, debug code and all, runs in about 0.2 seconds.

Was that on a freshly booted system?
If you tried it after running a regular ‘find’,
all the information might have been in the buffer cache.

The first time I do a “find /”, it does take a little while.
When I repeat it, it takes about one second
to generate about 110K lines, about 5 MB.
It did the same amount of work walking the directory tree,
but all the directories and disk blocks were in the buffer cache.
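The effect is easy to demonstrate from Ruby as well; a minimal sketch (using
/usr as an arbitrary starting point) that walks the same tree twice and times
each pass:

  # The second pass is served largely from the buffer cache, so the
  # difference in wall-clock time between the passes is the cache effect.
  require 'find'
  require 'benchmark'

  2.times do |pass|
    seconds = Benchmark.realtime { Find.find('/usr') { |path| path } }
    printf "pass %d: %.2f seconds\n", pass + 1, seconds
  end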

Are the unix file inquiry tools just really
slow on Mac OSX or are they slow everywhere?

Specifically, you mean the time to search the entire file-system
structure, enumerating all path names?

Did Unix sacrifice efficiency for portability?

Well, there’s always that trade off, isn’t there?
How often does the typical administrator need a complete
list of files on the system? Both official and user-made?
“find /” isn’t a high-priority event in the life of a system.
My answer to your question: No, I don’t think so.

Even more off thread: old V7 UNIX had some commands called
‘icheck’ and ‘ncheck’. They could do cool things like convert
an inode number to a path name, and generate a list of file-names
and inum’s directly from the file-system image on the disk.
Needed to be root, since it read the disk directly.
Nowadays, with ext2, ufs, fat, Reiser, and other file-systems,
you’d need lots of different versions of those commands.

···


Mike Hall
http://www.enteract.com/~mghall

A while back I was asking for help with a unixy way to search the mounted
volumes on my OSX machine for executable files. With help from you all,
I came up with a system call that worked but it was taking over two
minutes.

What system call?

I decided that I couldn’t live with that so I rewrote the routine in
fairly high level C (no parameter block stuff – all done with
FSIterators).

My first working version, debug code and all, runs in about 0.2 seconds.

I think we’re talking about apples and oranges here. Your code walks the
entire filesystem tree in two tenths of a second? I think not. Maybe you
did a find or somesuch first that cached the filesystem information in
memory, and then you called your C code? I’d like to know what the timing
results for your C code are on a freshly rebooted system.

On this old box,
sudo find / > /dev/null
Takes 61 seconds so I don’t have any hope of avoiding the C code.

Again, if you do a ‘find /’ once, it is going to take a long time, but do
it again, on a machine with sufficient memory, it is going to be pretty
fast because the info is buffered in memory.

It got me wondering, though. Are the unix file inquiry tools just really
slow on Mac OSX or are they slow everywhere? Did Unix sacrifice
efficiency for portability?

No. I must admit that I haven't used OS X, but doing a find over entire
filesystems of similar size on a Windows machine and on a Unix box gives
similar timing results. Actually, my informal testing shows the Unix box
to be faster, but this is by no means a scientifically arrived-at result,
just a shooting-from-the-hip test.

I guess my real contribution to this thread: I recently saw a post about
Ruby FAM. I thought it would be cool to write a tool that kept the
filesystem information in an RDBMS, like PostgreSQL, using FAM to keep it
updated, and also giving the ability to update the actual filesystem when
the database is updated. It would be really fast and easy to use with a
lot more features when compared to slocate. I’ve started to give it a
shot, but what do you all think?
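Not the FAM/PostgreSQL design itself, but here is a stdlib-only sketch of the
underlying idea: pay for one full traversal, keep the result in an on-disk
index, and answer later queries from the index alone. PStore stands in for the
RDBMS, a plain rescan stands in for the FAM notifications that would keep it
fresh, and the paths and names are just examples.

  # One expensive traversal builds an on-disk index; queries then read only
  # the index.  A real version would apply FAM events instead of rescanning.
  require 'find'
  require 'pstore'

  INDEX = PStore.new('file_index.pstore')

  def refresh_index(root)
    entries = {}
    Find.find(root) do |path|
      entries[path] = (File.mtime(path) rescue nil)
    end
    INDEX.transaction { INDEX[:files] = entries }
  end

  def lookup(pattern)
    INDEX.transaction(true) { INDEX[:files].keys.grep(pattern) }
  end

  refresh_index('/usr/local')      # slow, done once (or from cron)
  puts lookup(/bin\/ruby\z/)       # fast, touches only the index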

···

On Sat, 15 Jun 2002 07:14:38 +0900 Chris Gehlker gehlker@fastq.com wrote:


“Daniel P. Zepeda” <daniel@z,e,p,e,d,a,-,z,o,n,e.net>
(Remove commas for address)

Mike Hall wrote:

Even more off thread: old V7 UNIX had some commands called
‘icheck’ and ‘ncheck’.

Just for fun:

···

PDP-11/70 Connected…booting…
@boot
New Boot, known devices are hp ht rk rl rp tm vt
: rl(0,0)unix
mem = 177856

man icheck

ICHECK(1M) UNIX Programmer’s Manual ICHECK(1M)

NAME
icheck - file system storage consistency check

SYNOPSIS
icheck [ -s ] [ -b numbers ] [ filesystem ]

DESCRIPTION
Icheck examines a file system, builds a bit map of used
blocks, and compares this bit map against the free list
maintained on the file system. If the file system is not
specified, a set of default file systems is checked. The
normal output of icheck includes a report of

       The total number of files and the numbers of regular,
       directory, block special and character special files.

       The total number of blocks in use and the numbers of
       single-, double-, and triple-indirect blocks and direc-
       tory blocks.

       The number of free blocks.

       The number of blocks missing; i.e. not in any file nor
       in the free list.

  The -s option causes icheck to ignore the actual free list
  and reconstruct a new one by rewriting the super-block of
  the file system.  The file system should be dismounted while
  this is done; if this is not possible (for example if the
  root file system has to be salvaged) care should be taken
  that the system is quiescent and that it is rebooted immedi-
  ately afterwards so that the old, bad in-core copy of the
  super-block will not continue to be used.  Notice also that
  the words in the super-block which indicate the size of the
  free list and of the i-list are believed.  If the super-
  block has been curdled these words will have to be patched.
  The -s option causes the normal output reports to be
  suppressed.

  Following the -b option is a list of block numbers; whenever
  any of the named blocks turns up in a file, a diagnostic is
  produced.

  Icheck is faster if the raw version of the special file is
  used, since it reads the i-list many blocks at a time.

FILES
Default file systems vary with installation.

SEE ALSO
dcheck(1), ncheck(1), filsys(5), clri(1)

DIAGNOSTICS
For duplicate blocks and bad blocks (which lie outside the
file system) icheck announces the difficulty, the i-number,
and the kind of block involved. If a read error is encoun-
tered, the block number of the bad block is printed and
icheck considers it to contain 0.  `Bad freeblock' means
that a block number outside the available space was encoun-
tered in the free list.  `n dups in free' means that n
blocks were found in the free list which duplicate blocks
either in some file or in the earlier part of the free list.

BUGS
Since icheck is inherently two-pass in nature, extraneous
diagnostics may be produced if applied to active file sys-
tems.
It believes even preposterous super-blocks and consequently
can get core images.

man ncheck

NCHECK(1M) UNIX Programmer’s Manual NCHECK(1M)

NAME
ncheck - generate names from i-numbers

SYNOPSIS
ncheck [ -i numbers ] [ -a ] [ -s ] [ filesystem ]

DESCRIPTION
Ncheck with no argument generates a pathname vs. i-number
list of all files on a set of default file systems. Names
of directory files are followed by `/.'.  The -i option
reduces the report to only those files whose i-numbers fol-
low.  The -a option allows printing of the names `.' and
`..', which are ordinarily suppressed.  The -s
option reduces the report to special files and files with
set-user-ID mode; it is intended to discover concealed vio-
lations of security policy.

  A file system may be specified.

  The report is in no useful order, and probably should be
  sorted.

SEE ALSO
dcheck(1), icheck(1), sort(1)

DIAGNOSTICS
When the filesystem structure is improper, `??' denotes the
`parent' of a parentless file and a pathname beginning with
`...' denotes a loop.


Tobi


http://www.pinkjuice.com/

On this old box,
sudo find / > /dev/null
Takes 61 seconds so I don’t have any hope of avoiding the C code.

How long would you expect an enumeration of the names of every file and
directory visible on the system to take?

Depends on whether you are walking the directory tree or just scanning raw
file system information.

(Note that the fact that you are redirecting stdout to /dev/null in your
shell session is not apparent to find(1) so it can’t do the obvious
optimization.)

Redirecting was just to avoid cluttering up the timing with screen issues.

It got me wondering, though. Are the unix file inquiry tools just really
slow on Mac OSX or are they slow everywhere?

In my experience (I’m not an OS/X user) the unix command-line tools are
not slow, though that statement presupposes knowing the right tool for
the task at hand [0].

“Slow” as in “takes an amount of time perceptible to the user.”

Did Unix sacrifice efficiency for portability?

I've learned on-list that some other *nixen have optimized equivalents of
find, but they are always called something other than 'find'.

To follow up on an idea suggested in the previous thread, maybe Ruby should
have a ‘scan’ or something similar that behaves like find but uses the fast
form if it’s available. Scan could also be mixed into Dir. Any interest?
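One possible shape for it, with hypothetical names (Dir.scan and
Dir.register_fast_scanner are not an existing API, just a sketch of the
fallback idea):

  # Use a platform-specific fast walker when one has been registered (say,
  # an extension built on FSIterators); otherwise fall back to the portable
  # Find-based walk.
  require 'find'

  def Dir.register_fast_scanner(callable)
    @fast_scanner = callable
  end

  def Dir.scan(root, &block)
    if @fast_scanner
      @fast_scanner.call(root, &block)
    else
      Find.find(root, &block)
    end
  end

  # Portable use today:
  Dir.scan('/usr/local') { |path| puts path if File.executable?(path) }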

···

On 6/14/02 3:34 PM, “Rick Bradley” rick@rickbradley.com wrote:

Cogito Ergo Spud. - I think therefore I yam.

I think what you’d find is that the first time you did
SELECT * FROM FILES
it would be about as slow as running “find -print /” the first time.

Both are views on quite efficient database systems.

The main difference where the RDBMS might gain a bit of extra speed
would be in that it would likely collect file data into a somewhat
more compact set of disk blocks.

Modern developments have involved a lot of convergence between the way
filesystems work and the way relational databases work…

···

A long time ago, in a galaxy far, far away, “Daniel P. Zepeda” daniel@zepeda-zone.net wrote:

I guess my real contribution to this thread: I recently saw a post about
Ruby FAM. I thought it would be cool to write a tool that kept the
filesystem information in an RDBMS, like PostgreSQL, using FAM to keep it
updated, and also giving the ability to update the actual filesystem when
the database is updated. It would be really fast and easy to use with a
lot more features when compared to slocate. I’ve started to give it a
shot, but what do you all think?


(reverse (concatenate 'string “gro.gultn@” “enworbbc”))
http://www3.sympatico.ca/cbbrowne/nonrdbms.html
“Never make any mistaeks.” (Anonymous, in a mail discussion about to a
kernel bug report.)

Indeed. For my whole system of 280533 files and 17820 directories (approx.
35GB), 0.2 seconds works out to be about 700ns per file, or 11000ns
(0.011ms) per directory, whichever way you want to count. This seems really
quick in terms of IO, where my guess would be that at least 25% of the
accesses would be not cached and not contiguous to any other accesses
happening around the same time. Not to mention the overhead of each OS call
(especially if something else is going on on the system), permissions
checking on every file you access, etc.
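A quick sanity check of those figures in Ruby, assuming the whole walk really
took 0.2 seconds:

  files = 280_533
  dirs  = 17_820
  total = 0.2                                             # seconds

  printf "%.0f ns per file\n",      total / files * 1e9   # ~713 ns
  printf "%.0f ns per directory\n", total / dirs  * 1e9   # ~11223 ns (~0.011 ms)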

Even if my collective hard-drive space is larger than what the original poster
was talking about, the numbers seem pretty reasonable for today's computer
users, so I'm taking 200ms as a number that is desirable rather than one that
was actually measured on my system.

‘time find / > /dev/null’ on all of this takes about 3sec after running it once
and probably a minute or so (I didn’t really time it) when I haven’t run it
recently. Sure both could be optimised down perhaps as much as an order of
magnitude, but then you have some specialized tool that does really quick inode
walks, not anything particularly useful to the world. Find is meant to be
useful to the world. So perhaps it is a performance tradeoff; generality vs
efficiency.

I personally don’t think it loses much. Anyone wanting to quickly get this
sort of information (a complete list of every file on the hd) on a regular
basis should either cache it and update the cache every night or hour or
whatever (a la locate/updatedb), accept the penalty of having your entire
directory structure cached all the time, or wait 10 years and hope that file
counts/fs complexity don’t scale up with IO speed.

-kyle

···

On Sat, Jun 15, 2002 at 01:26:11PM +0900, Daniel P. Zepeda wrote:

I think we’re talking about apples and oranges here. Your code walks the
entire filesystem tree in two tenths of a second? I think not.



With this sign I banish your magic!
(Wagner)

unix newbie failing miserably here:

[localhost:DLs/compressed/opengl] user% ls
ChangeLog README.EUC glu.c ogl.c rbogl.h
MANIFEST extconf.rb glut.c rbogl.c sample
[localhost:DLs/compressed/opengl] user% ruby extconf.rb
checking for () in -lGL… yes
checking for () in -lGLU… yes
creating Makefile
checking for XAllowDeviceEvents() in -lXi… no
checking for () in -lglut… yes
creating Makefile
[localhost:DLs/compressed/opengl] user% make
Now Making glut extend module
gcc -fno-common -no-cpp-precomp -flat_namespace -pipe -no-precomp -I.
-I. -I/usr/local/lib/ruby/1.6/powerpc-darwin5.3 -I.
-I/usr/local/include -I/usr/X11R6/include -c -o glut.o glut.c
make[1]: gcc: Command not found
make[1]: *** [glut.o] Error 127
make: *** [glut.bundle] Error 2
[localhost:DLs/compressed/opengl] user% ls
ChangeLog Makefile Makefile.ogl extconf.rb glut.c
ogl.c rbogl.h
MANIFEST Makefile.glut README.EUC glu.c mkmf.log
rbogl.c sample

what am i doing wrong?

I've learned on-list that some other *nixen have optimized equivalents of
find, but they are always called something other than 'find'.

Define “equivalents”. If by that we mean “return lists of filenames
matching a substring request” then there are typically optimized
variants, and one can imagine a number of other optimizations for the
“where’s my file?” crowd. However, find is much more powerful than
this, and this power is the reason for its continued presence in
(?:/usr)?/bin. Think of find as the Swiss Army Knife for traversing,
selecting, and acting upon filesystem nodes.

Here are a few of the many find invocations from my bash history over
the past day or so:

cat /tmp/missing | sed 's/^\(.*\)/echo "\1"; find split_new -type f -exec grep -l "\1" {} \;/' | sh -s

find split_new -type f -name '*.new' -exec perl -i.orig -pne '$_ .= ">" if (m:<par/$:)' {} \;

find merge -newer timestamp -exec cp {} …/complete \; ; touch timestamp

That’s ignoring the less frequent usage of [acm]time, user, perm, prune,
etc., arguments that have expired from my history, but which are
littered about crontabs far and wide.

[ forgive the senseless use of ‘cat’ – it is easier to prepend the
input source when building command-lines than to worry about the
wasted process; oh, and forgive that perl snippet – I’m still
getting acclimated to Ruby… ]
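For the Ruby-acclimation angle, here is a rough Ruby counterpart of the third
invocation above (a sketch under the assumption that the elided destination is
../complete; it also restricts itself to regular files, which the find version
does not):

  # Copy everything under merge/ that is newer than the timestamp file to
  # ../complete, then refresh the timestamp.
  require 'find'
  require 'fileutils'

  stamp = File.mtime('timestamp')
  Find.find('merge') do |path|
    next unless File.file?(path) && File.mtime(path) > stamp
    FileUtils.cp(path, '../complete')
  end
  FileUtils.touch('timestamp')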

Rick

···


http://www.rickbradley.com                        MUPRN: 950 (77F/82F)
                      | regex function like
  random email haiku  | this: prefix_index.php the file,
                      | that should be cached.

I think we’re talking about apples and oranges here. Your code walks the
entire filesystem tree in two tenths of a second? I think not.

Indeed. For my whole system of 280533 files and 17820 directories (approx.
35GB), 0.2 seconds works out to be about 700ns per file, or 11000ns
(0.011ms) per directory, whichever way you want to count. This seems really
quick in terms of IO, where my guess would be that at least 25% of the
accesses would be not cached and not contiguous to any other accesses
happening around the same time. Not to mention the overhead of each OS call
(especially if something else is going on on the system), permissions
checking on every file you access, etc.

Guys, I said it was HFS+. The file system is a B*-tree in memory. And the
code doesn’t just walk it, it searches it, recording the path for files
which match the search criteria. In this case the search criterion is simply
whether the user running the search has execute permission for the file,
either because she owns it and the owner execute bit is set or because she
is a member of the owner’s group and the group execute bit is set or because
anyone can execute the file.
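Spelled out against the stat mode bits, that criterion is just the following
(a sketch; File.executable? asks the kernel the same question in one call):

  # Owner bit if you own the file, group bit if you are in its group,
  # otherwise the "other" execute bit.
  def executable_by?(stat, uid, gids)
    if stat.uid == uid
      stat.mode & 0100 != 0        # owner execute
    elsif gids.include?(stat.gid)
      stat.mode & 0010 != 0        # group execute
    else
      stat.mode & 0001 != 0        # other execute
    end
  end

  st = File.stat('/bin/ls')
  puts executable_by?(st, Process.uid, Process.groups)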

‘time find / > /dev/null’ on all of this takes about 3sec after running it
once

Doesn’t get any faster here. It must be just a cache size issue because it
gets a lot faster for any given subdirectory.

···

On 6/16/02 11:22 AM, “Kyle Rawlins” rawlins@cs.umass.edu wrote:

On Sat, Jun 15, 2002 at 01:26:11PM +0900, Daniel P. Zepeda wrote:


As an adolescent I aspired to lasting fame, I craved factual certainty, and
I thirsted for a meaningful vision of human life - so I became a scientist.
This is like becoming an archbishop so you can meet girls. -Matt Cartmill,
anthropology professor and author (1943- )

A long time ago, in a galaxy far, far away, "Daniel P. Zepeda"
daniel@zepeda-zone.net wrote:

I guess my real contribution to this thread: I recently saw a post about
Ruby FAM. I thought it would be cool to write a tool that kept the
filesystem information in an RDBMS, like PostgreSQL, using FAM to keep it
updated, and also giving the ability to update the actual filesystem when
the database is updated. It would be really fast and easy to use with a
lot more features when compared to slocate. I've started to give it a
shot, but what do you all think?

I think what you’d find is that the first time you did
SELECT * FROM FILES
it would be about as slow as running “find -print /” the first time.

I agree. But I think you missed the point. It would only have to traverse
the filesystem once, on installation, then after that FAM would keep it
updated.

···

On Sat, 15 Jun 2002 14:34:17 +0900 Christopher Browne cbbrowne@acm.org wrote:

Both are views on quite efficient database systems.

The main difference where the RDBMS might gain a bit of extra speed
would be in that it would likely collect file data into a somewhat
more compact set of disk blocks.

Modern developments have involved a lot of convergence between the way
filesystems work and the way relational databases work…

(reverse (concatenate 'string “gro.gultn@” “enworbbc”))
http://www3.sympatico.ca/cbbrowne/nonrdbms.html
“Never make any mistaeks.” (Anonymous, in a mail discussion about to a
kernel bug report.)


“Daniel P. Zepeda” <daniel@z,e,p,e,d,a,-,z,o,n,e.net>
(Remove commas for address)

Hi,

···

At Sat, 15 Jun 2002 12:45:21 +0900, ccos wrote:

make[1]: gcc: Command not found

You haven't installed a C compiler.


Nobu Nakada

No ‘gcc’ in your path? If it’s Darwin you’re using, try something like:
sudo ln -s /usr/bin/cc /usr/local/bin/gcc

Hope this helps,

Alwyn

···

In EDF421EF-8011-11D6-BEEF-000393722276@bigpond.com ccos wrote:

make[1]: gcc: Command not found

Rick Bradley wrote:

However, find is much more powerful than
this

Funny thing:

man find

[…] UNIX Programmer’s Manual
[…]
BUGS
The syntax is painful.

Tobi

···


http://www.pinkjuice.com/

Chris Gehlker gehlker@fastq.com writes:

Guys, I said it was HFS+. The file system is a B*-tree in
memory. And the code doesn’t just walk it, it searches it, recording
the path for files which match the search criteria. In this case the
search criterion is simply whether the user running the search has
execute permission for the file, either because she owns it and the
owner execute bit is set or because she is a member of the owner’s
group and the group execute bit is set or because anyone can execute
the file.

Right, but find does not know/care what the filesystem type is –
using the regular user-mode file i/o interface isn’t going to give you
access to the B-trees. You could perhaps write an extension to find
which uses this API, but it would be MacOS specific. (or rather,
filesystem specific)

It’s not that the tools are slow, it’s that find does not use any sort
of API which allows for direct access to the filesystem information.

ttyl,

···


Josh Huber

Centuries ago, Nostradamus foresaw when “Daniel P. Zepeda” daniel@zepeda-zone.net would write:

A long time ago, in a galaxy far, far away, "Daniel P. Zepeda"
daniel@zepeda-zone.net wrote:

I guess my real contribution to this thread: I recently saw a post about
Ruby FAM. I thought it would be cool to write a tool that kept the
filesystem information in an RDBMS, like PostgreSQL, using FAM to keep it
updated, and also giving the ability to update the actual filesystem when
the database is updated. It would be really fast and easy to use with a
lot more features when compared to slocate. I've started to give it a
shot, but what do you all think?

I think what you’d find is that the first time you did
SELECT * FROM FILES
it would be about as slow as running “find -print /” the first time.

I agree. But I think you missed the point. It would only have to traverse
the filesystem once, on installation, then after that FAM would keep it
updated.

No, I’m not missing the point. I’m adding one that you seem to be missing.

In both cases, it would be necessary to pull a similar number of
blocks of data from disk, because in both cases, there is a database
being treated as a directory structure.

In the case of the SQL database, there might be some savings in
time, from two perspectives:

a) The set of disk blocks might be somewhat more compact;

b) If a selection criterion was being used that would allow only
selecting a tiny subset, then only part of the data would get
searched.

Of course, b) isn’t the case in “SELECT * FROM FILES”.

If an SQL database is being used, then perhaps the filesystem
doesn’t have to get traversed, but that certainly doesn’t eliminate
the need to traverse the database.

And if the database contains much the same set of data (which it
does), organized in a manner not vastly dissimilar (which would be the
case), then the differences in behaviour won’t be vastly different.

Creating a DBMS doesn’t magically save you time. It comes at a cost.
Querying a database comes at a cost too.
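Where a database-backed index clearly wins is case b) above: a selective query
over an indexed column. A small sketch, using SQLite via the sqlite3 bindings
so it is self-contained (PostgreSQL would look much the same; the table and
column names are invented for the example):

  # SELECT * still reads every row, so it behaves like a full traversal;
  # a selective query over an indexed column only touches the matching rows.
  require 'sqlite3'

  db = SQLite3::Database.new('files.db')
  db.execute('CREATE TABLE IF NOT EXISTS files (path TEXT, name TEXT, size INTEGER)')
  db.execute('CREATE INDEX IF NOT EXISTS files_name ON files (name)')

  everything = db.execute('SELECT * FROM files')                             # full scan
  ruby_paths = db.execute('SELECT path FROM files WHERE name = ?', ['ruby']) # index lookup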

···

On Sat, 15 Jun 2002 14:34:17 +0900, Christopher Browne cbbrowne@acm.org wrote:

(reverse (concatenate 'string “moc.enworbbc@” “enworbbc”))
http://www3.sympatico.ca/cbbrowne/
Rules of the Evil Overlord #168. “I will plan in advance what to do
with each of my enemies if they are captured. That way, I will never
have to order someone to be tied up while I decide his fate.”
http://www.eviloverlord.com/

i have installed it.
it may not be called gcc though?

···

On Saturday, June 15, 2002, at 02:26 PM, nobu.nokada@softhome.net wrote:

Hi,

At Sat, 15 Jun 2002 12:45:21 +0900, ccos wrote:

make[1]: gcc: Command not found

You haven't installed a C compiler.


Nobu Nakada

yeah it is darwin, and the compiler is cc,
i will try what you have suggested.
thanks everybody

···

On Saturday, June 15, 2002, at 04:14 PM, Alwyn wrote:

In EDF421EF-8011-11D6-BEEF-000393722276@bigpond.com ccos wrote:

make[1]: gcc: Command not found

No ‘gcc’ in your path? If it’s Darwin you’re using, try something like:
sudo ln -s /usr/bin/cc /usr/local/bin/gcc

Hope this helps,

Alwyn

Funny thing:

man find

[…] UNIX Programmer’s Manual
[…]
BUGS
The syntax is painful.

BUGS
The special characters used by find are also special characters to many
shell programs.  In particular, the characters ``*'', ``['', ``]'',
``?'', ``('', ``)'', ``!'', ``\'' and ``;'' may have to be escaped from
the shell.

 As there is no delimiter separating options and file names or file names
 and the expression, it is difficult to specify files named -xdev or !.
 These problems are handled by the -f option and the getopt(3) ``--'' con-
 struct.

 The -delete primary does not interact well with other options that cause
 the filesystem tree traversal options to be changed.

Yours must be more recent than mine (as that seems to be a more succinct
way of saying the above). ;-)

Rick

···


http://www.rickbradley.com                        MUPRN: 521 (84F/88F)
                      | and that they better
  random email haiku  | get a move on. Oh yeah, speed
                      | is important too.