Slow IO

This works very quickly with sets of files that are in directories
containing (say) 90k documents or fewer. But when there are 200k+
documents in the directories, it begins to take a substantial amount
of time.

Suffice it to say that few file systems are optimized for the case
of 200k files per directory.

Fair enough. But we sometimes get directories with several hundred
thousand records.

From your example, I’m guessing that you’re using Windows, which I
don’t know much about. But my advice is to restructure your program
to use more, smaller directories.

We don’t create the input files or their structure; we are simply
processing them. We get media that contain them, along with text
files listing the files and their relationships. If we moved the
files, we’d have to modify the description files, which would be a
time-consuming and error-prone task.

We’re using Redhat’s most recent linux (on a Dual Athlon with 2GB RAM).
Why would you think I was using Windows from the example?

···

On Feb 13, 2004, at 1:04 PM, J.Herre wrote:

On Feb 13, 2004, at 9:16 AM, David King Landrith wrote:


David King Landrith
(w) 617.227.4469x213
(h) 617.696.7133

One useless man is a disgrace, two
are called a law firm, and three or more
become a congress – John Adams

public key available upon request

Sorry, no offense intended. I read this as a forward slash, i.e. a
Windows command-line switch. My bad.

system("echo \f >> #{outputFile}")

It’s possible that you’re wasting time in the garbage collector. Try
using a static buffer:

buf = ""                      # allocate the buffer once, outside the read loop
input.sysread( count, buf )   # fills buf in place instead of allocating a new String
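
For example, a minimal sketch of the idea as a copy loop (the names
input and output and the 64K chunk size are illustrative):

buf = ""
begin
  loop do
    input.sysread( 65536, buf )   # refills buf in place, no new String per pass
    output.syswrite( buf )
  end
rescue EOFError
  # sysread signals end of file by raising EOFError
end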

···

On Feb 13, 2004, at 10:55 AM, David King Landrith wrote:

We’re using Redhat’s most recent linux (on a Dual Athlon with 2GB RAM).
Why would you think I was using Windows from the example?

Hi,

At Sat, 14 Feb 2004 03:55:05 +0900,
David King Landrith wrote in [ruby-talk:92806]:

We’re using Redhat’s most recent linux (on a Dual Athlon with 2GB RAM).

On what filesystem did you execute it? EXT[23]-fs is not
efficient for a huge directory.

···


Nobu Nakada

with htree it’s perfectly fine. try a tune2fs.
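
On ext[23] that’s something like this (device name illustrative;
run the e2fsck pass with the filesystem unmounted, and dir_index
needs a reasonably recent e2fsprogs):

tune2fs -O dir_index /dev/hda1   # enable htree directory indexing
e2fsck -fD /dev/hda1             # rebuild and index existing directories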

···

On Saturday 14 February 2004 09:49, nobu.nokada@softhome.net wrote:

Hi,

At Sat, 14 Feb 2004 03:55:05 +0900,
David King Landrith wrote in [ruby-talk:92806]:

We’re using Redhat’s most recent linux (on a Dual Athlon with 2GB RAM).

On what filesystem did you execute it? EXT[23]-fs is not
efficient for a huge directory.


When in doubt, use brute force.
– Ken Thompson

Yes. If you have a partition that you can do it on (or you can put
another drive into the machine to make one), try a reiserfs partition
for your data and see whether that improves your results.

Kirk Haines

···

On Sat, 14 Feb 2004 nobu.nokada@softhome.net wrote:

At Sat, 14 Feb 2004 03:55:05 +0900,
David King Landrith wrote in [ruby-talk:92806]:

We’re using Redhat’s most recent linux (on a Dual Athlon with 2GB RAM).

On what filesystem did you execute it? EXT[23]-fs is not
efficient for a huge directory.

We’re using Reiserfs already.

···

On Feb 14, 2004, at 9:45 AM, Kirk Haines wrote:

On Sat, 14 Feb 2004 nobu.nokada@softhome.net wrote:

At Sat, 14 Feb 2004 03:55:05 +0900,
David King Landrith wrote in [ruby-talk:92806]:

We’re using Redhat’s most recent linux (on a Dual Athlon with 2GB RAM).

On what filesystem did you execute it? EXT[23]-fs is not
efficient for a huge directory.

Yes. If you have a partition that you can do it on (or you can put
another drive into the machine to make one), try a reiserfs partition
for your data and see whether that improves your results.

Kirk Haines


David King Landrith
(w) 617.227.4469x213
(h) 617.696.7133

Generic quotes offend nobody.
–Unknown

public key available upon request

Well, dang. Did the threads about your code possibly hitting garbage
collection help out? This is an interesting one. Please let us know what
you figure out.

Thanks,

Kirk Haines

···

On Sun, 15 Feb 2004, David King Landrith wrote:

We’re using Reiserfs already.

I had planned on going ahead and simply writing something in C (I’ve
already written about 2000 lines of C code for our program to optimize
routine operations and save memory). I’d probably just mmap the file
and write the mapped region out to another file, since this would be
the fastest way to do the IO.

At any rate, just before sitting down to write this, I had our
sysadmin upgrade our ruby install to 1.8.1, since I wanted to move the
latest C code from our dev server, and it used the new rb_str_buf
(this was for a portion of our program entirely unrelated to our IO
issue). I then went to make one last run to get a “before” benchmark,
and I discovered that ruby 1.8.1 fixed the problem.

I may still write an ad hoc C function to do a direct copy, just to
see how fast I can make it, but that would be a spare-time project. I
haven’t had time to dig any deeper.

And now for something completely different…

Incidentally, I grabbed the Judy array code off of the RAA and
started playing with it. I haven’t taken a close look at the C code
that implements the JudyHash, since the JudySL is functionally
identical to what people use hashes for 99.999% of the time (viz.,
most people create hashes that use string keys of a reasonable length)
and is 11% to 15% faster than a ruby hash.

I spent a few hours cleaning up the C code in the RAA ruby-to-Judy
interface, and I was able to get about an 8% improvement in both the
JudySL (which I’ve renamed “KeyedList” for use within ruby) and the
JudyL (which I’ve renamed “NumberedList”; it’s essentially a sparse
array with non-contiguous indices).
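
For a sense of the interface, here’s a minimal sketch (the class
names are my renamings above, but the require name and the exact
hash-like methods are illustrative, not the extension’s documented
API):

require 'judy'            # illustrative name for the cleaned-up extension

kl = KeyedList.new        # JudySL: string keys, used just like a Hash
kl["alpha"] = 1
kl["beta"]  = 2
puts kl["alpha"]          # => 1

nl = NumberedList.new     # JudyL: sparse array, non-contiguous integer indices
nl[7]         = "seven"
nl[5_000_000] = "sparse indices are cheap"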

At any rate, the JudyArrays are really, really fast, and they scale
like the dickens (we’re dealing with hashes that have hundreds of
thousands of members; I’d like to be able to move to the millions, and
JudyArrays look like they’ll get us there). Moreover, the HP Judy
libraries (as well as the updates from the Debian folks) are
exceptionally well designed.

Has anyone else had any luck playing with the Judy libraries? Is
there any reason not to make them part of the standard library?

Lastly, the benchmarks I’ve been running on the JudyArrays indicate
that ruby 1.8.x hashes are much more scalable (i.e., lookup and insert
times increase linearly with size) than ruby 1.6.x hashes. Is this
correct?
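
For concreteness, the kind of scaling benchmark I mean looks roughly
like this (a minimal sketch; the sizes and key format are
illustrative):

require 'benchmark'

[100_000, 200_000, 400_000].each do |n|
  h = {}
  insert = Benchmark.realtime { n.times { |i| h["key#{i}"] = i } }
  lookup = Benchmark.realtime { n.times { |i| h["key#{i}"] } }
  printf("%8d  insert: %.2fs  lookup: %.2fs\n", n, insert, lookup)
end

If the times stay roughly proportional to n as n doubles, the
per-operation cost is flat.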

···

On Feb 15, 2004, at 4:49 AM, Kirk Haines wrote:

Well, dang. Did the threads about your code possibly hitting garbage
collection help out? This is an interesting one. Please let us know
what you figure out.


David King Landrith
(w) 617.227.4469x213
(h) 617.696.7133

Life is tough. It’s even tougher if you’re
an idiot. --John Wayne

public key available upon request