I'm having an IO problem. The code below is designed to read a series
of text files (most of them about 8k) and output them into a single
text file separated by form feed characters. This works very quickly
with sets of files that are in directories containing (say) 90k
documents or fewer. But when there are 200k+ documents in the
directories, it begins to take a substantial amount of time.
The main problem is that sysread seems to be painfully slow with
files in very large directories. At this point, it would be faster to
do the following:
# docInfo, outputFile, fmode, and chunk are set up earlier in the program
isOpen = false
output = nil
begin
  docInfo.each do |pageInfo|
    pageNo = pageInfo.pageNo
    start = 0
    count = chunk
    begin
      # open source
      input = File.open(pageInfo.inputFile)
      fileSize = input.stat.size
      # open destination if not already open
      unless isOpen
        output = File.open(outputFile, fmode, 0664)
        isOpen = true
      end
      # loop to make sure that no read goes past the end of the file
      while start < fileSize
        count = (fileSize - start) if (start + chunk) > fileSize
        output.syswrite(input.sysread(count))
        start += count
      end
      output.syswrite("\f")
    ensure
      begin
        input.close
      rescue Exception => err
        STDERR << "WARNING: couldn't close #{pageInfo.inputFile}\n"
      end
    end
  end
ensure
  begin
    output.close if isOpen
  rescue Exception
    STDERR << "WARNING: couldn't close #{outputFile}\n"
  end
end
--end code
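One way to narrow down where the time goes is to time File.open
separately from the sysread calls, since the directory lookup happens
at open time. A rough, untested sketch (reusing the same docInfo
objects as in the code above):

require 'benchmark'

open_time = 0.0
read_time = 0.0
docInfo.each do |pageInfo|
  input = nil
  # time the open (this is where the directory lookup happens)
  open_time += Benchmark.realtime { input = File.open(pageInfo.inputFile) }
  # time the actual reads
  read_time += Benchmark.realtime do
    begin
      loop { input.sysread(8192) }
    rescue EOFError
      # done with this file
    end
  end
  input.close
end
STDERR << "open: #{open_time}s, read: #{read_time}s\n"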
Suffice it to say that few file systems are optimized for the case of
>200k files per directory. From your example, I'm guessing that you're
using Windows, which I don't know much about. But my advice is to
restructure your program to use more, smaller directories.
Seriously, that’s a lot of files…
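Something along these lines might work, purely as a sketch (the
two-character prefix and the pages/pages_bucketed directory names are
made up; adjust them to your layout):

require 'fileutils'
require 'digest/md5'

src  = 'pages'            # made-up name for the existing flat directory
dest = 'pages_bucketed'   # made-up name for the new directory tree

Dir.foreach(src) do |name|
  next if name == '.' || name == '..'
  # first two hex digits of an MD5 of the name => 256 evenly filled buckets
  bucket = Digest::MD5.hexdigest(name)[0, 2]
  dir = File.join(dest, bucket)
  FileUtils.mkdir_p(dir)
  FileUtils.mv(File.join(src, name), File.join(dir, name))
end

That way each directory ends up with well under a thousand entries, and
your concatenation loop only needs to know the bucketing rule to find a
file again.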
-J
On Feb 13, 2004, at 9:16 AM, David King Landrith wrote:
This works very quickly with sets of files that are in directories
containing (say) 90k documents or fewer. But when there are 200k+
documents in the directories, it begins to take a substantial amount of time.