Slow IO

Hello everyone,

I’m having an IO problem. The code below is designed to read a series
of text files (most of these are about 8k) and output them into a
single text file separated by form feed characters. This works very
quickly with sets of files that are in directories containing (say) 90K
documents or fewer. But when there are 200k+ documents in the
directories, it begins to take a substantial amount of time.

The main problem is that sysread seems to be painfully slow with files
in very large directories. At this point, it would be faster to do the
following:

system("cat #{input_file} >> #{outputFile}")
system("echo \f >> #{outputFile}")

It seems to me that there is no way that this should be faster than
doing a sysread.
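
One way to narrow down where the time goes is to time the open() calls
separately from the reads, along these lines (the directory name and file
count here are only placeholders):

require 'benchmark'

# Hypothetical check: time opening a sample of files from one large
# directory separately from reading them, to see which part is slow.
files = Dir.glob('big_dir/*.txt').first(500)

Benchmark.bm(6) do |bm|
    handles = []
    bm.report('open') { files.each { |f| handles << File.open(f) } }
    bm.report('read') { handles.each { |io| io.read; io.close } }
end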

Any help would be appreciated.

Best,

Dave

– begin code (This has been cleaned a bit and changed to protect the
innocent)

# the docInfo object is a wrapper for a pages array, with some additional info

outputFile = docInfo.outputFile
output = nil

isOpen = false

chunk = (10240 * 2) # 20k
fmode = File::CREAT|File::TRUNC|File::WRONLY

begin

 docInfo.each do | pageInfo |
     pageNo = pageInfo.pageNo
     start  = 0
     count  = chunk

     begin
         # open source
         input    = File.open(pageInfo.inputFile)
         fileSize = input.stat.size

         # open destination if not already open
         unless isOpen
             output = File.open(outputFile, fmode, 0664)
             isOpen = true
         end

         # read the file in chunks, never asking for more bytes than remain
         while start < fileSize
             count = (fileSize - start) if (start + chunk) > fileSize
             output.syswrite(input.sysread(count))
             start += count
         end
         output.syswrite("\f")

     ensure
         begin
             input.close
         rescue Exception => err
             STDERR << "WARNING: couldn't close #{pageInfo.inputFile}\n"
         end
     end
 end

ensure
    begin
        output.close if isOpen
    rescue Exception
        STDERR << "WARNING: couldn't close #{outputFile}\n"
    end
end
–end code

Suffice it to say that few file systems are optimized for the case of >
200k files per directory. From your example, I’m guessing that you’re
using Windows, which I don’t know much about. But my advice is to
restructure your program to use more, smaller directories.

Seriously, that’s a lot of files…
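
One common way to do that is to bucket the files into nested
subdirectories keyed on a hash of the document id; a rough sketch (the
helper name and paths below are made up):

require 'digest/md5'
require 'fileutils'

# Illustrative helper: spread documents over two levels of subdirectories
# keyed on a hash of the id, e.g. docs/3f/2a/page_000123.txt, so that each
# directory stays small.
def bucketed_path(root, id)
    hex = Digest::MD5.hexdigest(id.to_s)
    File.join(root, hex[0, 2], hex[2, 2], "#{id}.txt")
end

path = bucketed_path('docs', 'page_000123')
FileUtils.mkdir_p(File.dirname(path))
File.open(path, 'w') { |f| f << 'page contents here' }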

-J

···

On Feb 13, 2004, at 9:16 AM, David King Landrith wrote:

This works very quickly with sets of files that are in directories
containing (say) 90K documents or fewer. But when there are 200k+
documents in the directories, it begins to take a substantial amount of time.

Date: Sat, 14 Feb 2004 02:16:05 +0900
From: David King Landrith dlandrith@mac.com
Newsgroups: comp.lang.ruby
Subject: slow IO

– begin code (This has been cleaned a bit and changed to protect the
innocent)

# the docInfo object is a wrapper for a pages array, with some additional info

outputFile = docInfo.outputFile
output = nil

isOpen = false

chunk = (10240 * 2) # 20k
fmode = File::CREAT|File::TRUNC|File::WRONLY

begin

 docInfo.each do | pageInfo |
     pageNo = pageInfo.pageNo
     start  = 0
     count  = chunk
     input = output = nil
     begin
         # open source
         input    = File.open(pageInfo.inputFile)
         fileSize = input.stat.size

         # open destination if not already open
         unless isOpen
             output = File.open(outputFile, fmode, 0664)
             isOpen = true
         end

         while start < fileSize
             count = (fileSize - start) if (start + chunk) > fileSize
              buf = input.sysread count
              output.syswrite buf
              buf = nil
             start += count
         end
         output.syswrite("\f")
     ensure
         # you probably want to _know_ if the system is having probs
         # closing files
         input.close if input
         output.close if output
     end
 end

ensure
    begin
        output.close if isOpen
    rescue Exception
        STDERR << "WARNING: couldn't close #{outputFile}\n"
    end
end
–end code

alternatively you might be able to use the

open(path) do |f|

end

idiom with 'output'. i suspect that you were grinding to a halt with too many
output files open (they were never closed)…
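
A rough sketch of that shape, with the output opened once in a block and
each input opened in its own block so both always get closed (this reuses
the names from the posted code, but is only an outline):

File.open(outputFile, File::CREAT|File::TRUNC|File::WRONLY, 0664) do |output|
    docInfo.each do |pageInfo|
        File.open(pageInfo.inputFile) do |input|
            # read the whole page (they are only ~8k) and append it
            output.write(input.read)
        end
        output.write("\f")
    end
end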

-a

···

On Sat, 14 Feb 2004, David King Landrith wrote:

===============================================================================

EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
PHONE :: 303.497.6469
ADDRESS :: E/GC2 325 Broadway, Boulder, CO 80305-3328
URL :: Solar-Terrestrial Physics Data | NCEI
TRY :: for l in ruby perl;do $l -e "print \"\x3a\x2d\x29\x0a\"";done
===============================================================================

In Message-Id: 4CA13EBB-5E48-11D8-8E72-000393DC9B9C@mac.com
David King Landrith dlandrith@mac.com writes:

             output.syswrite(input.sysread(count))

The problem is probably here: the many Strings that are created and
discarded may trigger GC.

Can you rewrite this to

# buf should be allocated before the loop.
input.sysread(count, buf)
output.syswrite(buf)

and test its performance? Here buf is updated in place by sysread, so no
extra Strings are created.
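
Dropped into the original loop, that would look roughly like this (a
sketch only, reusing the variable names from the posted code):

buf = ''    # allocate the buffer once, before the loop
while start < fileSize
    count = (fileSize - start) if (start + chunk) > fileSize
    input.sysread(count, buf)    # fills buf in place, no new String per read
    output.syswrite(buf)
    start += count
end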

Note that this feature has been available since version 1.7.x,

where x > …well, some point :P

···


kjana@dm4lab.to February 14, 2004
Slow and steady wins the race.