Large File Reading and Writing

Hi,

A project that I've currently got at work consists of reading large files
and writing another larger file. Unfortunately the machine I'm on is pretty
lame and it runs out of RAM pretty quickly and so takes a long time. (This is
currently in Perl and I'm not keen to play with it in Perl.)

I have had a look around the docs and done a quick search of the mailing
list for how to do buffered reading and writing, but I've come up with
nothing.

I have a bunch of text files. I want to read some lines, and then write a
line, then move on to the next line etc., but I only want to have a few lines
in memory at once. Any clues as to how to achieve this?

Cheers
Daniel

Something like this maybe...

history = []
ARGF.each do |line|
   history << line
   if enough_history_to_generate_some_output
     write_output
     history.clear
   end
end
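For what it's worth, here is a runnable sketch of the same idea. The names each_batch and batch_size are made up for illustration, "the batch is full" stands in for enough_history_to_generate_some_output, and the block passed in stands in for write_output. It uses StringIO so it can run without any files; ARGF or an open File should work the same way, since they also respond to each_line.

```ruby
require "stringio"

# Hypothetical helper: gather input lines into small batches and hand each
# batch to the caller, so at most batch_size lines are held at once.
def each_batch(input, batch_size)
  history = []
  input.each_line do |line|
    history << line
    if history.size >= batch_size   # stand-in for enough_history_to_generate_some_output
      yield history
      history = []                  # start a fresh batch; the old one can be collected
    end
  end
  yield history unless history.empty?   # flush any leftover lines
end

input  = StringIO.new("a\nb\nc\nd\ne\n")
output = StringIO.new

each_batch(input, 2) do |lines|
  # stand-in for write_output: one output line per batch
  output.puts lines.map(&:chomp).join(",")
end

puts output.string
```

The sketch starts a new array after each batch rather than calling history.clear, so a caller that keeps a reference to a batch does not see it emptied underneath it.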

The ARGF thingy is explained in 'ri IO':

      The global constant ARGF (also accessible as $<) provides an
      IO-like stream which allows access to all files mentioned on the
      command line (or STDIN if no files are mentioned). ARGF provides
      the methods #path and #filename to access the name of the file
      currently being read.

You can of course open a file by name from your code.

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Thanx. I'm not sure that my head is really around it though...

Pls see inline

Something like this maybe...

history = []
ARGF.each do |line|
   history << line
   if enough_history_to_generate_some_output
     write_output
     history.clear
   end
end

Does the output file not stay in memory? In my case the output file is
almost a concatenation of large files so once I've written a line I don't
really want to keep that line in memory.

You can of course open a file by name from your code.

I think it will be easier if I specify the files from within Ruby by
feeding it only a directory. Once I have my list of files, suppose I open
one of them like this:

File.open( "my_file", "r" )

Is there a way to buffer this input so that the entire file is not read?

Sorry if this is implicit in your last reply but I don't understand it if it
is.

Cheers

On 10/13/06, Joel VanderWerf <vjoel@path.berkeley.edu> wrote:

Daniel N wrote:

Does the output file not stay in memory? In my case the output file is
almost a concatenation of large files so once I've written a line I don't
really want to keep that line in memory.

Instead of the line

        write_output

let's say you have something like

        output_line = ...
        puts output_line

Each time these lines are executed, you have a variable that refers to the _current_ line of output, but there is no reference to the string that was printed last time around. This means that the garbage collector can reclaim that space if it needs to. So the whole output file need not be kept in memory.
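Here is a small sketch of that, in case it helps. The method names transform and stream_copy are invented for the example (transform stands in for whatever computes output_line), and the Tempfile setup is only there so the snippet runs on its own.

```ruby
require "tempfile"

# Hypothetical transform, standing in for `output_line = ...`.
def transform(line)
  line.chomp.upcase
end

# Copy src_path to dst_path one line at a time. Only the current line is
# referenced at any moment, so earlier lines are free to be collected.
def stream_copy(src_path, dst_path)
  count = 0
  File.open(dst_path, "w") do |out|
    File.foreach(src_path) do |line|
      output_line = transform(line)
      out.puts output_line
      count += 1
    end
  end
  count   # number of lines written
end

# Demo input and output files (temporary, just for this sketch).
src = Tempfile.new("in")
src.write("one\ntwo\nthree\n")
src.close
dst = Tempfile.new("out")
dst.close

puts stream_copy(src.path, dst.path)
puts File.read(dst.path)
```

The output goes to the file as you write it (through a small internal buffer), so the size of the output file has no bearing on how much memory the script holds.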

You can of course open a file by name from your code.

I think it will be easier if I specify the files from within Ruby by
feeding it only a directory. Once I have my list of files, suppose I open
one of them like this:

File.open( "my_file", "r" )

Is there a way to buffer this input so that the entire file is not read?

Sure. If you use IO#gets (or IO#each, as above), then only one line at a time is read.

  File.open( "my_file", "r" ) do |f|
    f.each do |line|
      ... # do something with line
    end
  end
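Since you mentioned feeding it only a directory, here is a rough sketch of how that could look. The method name concat_dir, the "*.txt" pattern, and the file names are all assumptions for the example; adjust them to whatever your files are actually called. The temporary directory is only there so the snippet can run by itself.

```ruby
require "tmpdir"

# Hypothetical helper: append every *.txt file in dir (sorted by name) to
# out_path, reading and writing one line at a time.
def concat_dir(dir, out_path)
  File.open(out_path, "w") do |out|
    Dir.glob(File.join(dir, "*.txt")).sort.each do |path|
      File.open(path, "r") do |f|
        f.each do |line|
          out.write(line)   # do something with line, then let it go
        end
      end
    end
  end
end

# Demo: build two small input files in a temporary directory.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "a.txt"), "first\n")
  File.write(File.join(dir, "b.txt"), "second\nthird\n")
  out_path = File.join(dir, "combined.out")
  concat_dir(dir, out_path)
  puts File.read(out_path)
end
```

The output file is given a different extension here so the glob does not pick it up as an input.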

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Great Thanx Joel. That clears it up for me.