Memory management when traversing large filesystem using Find

I am working on a project where I intend to upload >5TB of data to the cloud. I’m looking for some advice and best practice recommendations on how best to traverse a large filesystem without consuming a significant amount of memory. I would like to stay below 1GB, worst case.

Ruby 2.2+
CentOS Linux 7
aws-sdk

Right now I have the following:

Find.find('path') do |local_file|
  # upload to cloud...
end

I’ve moved all variable assignment out of the block that I can, but the process still consumes a significant amount of memory over time, starts swapping, and makes the server unusable.

Is there a better way to do this? Even generating a list of filenames would be hundreds of MBs in size, which is something I’ve considered as well.

Thanks,
Josh


What's in your "upload to cloud" stage? If you have large
files and you're slurping entire files into memory before
uploading (a common mistake), that would be a big problem.

Large allocations can cause memory fragmentation and really bad
growth. You can also try setting MALLOC_MMAP_THRESHOLD_=131072
(or similar number) for glibc malloc or try building with
jemalloc; but avoiding slurping at all is best.

ref: https://t-t-travails.blogspot.com/2009/05/mr-malloc-gets-schooled.html
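A quick sketch of passing that tunable down to a child process from Ruby (the -e one-liner is just a stand-in for the real uploader script):

```ruby
require 'rbconfig'

# Demonstration only: hand the glibc malloc tunable to a child Ruby process.
# The trailing underscore in MALLOC_MMAP_THRESHOLD_ is part of the real
# glibc variable name. Replace the -e one-liner with your uploader script.
env = { 'MALLOC_MMAP_THRESHOLD_' => '131072' }
child = [RbConfig.ruby, '-e', 'puts ENV["MALLOC_MMAP_THRESHOLD_"]']
system(env, *child)
```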

Reading Find.find, it calls Dir.entries which wastes memory by
slurping the entire directory contents into memory (and doing it
recursively by going depth-first). If you have a flat directory
structure with lots of files at any level, this is going to be
ugly...

Perhaps try using the find(1) command to limit the number of
files which enters the Ruby process at once:

  IO.popen(%W(find #{path} -print0)) do |rpipe|
    recsep = "\0".freeze # for -print0
    rpipe.each_line(recsep) do |line|
      line.chomp!(recsep)

      # upload to cloud # XXX make sure this doesn't slurp
    end
  end

find(1) seems to slurp entire directory listings into memory,
too, but I expect its internal data overhead is much less than
that of an array of Ruby strings which Find.find keeps.

Is there a better way to do this? Even generating a list of
filenames would be hundreds of MBs in size, which is something
I’ve considered as well.

General rule is that whenever you handle a large amount of data,
stream and process it in small chunks. I always consider the
data I intend to process, first, and build code around that;
never the other way around.
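As a concrete sketch of that rule (illustrative code, not from the thread; stream_digest is a made-up name): hashing a file in fixed-size chunks keeps memory proportional to the chunk size, not the file size:

```ruby
require 'digest'

CHUNK = 1024 * 1024 # 1 MiB at a time; memory use is O(CHUNK), not O(file size)

def stream_digest(path)
  sha = Digest::SHA256.new
  File.open(path, 'rb') do |io|
    buf = String.new                       # one reusable read buffer
    sha.update(buf) while io.read(CHUNK, buf) # read returns nil at EOF
  end
  sha.hexdigest
end
```

The same shape works for uploads: feed each chunk to the network instead of a digest.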

···

Josh Miller <joshua@itsecureadmin.com> wrote:

If the tree is known not to have links, perhaps you could try a manual
traversal in which descending keeps a reference to the parent node to avoid
big stacks. Memory usage would be a function of tree depth, possibly not a
problem.

If the generation of the file with filenames is doable using something
external, then of course the Ruby script would be a trivial line-oriented
program.

···

--
Sent from Gmail Mobile

I was hoping to not have to do this, but this does seem the logical way to progress, given no other options.

Thanks,

Josh Miller
ITSA Consulting, LLC
skype: itsecureadmin
https://itsecureadmin.com/

···

On Apr 15, 2016, at 11:07 AM, Xavier Noria <fxn@hashref.com> wrote:



What's in your "upload to cloud" stage? If you have large
files and you're slurping entire files into memory before
uploading (a common mistake), that would be a big problem.

The upload-to-cloud stage includes some logging statements and some checks to see whether the file has already been uploaded (allowing the process to be restarted if it is stopped for some reason), then chunks the file up to AWS Glacier: via multipart upload in 1MB parts if the file is over 1MB, or all at once if it is under 1MB (part_size is 1MB):

...
      Find.find(archive_path) do |local_file|

        if Dir.exist?(local_file)

          @logger.debug("#{local_file} not a file, not archiving")

        elsif File.exist?(local_file)


          #
          # check to see if we've already uploaded the file (recent archive index)
          # - this is a hash that was read in from a CSV file before starting the Find.find...
          #
          @logger.info("checking old vault index for this file: #{local_file}")
          if old_index_reference["#{local_file}"] = 1
            @logger.debug("file exists in old vault index: #{local_file}")
            upload_done = 1
          end

          if upload_done < 1
            @logger.info("archiving #{local_file} to #{vault_name})")
            archive_id = self.create_archive(local_file, vault_name, part_size, treehash_client)
      
            #
            # update s3 index which is a CSV file locally written until we’re finished
            # - batch mode when operating on a directory
            #
            if archive_id.eql?("invalid")
              @logger.info("invalid response, not writing: #{local_file},#{archive_id}")
            else
              @logger.debug("index file write: #{local_file},#{archive_id}")
              file.write("#{epoch_time},#{local_file},#{archive_id}\n")
            end

          else
            @logger.info("file already archived, skipping")
          end

        else

          # must be a directory or sym-link
          @logger.debug("#{local_file} not a file, not archiving")

        end

      end
...

...and then the function create_archive would be:

  def create_archive(filename, vault_name, part_size, treehash_client)

    archive_id = "invalid"

    self.create_or_verify_vault(vault_name)

    File.open(filename, 'rb') do |file|
      if file.size > part_size
        @logger.info("File size over #{part_size} bytes, using multipart upload...")

        mpu_create_response = @glacier_client.initiate_multipart_upload({
          account_id: @account_id,
          vault_name: vault_name,
          archive_description: filename,
          part_size: part_size,
        })

        total_parts = file.size.to_f / part_size
        current_part = 1
        range_start = 0
        range_end = 0

        file.each_part do |part|

          @logger.debug("part_size: #{part_size}")
          @logger.debug("part.size: #{part.size}")
          @logger.debug("range_start: #{range_start}")
          @logger.debug("range_end: #{range_end}")

          range_start = part_size * current_part - part_size
          if part.size < part_size
            range_end = range_start + part.size - 1
          else
            range_end = range_start + part_size - 1
          end
          range = "bytes #{range_start.to_s}-#{range_end.to_s}/*"

          @logger.debug("part range: #{range}")

          #
          # add some logic to loop over part uploads and handle timeouts somewhat gracefully
          #
          restart_upload = 1
          while restart_upload > 0
            begin
              part_response = @glacier_client.upload_multipart_part({
                account_id: @account_id,
                vault_name: vault_name,
                upload_id: mpu_create_response.upload_id,
                range: range,
                body: part,
              })
              restart_upload = 0
            rescue Aws::Glacier::Errors::RequestTimeoutException => rte
              @logger.info("Upload timed out, restarting upload.")
              restart_upload = 1
            end
          end
          percent_complete = (current_part.to_f / total_parts.to_f) * 100
          percent_complete = 100 if percent_complete > 100
          percent_complete = sprintf('%.2f', percent_complete.to_f)
          @logger.info("percent complete: #{percent_complete}")
          current_part = current_part + 1

        end # file.each_part do |part|

        checksum = treehash_client.calculate_tree_hash(filename)

        @logger.debug(YAML.dump(checksum))

        mpu_complete_resp = @glacier_client.complete_multipart_upload({
          account_id: @account_id,
          vault_name: vault_name,
          upload_id: mpu_create_response.upload_id,
          archive_size: file.size,
          checksum: checksum,
        })

        archive_id = mpu_complete_resp.archive_id

      elsif file.size > 0 # not multipart, single file upload, but at least greater than 0 size

        @logger.info("File size under #{part_size} bytes, not using multipart upload: #{filename}")

        begin

          upload_archive_resp = @glacier_client.upload_archive({
            vault_name: vault_name,
            account_id: @account_id,
            archive_description: filename,
            body: file,
          })

        rescue Aws::Glacier::Errors::ServiceError => e
          # exception handling here...omitted for brevity
          @logger.error("upload_archive error: #{e.message}")
        end

        archive_id = upload_archive_resp.archive_id

      else

        @logger.info("File size 0, not archiving: #{filename}")

      end # if file.size > part_size

      @logger.info("uploaded #{filename} to #{vault_name} with archive_id of #{archive_id}")

    end # File.open(filename, 'rb') do |file|

    return archive_id

  end

I’ve overridden the File class with:

class File
  def each_part(part_size=PART_SIZE)
    yield read(part_size) until eof?
  end
end

Let me know if a better format would be more appropriate, and all tips / pointers appreciated.

Large allocations can cause memory fragmentation and really bad
growth. You can also try setting MALLOC_MMAP_THRESHOLD_=131072
(or similar number) for glibc malloc or try building with
jemalloc; but avoiding slurping at all is best.

ref: https://t-t-travails.blogspot.com/2009/05/mr-malloc-gets-schooled.html

I will read through this, thanks.


Thanks a lot,

Josh Miller
ITSA Consulting, LLC
https://itsecureadmin.com/

I ran find(1) in my laptop on the root directory (~3M files):

    root@yeager:~ $ time find / -type f | wc -l
    find: /dev/fd/3: Not a directory
    find: /dev/fd/4: Not a directory
      2693493

    real 4m19.769s
    user 0m6.795s
    sys 2m4.046s

and htop(1) reported no significant memory usage.

If the tree is known not to have links, perhaps you could try a manual

traversal in which descending keeps a reference to the parent node to avoid
big stacks. Memory usage would be a function of tree depth, possibly not
a problem.

I believe this simple recursive method would be a way to do it like that
(where the "reference to parent" is just in the call stack):

    def process_directory(dir)
      Dir.foreach(dir) do |entry|
        next if entry == '.' || entry == '..'

        entry = File.join(dir, entry)

        if File.directory?(entry)
          process_directory(entry)
        elsif File.file?(entry)
          # upload
        end
      end
    rescue Errno::ENOTDIR
      puts $!
    end

    process_directory(ARGV[0])

Note it would loop indefinitely if links introduce cycles.

If I am not mistaken that would consume memory as a function of the depth
of the tree and not depend on the number of entries per node. This is so
because Dir.foreach is implemented using readdir(3), which is a C iterator.
Recursion keeps the iterators in the walked branches in place.

Just a followup out of curiosity: shelling out to find(1) seems like the
best option.

···

On Fri, Apr 15, 2016 at 8:07 PM, Xavier Noria <fxn@hashref.com> wrote:

          if old_index_reference["#{local_file}"] = 1

Missing '=' there or is that an intentional assignment?
If old_index_reference is an in-memory Hash, that would be
a problem for memory usage if you have many files.

class File
  def each_part(part_size=PART_SIZE)
    yield read(part_size) until eof?
  end
end

You can probably recycle the IO#read buffer instead of
allocating a new one with every call. That would reduce GC
invocations and likely reduce fragmentation:

  def each_part(part_size = PART_SIZE, buffer = String.new)
    yield read(part_size, buffer) until eof?
  end

If recycling the buffer corrupts data, there could be a problem
with something else (e.g. @glacier_client) holding onto a buffer
reference for longer than it needs to.
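A standalone sanity check of the recycling idea (a sketch with illustrative names, written as a plain method taking an IO so it runs by itself): parts read into a single recycled buffer concatenate back to the original bytes, because String#<< copies the yielded chunk:

```ruby
PART = 4096 # small part size, just for the check

# Same shape as the patch above, but as a plain method taking an IO
def each_part(io, part_size = PART, buffer = String.new)
  yield io.read(part_size, buffer) until io.eof?
end

File.open(__FILE__, 'rb') do |io|
  out = String.new
  each_part(io) { |part| out << part } # << copies the bytes, so reuse is safe
  raise 'buffer reuse corrupted data' unless out == File.binread(__FILE__)
end
```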

Let me know if a better format would be more appropriate, and
all tips / pointers appreciated.

See if you can reproduce the problem by isolating certain parts
of the process:

1) try skipping the uploads entirely but do all local processing
   (which might tell you old_index_reference is eating memory)

2) try a small directory tree with only a handful of big files.
   This could tell you if there's something holding onto the
   parts longer than it needs to.

3) try a giant directory with thousands/millions of tiny files
   (similar to 1)

...

And I'd definitely be reading the code behind @glacier_client,
treehash_client, and every other thing you use to make sure it's
not slurping or holding onto references which prevents GC from
reclaiming memory.
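A rough way to check for that kind of retention (a sketch; live_slots is a made-up helper): force a GC between passes and see whether the live-object count keeps climbing:

```ruby
# If the live-slot count keeps rising across passes even after a forced GC,
# something is holding references and preventing reclamation.
def live_slots
  GC.start
  GC.stat(:heap_live_slots)
end

baseline = live_slots
3.times do |i|
  100_000.times { 'temporary string' * 2 } # stand-in for one unit of work
  puts "after pass #{i}: #{live_slots - baseline} slots over baseline"
end
```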

···

Josh Miller <joshua@itsecureadmin.com> wrote:

I ran find(1) in my laptop on the root directory (~3M files):

    root@yeager:~ $ time find / -type f | wc -l
    find: /dev/fd/3: Not a directory
    find: /dev/fd/4: Not a directory
      2693493

    real 4m19.769s
    user 0m6.795s
    sys 2m4.046s

and htop(1) reported no significant memory usage.

I wonder what that might look like with 5TB+ of data? :)

Thanks,

Josh Miller
ITSA Consulting, LLC
skype: itsecureadmin
https://itsecureadmin.com/

···

On Apr 15, 2016, at 12:07 PM, Xavier Noria <fxn@hashref.com> wrote:


If the tree is known not to have links, perhaps you could try a manual traversal in which descending keeps a reference to the parent node to avoid big stacks. Memory usage would be a function of tree depth, possibly not a problem.

I believe this simple recursive method would be a way to do it like that (where the "reference to parent" is just in the call stack):

    def process_directory(dir)
      Dir.foreach(dir) do |entry|
        next if entry == '.' || entry == '..'

        entry = File.join(dir, entry)

        if File.directory?(entry)
          process_directory(entry)
        elsif File.file?(entry)
          # upload
        end
      end
    rescue Errno::ENOTDIR
      puts $!
    end

    process_directory(ARGV[0])

Wouldn’t this exhibit the same behavior as Find.find as the foreach iterates over the directory and potentially adds |entry| to the stack each time?

Thanks,
Josh

···

On Apr 15, 2016, at 1:53 PM, Xavier Noria <fxn@hashref.com> wrote:
On Fri, Apr 15, 2016 at 8:07 PM, Xavier Noria <fxn@hashref.com <mailto:fxn@hashref.com>> wrote:


Wouldn’t this exhibit the same behavior as Find.find as the foreach iterates over the directory and potentially adds |entry| to the stack each time?

Don't think so. If you do some tests you'll see in the foreach version
memory is under control (though much worse than find(1)'s), whereas in the
Find.find version memory grows.

Find.find has a stack to which it pushes all Dir.entries it finds when
traversing. The idea is (pseudocode):

    files = [path]
    while file = files.shift
      yield file
      files.concat(Dir.entries(file)) if file is a directory
    end

So, as you go down, that `files` array can grow quite a lot depending on
the tree. You shift one, but push many, and those many will be kept in
memory and increased until the recursion exhausts that branch. To get the
picture, when you reach the first leaf all your siblings and the siblings
of all your ancestors are in the stack waiting to be processed (mod
details, wrote this off the top of my head).
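That pseudocode can be fleshed out into runnable Ruby to watch the pending stack grow (a sketch; walk and max_pending are made-up names):

```ruby
# Runnable version of the pseudocode above. It tracks the high-water mark of
# the pending `files` array, which balloons on wide directories. Symlinked
# directories are skipped to avoid the cycle problem noted earlier.
def walk(path)
  files = [path]
  max_pending = 0
  while file = files.shift
    yield file
    if File.directory?(file) && !File.symlink?(file)
      entries = Dir.entries(file) - %w[. ..]
      files.concat(entries.map { |e| File.join(file, e) })
      max_pending = files.size if files.size > max_pending
    end
  end
  max_pending
end
```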

The C iterator readdir() on the other hand does not work like this. In C
you opendir(), which gives a pointer to a DIR structure that keeps some
information: in particular, the state of an iterator you can ask for the
next element (or NULL). With that approach you only keep one item per depth
level, so to speak; memory only holds the ancestors.

When Dir.foreach is called MRI creates such structure at C level and then
the code just yields whatever readdir() returns in a while loop. You see,
it is a different approach.

I suspect this is also why find(1)'s memory usage seems quite low.
That kind of thing is find(1)'s job, so shelling out is what I'd do in an
extreme case like yours.

···

On Fri, Apr 15, 2016 at 11:02 PM, Josh Miller <joshua@itsecureadmin.com> wrote:

         if old_index_reference["#{local_file}"] = 1

Missing '=' there or is that an intentional assignment?
If old_index_reference is an in-memory Hash, that would be
a problem for memory usage if you have many files.

That was not intentional, thanks for pointing that out.

See if you can reproduce the problem by isolating certain parts
of the process:

1) try skipping the uploads entirely but do all local processing
  (which might tell you old_index_reference is eating memory)

2) try a small directory tree with only a handful of big files.
  This could tell you if there's something holding onto the
  parts longer than it needs to.

3) try a giant directory with thousands/millions of tiny files
  (similar to 1)

I took your advice here and started disabling large chunks of the process and was able to isolate the issue to the logger config that I had established outside the Find.find (bad assumption on my part there).

logger = Logger.new(CONFIG['system']['log_file'], 7, 1073741824)

I changed the log size value to 10MB and the memory issue has disappeared. This was using logger (1.2.8).

Thanks again, Eric and Xavier.

- Josh

···

On Apr 15, 2016, at 4:07 PM, Eric Wong <normalperson@yhbt.net> wrote:
Josh Miller <joshua@itsecureadmin.com> wrote:

No problem. Now I'm curious what CONFIG['system']['log_file']
is. I use logger from the Ruby stdlib (1.2.7) but never noticed
memory problems. I didn't even know there's a 1.2.8, but it
would be good to get this fixed if it's indeed a problem with
logger.

The following exhibits stable memory usage below 10M on my
64-bit system:

  require 'logger'
  logger = Logger.new('test.log', 7, 1073741824)
  str = '-' * 79
  15000000.times do
    logger.debug(str)
  end

···


No problem. Now I'm curious what CONFIG['system']['log_file']
is. I use logger from the Ruby stdlib (1.2.7) but never noticed
memory problems. I didn't even know there's a 1.2.8, but it
would be good to get this fixed if it's indeed a problem with
logger.

CONFIG['system']['log_file'] = 'log/glacier.log'

The following exhibits stable memory usage below 10M on my
64-bit system:

  require 'logger'
  logger = Logger.new('test.log', 7, 1073741824)
  str = '-' * 79
  15000000.times do
    logger.debug(str)
  end

I ran that as well without issue.

I wonder if the issue was that the process was running for multiple days and filling all 7 of the 1GB log files. I’ll do some testing in that area and see what I might find.
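One cheap way to test that (a sketch with made-up sizes): shrink the rotation threshold so all seven rotated files fill in seconds rather than days, then watch the process's memory:

```ruby
require 'logger'

# Tiny rotation size (64 KiB) so all 7 aged files (.0 through .6) fill almost
# immediately, simulating days of logging against the 1 GiB config in seconds.
logger = Logger.new('rotate_test.log', 7, 64 * 1024)
line = '-' * 79
50_000.times { logger.debug(line) }
logger.close
```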

Thanks,

Josh Miller
ITSA Consulting, LLC
skype: itsecureadmin
https://itsecureadmin.com/

···

On Apr 16, 2016, at 1:32 PM, Eric Wong <normalperson@yhbt.net> wrote:
Josh Miller <joshua@itsecureadmin.com> wrote: