Moving files matching Regex

Mark9 · 25 November 2007 10:00

Hi all,

A question relating to the code below.

What I'm trying to do is:
Files named CAL221107.xls, NCPH141202.xls or GOH333333.xls for example
in the current directory should be appended to an array (just the file
name, not the full path) and moved to ./sent/

It seems to be working, but is there a more Rubyish way of doing it?

···

-------------------------------------------------------------------------
require 'find'
require 'fileutils'

a=[]
fglob = Regexp.new('(CAL|NCPH|GOH)\d{6}\.xls')

Find.find('./') do |path|
  curr_file = File.basename(path)
  if curr_file =~ /#{fglob}/
    a << curr_file
    new_file = "./sent/#{curr_file}"
    FileUtils.mv(curr_file, new_file) if not File.exists?(new_file)
  end
end

puts a

-------------------------------------------------------------------------

thanks,

--
Mark

Stefano_Crocco · 25 November 2007 10:30

Alle domenica 25 novembre 2007, Mark Woodward ha scritto:

Hi all,

A question relating to the code below.

What I'm trying to do is:
Files named CAL221107.xls, NCPH141202.xls or GOH333333.xls for example
in the current directory should be appended to an array (just the file
name, not the full path) and moved to ./sent/

It seems to be working, but is there a more Rubyish way of doing it?

-------------------------------------------------------------------------
require 'find'
require 'fileutils'

a=
fglob = Regexp.new('(CAL|NCPH|GOH)\d{6}\.xls')

Find.find('./') do |path|
  curr_file = File.basename(path)
  if curr_file =~ /#{fglob}/
    a << curr_file
    new_file = "./sent/#{curr_file}"
    FileUtils.mv(curr_file, new_file) if not File.exists?(new_file)
  end
end

puts a

-------------------------------------------------------------------------

thanks,

Not very different:

require 'find'
require 'fileutils'

a=
fglob = /(CAL|NCPH|GOH)\d{6}\.xls/

Find.find('.') do |f|
  name = File.basename f
  if name=~ fglob
    a << name
    FileUtils.mv f, "sent/#{name}" unless File.exist? "sent/#{name}"
  end
end

A couple of notes:
* You can create a regexp using the // syntax, instead of using Regexp.new
* when you call String#=~ you don't need to use /#{...}/, if you already have
a regexp
* I think you need to pass path, and not curr_file, as the first argument of
FileUtils.mv, otherwise it won't work if the file is in a subdirectory.

Stefano

Mark9 · 26 November 2007 10:45

On Sun, 25 Nov 2007 20:59:46 +1100

···

Mark Woodward <markonlinux@internode.on.net> wrote:

thanks to all replies. Very informative.
I've benchmarked this and 'get_file_names2' is considerably quicker!
(if I'm benchmarking correctly?)

Could someone explain why? ie what is it about get_file_names2 that
makes it so much quicker? What is it about get_file_names that slows it
down?

def get_file_names
  fn=[]
  Find.find('./') do |path|
    Find.prune if File.basename(path) == 'sent'
    curr_file = File.basename(path)
    if curr_file =~ FILE_REGEX
      fn << curr_file
    end
  end
  fn
end

def get_file_names2
  fglob= %r{^(CAL|NCPH|GOH)\d{6}\.xls$}x
  all_files = Dir.glob("*")
  my_files = all_files.grep(fglob)
end

Benchmark.bm(5) do |timer|
timer.report('get_file_names') {10000.times {get_file_names}}
timer.report('get_file_names2') {10000.times {get_file_names2}}
end

Results:

user system total real
get_file_names 3.040000 0.700000 3.740000 (5.440394)
get_file_names2 0.570000 0.160000 0.730000 (0.997355)

cheers,

--
Mark

Mark9 · 25 November 2007 11:10

Hi Stefano,

Not very different:

require 'find'
require 'fileutils'

a=
fglob = /(CAL|NCPH|GOH)\d{6}\.xls/

Find.find('.') do |f|
  name = File.basename f
  if name=~ fglob
    a << name
    FileUtils.mv f, "sent/#{name}" unless File.exist? "sent/#{name}"
  end
end

A couple of notes:
* You can create a regexp using the // syntax, instead of using
Regexp.new

that's better, thanks

* when you call String#=~ you don't need to use /#{...}/, if you
already have a regexp

another good thing to know.

* I think you need to pass path, and not curr_file, as the first
argument of FileUtils.mv, otherwise it won't work if the file is in a
subdirectory.

There won't be any sub directories except for /sent.
I've discovered from the cookbook I can use prune to ignore it.
Below's where I'm up to.

Stefano

appreciate your help,
cheers,

···

On Sun, 25 Nov 2007 05:30:10 -0500 Stefano Crocco <stefano.crocco@alice.it> wrote:

--
Mark

---------------------------------------------------------------------------
require 'find'
require 'fileutils'

FILE_REGEX = /(CAL|NCPH|GOH)\d{6}\.xls/
SENT_DIR = "./sent/"

def get_file_names
  fn=
  Find.find('./') do |path|
    Find.prune if File.basename(path) == 'sent'
    curr_file = File.basename(path)
    if curr_file =~ FILE_REGEX
      fn << curr_file
    end
  end
  fn
end

def move_files
  @a.each do |fn|
    new_loc = SENT_DIR + fn
    FileUtils.mv(fn, new_loc) if not File.exists?(new_loc)
  end
end

@a = get_file_names
move_files
puts @a

Robert_K1 · 26 November 2007 12:35

On Sun, 25 Nov 2007 20:59:46 +1100

thanks to all replies. Very informative.
I've benchmarked this and 'get_file_names2' is considerably quicker!
(if I'm benchmarking correctly?)

Could someone explain why? ie what is it about get_file_names2 that
makes it so much quicker? What is it about get_file_names that slows it
down?

In _2 you do a dir glob in *one* directory only while find recursively
descends a *directory tree*. Even though you prune in some cases it
comes as no surprise that this is slower.

def get_file_names
  fn=
  Find.find('./') do |path|
    Find.prune if File.basename(path) == 'sent'
    curr_file = File.basename(path)
    if curr_file =~ FILE_REGEX
      fn << curr_file
    end
  end
  fn
end

def get_file_names2
  fglob= %r{^(CAL|NCPH|GOH)\d{6}\.xls$}x
  all_files = Dir.glob("*")
  my_files = all_files.grep(fglob)
end

Benchmark.bm(5) do |timer|
  timer.report('get_file_names') {10000.times {get_file_names}}
  timer.report('get_file_names2') {10000.times {get_file_names2}}
end

Results:

                 user system total real
get_file_names 3.040000 0.700000 3.740000 (5.440394)
get_file_names2 0.570000 0.160000 0.730000 (0.997355)

Kind regards

robert

···

2007/11/26, Mark Woodward <markonlinux@internode.on.net>:

Mark Woodward <markonlinux@internode.on.net> wrote:

--
use.inject do |as, often| as.you_can - without end

Raul_Parolari · 25 November 2007 11:48

Mark Woodward wrote:

There won't be any sub directories except for /sent.

I add this to the advice from Stefano:

1) if you just want to find the files in the current directory, then we
do not need to use Find (which, as Stefano pointed out, could find files
in subdirectories that you may not foresee now..); and we can collect
the files in one shot with grep.

2) for precaution, it is better to match the complete file name (just in
case there may be other files with that name + an extension, eg like
.saved, etc).

3) we could place the method that moves the files into class Array,
rather than leaving it in the toplevel. (Consider a param for the
destination dir, unless you are sure that it will be always a Constant).

class Array
def move_to_subdir
   self.each do |fn|
     new_loc = SENT_DIR + fn
     FileUtils.mv(fn, new_loc) unless File.exists?(new_loc)
   end
end
end

# the main is now 4 lines:

fglob= %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x

all_files = Dir.glob("*")
my_files = all_files.grep(fglob)

my_files.move_to_subdir

Regards,

Raul

···

--
Posted via http://www.ruby-forum.com/\.

Raul_Parolari · 26 November 2007 19:39

Mark Woodward wrote:

I've benchmarked this and 'get_file_names2' is considerably quicker!

Could someone explain why? ie what is it about get_file_names2 that
makes it so much quicker? What is it about get_file_names that slows it
down?

Robert Klemme wrote:

In _2 you do a dir glob in *one* directory only while find recursively
descends a *directory tree*. Even though you prune in some cases it
comes as no surprise that this is slower.

Actually, I think that most of the gain in the final solution comes from
this:
a) in the original solution, the result array is built inserting
manually each file that matches the pattern in the array..
b) in the final solution (get_file_names2) not only Find is replaced by
Dir.glob, but the result array is built by grep.

This may sound surprising, but map (collect) and grep are highly
optimized C mechanisms; moreover they can do smart guesses on the size
of the result array (in particular, grep can inmediately conclude that
the size of the result array will be less than the array given to him).
In the manual solution instead, every item added to the array probably
requires some memory reallocation.

Let's verify if this is true, benchmarking 3 variants:

0) get_file_names : Find + manual array build
1) get_file_names1: Dir.glob + manual array build
2) get_file_names2: Dir.glob + grep

# uses Find + manual array build
def get_file_names
  fn=
  Find.find('./') do |path|
    Find.prune if File.basename(path) == 'sent'
    curr_file = File.basename(path)
    if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
      fn << curr_file
    end
  end
  fn
end

# uses Dir.glob + manual array build
def get_file_names1
  all_files = Dir.glob("*")
  my_files.each do |path|
    curr_file = File.basename(path)
    if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
      fn << curr_file
    end
  end
  fn
end

# uses Dir.glob + grep
def get_file_names2
all_files = Dir.glob("*")
my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x)
end

# with 10 files matching the pattern
Benchmark.bm(5) do |timer|
  timer.report('get_file_names') {10000.times {get_file_names}}
  timer.report('get_file_names1') {10000.times {get_file_names}}
  timer.report('get_file_names2') {10000.times {get_file_names2}}
end

user system total real
get_file_names 4.680000 3.870000 8.550000 ( 8.567400)
get_file_names1 4.670000 3.870000 8.540000 ( 8.564753)
get_file_names2 0.690000 0.780000 1.470000 ( 1.470924)

The 580% improvement in the final solution comes almost exclusively from
the grep.

···

---
I add this for Mark:
it is best in any case to avoid the unnecessary 'Find', for several
reasons; the first is that what you need is the simple Dir.glob.

But let me cite a more insidious problem in that code: the idea of
pruning explicitly 'sent'. Even if you think that the only subdirectory
present will be 'sent', what happens if 8 months from now another person
(or yourself, after some more thousands lines of code!) needs to add
another subdir with files matching that pattern (perhaps just to save a
version of them)? will he remember the assumption done months before?

Moreover, as Stefano pointed out at the beginning of the thread, this
would cause subtle file corruptions, as the code is not set up to deal
with the path correctly (a testing nightmare would follow, until
somebody remembers the 'assumption'; some people have seen this hundreds
of times..:-).

So, the best is not to code 'assumptions' in the code; and of course to
use the right tools for the problem at hand: in this case, Dir.glob and
the beautiful grep

Regards

Raul

--
Posted via http://www.ruby-forum.com/\.

Robert_K1 · 25 November 2007 16:24

I would not do that because it is not really an Array operation. Having a top level method (aka function) is better IMHO.

Kind regards

robert

···

On 25.11.2007 12:48, Raul Parolari wrote:

3) we could place the method that moves the files into class Array, rather than leaving it in the toplevel. (Consider a param for the destination dir, unless you are sure that it will be always a Constant).

Robert_K1 · 26 November 2007 20:35

Mark Woodward wrote:

I've benchmarked this and 'get_file_names2' is considerably quicker!

Could someone explain why? ie what is it about get_file_names2 that
makes it so much quicker? What is it about get_file_names that slows it
down?

Robert Klemme wrote:

In _2 you do a dir glob in *one* directory only while find recursively
descends a *directory tree*. Even though you prune in some cases it
comes as no surprise that this is slower.

Actually, I think that most of the gain in the final solution comes from this:
a) in the original solution, the result array is built inserting manually each file that matches the pattern in the array..
b) in the final solution (get_file_names2) not only Find is replaced by Dir.glob, but the result array is built by grep.

This may sound surprising, but map (collect) and grep are highly optimized C mechanisms; moreover they can do smart guesses on the size of the result array (in particular, grep can inmediately conclude that the size of the result array will be less than the array given to him). In the manual solution instead, every item added to the array probably requires some memory reallocation.

Let's verify if this is true, benchmarking 3 variants:

0) get_file_names : Find + manual array build
1) get_file_names1: Dir.glob + manual array build
2) get_file_names2: Dir.glob + grep

# uses Find + manual array build
def get_file_names
  fn=
  Find.find('./') do |path|
    Find.prune if File.basename(path) == 'sent'
    curr_file = File.basename(path)
    if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
      fn << curr_file
    end
  end
  fn
end

# uses Dir.glob + manual array build
def get_file_names1
  all_files = Dir.glob("*")
  my_files.each do |path|
    curr_file = File.basename(path)
    if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
      fn << curr_file
    end
  end
  fn
end

# uses Dir.glob + grep
def get_file_names2
  all_files = Dir.glob("*")
  my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x)
end

What about this one?

def get_file_names_3
Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]
end

# with 10 files matching the pattern
Benchmark.bm(5) do |timer|
  timer.report('get_file_names') {10000.times {get_file_names}}
  timer.report('get_file_names1') {10000.times {get_file_names}}
  timer.report('get_file_names2') {10000.times {get_file_names2}}
end

                 user system total real
get_file_names 4.680000 3.870000 8.550000 ( 8.567400)
get_file_names1 4.670000 3.870000 8.540000 ( 8.564753)
get_file_names2 0.690000 0.780000 1.470000 ( 1.470924)

The 580% improvement in the final solution comes almost exclusively from the grep.

---
I add this for Mark:
it is best in any case to avoid the unnecessary 'Find', for several reasons; the first is that what you need is the simple Dir.glob.

But let me cite a more insidious problem in that code: the idea of pruning explicitly 'sent'. Even if you think that the only subdirectory present will be 'sent', what happens if 8 months from now another person (or yourself, after some more thousands lines of code!) needs to add another subdir with files matching that pattern (perhaps just to save a version of them)? will he remember the assumption done months before?

Moreover, as Stefano pointed out at the beginning of the thread, this would cause subtle file corruptions, as the code is not set up to deal with the path correctly (a testing nightmare would follow, until somebody remembers the 'assumption'; some people have seen this hundreds of times..:-).

So, the best is not to code 'assumptions' in the code; and of course to use the right tools for the problem at hand: in this case, Dir.glob and the beautiful grep

Good point! That's basically the same reason why it's better to explicitely open ports in firewall configs than to close "dangerous" ports.

Kind regards

robert

···

On 26.11.2007 20:39, Raul Parolari wrote:

Raul_Parolari · 25 November 2007 22:46

Robert Klemme wrote:

I would not do that because it is not really an Array operation.

You are right; it does not belong in class Array.

Having a top level method (aka function) is better IMHO.

Here I use the occasion to share a perplexity I have, when I look at
Ruby programs, and I sometimes see a myriad of top-level methods. If I
ask the reason, the answer is a vague "they do not really fit well
anywhere", which leaves me a bit disconcerted (especially when the
script is part of a project that may grow, not something written to be
used once).

In the case discussed, I would place this method in a module (that would
just need to be required):

module MyUtils
  def self.mv_files_to_subdir(files, subdir)
    files.each do |curr_file|
      ..
    end
  end
end

or, as this is in fact a file operation, I would consider to reopen
FileUtils and add the method:

···

#--
module FileUtils
  def self.mv_files_to_subdir(files, subdir)
    files.each do |curr_file|
      new_file = "./#{subdir}/#{curr_file}"
      FileUtils.mv(curr_file, new_file) unless File.exists?(new_file)
    end
  end
end

fglob= %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
all_files = Dir.glob("*")
my_files = all_files.grep(fglob)

FileUtils.mv_files_to_subdir(my_files, 'sent')
#-------

It is a small point (some may say 'but what's the difference with a
top-level method?'), but let me know your opinion, if you have a chance

Regards,

Raul
--
Posted via http://www.ruby-forum.com/\.

Raul_Parolari · 26 November 2007 23:17

Robert Klemme wrote:

What about this one?

def get_file_names_3
Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]
end

Robert

I had not tested case 1 in the benchmark (I dropped digit '1' in the
call); it turns out that it is the Find, as you had thought, to be the
one that causes most inefficiency; without repeating all the code, these
are the results, including also your last suggestion.

# uses Find, and selects files manually
def get_file_names
  fn=
  ..
  fn
end

# 1) uses Dir.glob and builds array with a loop
def get_file_names1
  fn=
  all_files = Dir.glob("*")
  ..
  fn
end

# 2) uses Dir.glob and grep
def get_file_names2
all_files = Dir.glob("*")
my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x)
end

# 3) variation of solution 2
def get_file_names3
Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]
end

# with 30 files
Benchmark.bm(5) do |timer|
  timer.report('get_file_names') {10_000.times {get_file_names} }
  timer.report('get_file_names1') {10_000.times {get_file_names1} }
  timer.report('get_file_names2') {10_000.times {get_file_names2} }
  timer.report('get_file_names3') {10_000.times {get_file_names3} }
end

user system total real
get_file_names 14.640000 9.080000 23.720000 ( 23.778029)
get_file_names1 1.690000 1.200000 2.890000 ( 2.903737)
get_file_names2 1.370000 1.210000 2.580000 ( 2.581539)
get_file_names3 1.430000 3.530000 4.960000 ( 4.968951)

Solution 2) is, as we saw before, the winner: 10 times faster than the
original solution. But the grep only improves things by 10%; 90% of the
contribution comes from removing Find (as Robert had guessed)!

Regarding the last one (called solution 3) from you:

Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]

This turned out to be 2 times slower than solution 2. Checking if
something is between 0-9 is apparently quite slower than checking for
'digit'.

In conclusion:
Solution 2 is the fastest; but the reason is not the grep (as I had
theorized), which accounts only for 10% of the improvement; the other
90% comes from removing Find, as Robert had guessed.

Last consideration: increasing the number of files, the weight of 'grep'
in the improvement increases (but, enough of benchmarks for to-day :-).

Regards

Raul

···

--
Posted via http://www.ruby-forum.com/\.

Robert_K1 · 26 November 2007 07:34

Definitively the right way to go if there is a chance that the code is
reused and might create namespace collisions in a larger project. If
if is just used in a simple script I would stick with the top level
method defined inside the script because distributing the code would
be unnecessarily complex in that case. One can refactor later if
needed.

Kind regards

robert

···

2007/11/25, Raul Parolari <raulparolari@gmail.com>:

Robert Klemme wrote:

> I would not do that because it is not really an Array operation.

You are right; it does not belong in class Array.

> Having a top level method (aka function) is better IMHO.

Here I use the occasion to share a perplexity I have, when I look at
Ruby programs, and I sometimes see a myriad of top-level methods. If I
ask the reason, the answer is a vague "they do not really fit well
anywhere", which leaves me a bit disconcerted (especially when the
script is part of a project that may grow, not something written to be
used once).

In the case discussed, I would place this method in a module (that would
just need to be required):

module MyUtils
  def self.mv_files_to_subdir(files, subdir)
    files.each do |curr_file|
      ..
    end
  end
end

or, as this is in fact a file operation, I would consider to reopen
FileUtils and add the method:

#--
module FileUtils
  def self.mv_files_to_subdir(files, subdir)
    files.each do |curr_file|
      new_file = "./#{subdir}/#{curr_file}"
      FileUtils.mv(curr_file, new_file) unless File.exists?(new_file)
    end
  end
end

fglob= %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
all_files = Dir.glob("*")
my_files = all_files.grep(fglob)

FileUtils.mv_files_to_subdir(my_files, 'sent')
#-------

It is a small point (some may say 'but what's the difference with a
top-level method?'), but let me know your opinion, if you have a chance

--
use.inject do |as, often| as.you_can - without end

Robert_K1 · 27 November 2007 07:42

Robert Klemme wrote:

> What about this one?
>
> def get_file_names_3
> Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]
> end

Robert

  I had not tested case 1 in the benchmark (I dropped digit '1' in the
call); it turns out that it is the Find, as you had thought, to be the
one that causes most inefficiency; without repeating all the code, these
are the results, including also your last suggestion.

# uses Find, and selects files manually
def get_file_names
  fn=
  ..
  fn
end

# 1) uses Dir.glob and builds array with a loop
def get_file_names1
  fn=
  all_files = Dir.glob("*")
  ..
  fn
end

# 2) uses Dir.glob and grep
def get_file_names2
  all_files = Dir.glob("*")
  my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x)
end

We completely forgot about Dir.entries()... That might be a tad
faster because supposedly no globbing is going on.

# 3) variation of solution 2
def get_file_names3
   Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]
end

# with 30 files
Benchmark.bm(5) do |timer|
  timer.report('get_file_names') {10_000.times {get_file_names} }
  timer.report('get_file_names1') {10_000.times {get_file_names1} }
  timer.report('get_file_names2') {10_000.times {get_file_names2} }
  timer.report('get_file_names3') {10_000.times {get_file_names3} }
end

               user system total real
get_file_names 14.640000 9.080000 23.720000 ( 23.778029)
get_file_names1 1.690000 1.200000 2.890000 ( 2.903737)
get_file_names2 1.370000 1.210000 2.580000 ( 2.581539)
get_file_names3 1.430000 3.530000 4.960000 ( 4.968951)

Solution 2) is, as we saw before, the winner: 10 times faster than the
original solution. But the grep only improves things by 10%; 90% of the
contribution comes from removing Find (as Robert had guessed)!

Regarding the last one (called solution 3) from you:
> Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]

This turned out to be 2 times slower than solution 2. Checking if
something is between 0-9 is apparently quite slower than checking for
'digit'.

In conclusion:
Solution 2 is the fastest; but the reason is not the grep (as I had
theorized), which accounts only for 10% of the improvement; the other
90% comes from removing Find, as Robert had guessed.

It's an old truth: access to external memory is always an order of
magnitude slower than access to main memory. Ergo, if you can reduce
the number of accesses to that memory you typically achieve big
improvements.

Last consideration: increasing the number of files, the weight of 'grep'
in the improvement increases (but, enough of benchmarks for to-day :-).

Thanks for taking the time to go through this.

Kind regards

robert

···

2007/11/27, Raul Parolari <raulparolari@gmail.com>:

--
use.inject do |as, often| as.you_can - without end

Raul_Parolari · 27 November 2007 08:39

Robert Klemme wrote:

all_files = Dir.glob("*")
my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x)
end

We completely forgot about Dir.entries()... That might be a tad
faster because supposedly no globbing is going on.

The solution with Dir.entries is a bit faster:

def get_file_names5
all_files = Dir.entries(".")
my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x)
end

# 30 target files present
Benchmark.bm(5) do |timer|
  timer.report('get_file_names2') {10_000.times {get_file_names2} }
  timer.report('get_file_names5') {10_000.times {get_file_names5} }
end
                 user system total real
get_file_names2 1.410000 1.260000 2.670000 ( 2.675148)
get_file_names5 1.190000 1.260000 2.450000 ( 2.450649)

A 10% gain; not much, but the final winner of the marathon is: Nr 5!

Raul

···

--
Posted via http://www.ruby-forum.com/\.

Jordan_Callicoat · 27 November 2007 08:40

Since all the matches are xsl files, the function can be faster using
a glob for "*.xsl". Dir.entries is slower.

$ cat test.rb && ./test.rb
#!/usr/bin/env ruby

require 'find'
require 'fileutils'
require 'benchmark'

def get_file_names
  Dir.glob("*.xsl").select { | e |
    e =~ /^(CAL|NCPH|GOH)\d{6}\.xsl$/
  }
end

def get_file_names1
Dir.glob("*.xsl").grep(/^(CAL|NCPH|GOH)\d{6}\.xsl$/)
end

def get_file_names2
  Dir.entries('.').select { | e |
    e =~ /^(CAL|NCPH|GOH)\d{6}\.xsl$/
  }
end

n = 10000
Benchmark.bm(5) { |x|
  x.report('get_file_names ') { n.times { get_file_names } }
  x.report('get_file_names1') { n.times { get_file_names1 } }
  x.report('get_file_names1') { n.times { get_file_names2 } }
}

user system total real
get_file_names 0.210000 0.330000 0.540000 ( 0.566366)
get_file_names1 0.270000 0.310000 0.580000 ( 0.602565)
get_file_names1 0.680000 0.310000 0.990000 ( 1.114583)

Regards,
Jordan

Mark9 · 27 November 2007 09:55

Excellent stuff everyone,
Gives me much to look into!

thanks,

···

On Tue, 27 Nov 2007 03:39:05 -0500 Raul Parolari <raulparolari@gmail.com> wrote:

Robert Klemme wrote:

>> all_files = Dir.glob("*")
>> my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x)
>> end
>
> We completely forgot about Dir.entries()... That might be a tad
> faster because supposedly no globbing is going on.
>

The solution with Dir.entries is a bit faster:

def get_file_names5
   all_files = Dir.entries(".")
   my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x)
end

# 30 target files present
Benchmark.bm(5) do |timer|
  timer.report('get_file_names2') {10_000.times {get_file_names2} }
  timer.report('get_file_names5') {10_000.times {get_file_names5} }
end
                 user system total real
get_file_names2 1.410000 1.260000 2.670000 ( 2.675148)
get_file_names5 1.190000 1.260000 2.450000 ( 2.450649)

A 10% gain; not much, but the final winner of the marathon is: Nr 5!

Raul

--
Mark

Topic		Replies	Views
Find from array and move ruby-talk	0	137	17 March 2012
Find from array and move ruby-talk	0	128	18 March 2012
Reg. Expressions with file directories ruby-talk	2	132	3 October 2008
Finding files with regular expressions ruby-talk	8	168	2 October 2007
Help with moving files ruby-talk	5	70	3 June 2007

Moving files matching Regex

Related topics