Number of lines in a text file

If I want the number of lines of the text file <file>, I may use

  > File.readlines(<file>).size

but this builds a useless extra Array, or

  > %x(wc -l <file>).to_i

but this needs to be on a *nix system (or have a system command wc.exe
on Windows). Or else a File.read followed by a grep for '\n'...

I feel there should be a simpler way to do that...
_md

--
Posted via http://www.ruby-forum.com/.

Have you looked at Enumerable's count method?

mike$ wc -l /etc/passwd
      83 /etc/passwd
mike$ ruby -e "puts File.open('/etc/passwd') { |f| f.count }"
83

Hope this helps,

Mike

On 2013-10-19, at 10:02 AM, Michel Demazure <lists@ruby-forum.com> wrote:

--

Mike Stok <mike@stok.ca>
http://www.stok.ca/~mike/

The "`Stok' disclaimers" apply.
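
One caveat on the comparison with wc -l above: wc -l counts newline characters, while counting via each (which is what Enumerable#count does on an open file) counts the lines yielded, so a last line without a trailing newline is counted by Ruby but not by wc. A minimal sketch of the difference, where the helper names count_lines and count_newlines are assumptions made for illustration and do not come from the thread:

```ruby
require 'tempfile'

# Hypothetical helper (name is an assumption): counts the lines yielded
# by the open file, via Enumerable#count.
def count_lines(path)
  File.open(path) { |f| f.count }
end

# Hypothetical helper: counts newline characters, the quantity `wc -l` reports.
def count_newlines(path)
  File.read(path).count("\n")
end

Tempfile.create('lines') do |tmp|
  tmp.write("a\nb\nc")   # the last line has no trailing newline
  tmp.flush              # push buffered data to disk before reading by path
  puts count_lines(tmp.path)     # 3
  puts count_newlines(tmp.path)  # 2
end
```

For files that end with a newline, such as /etc/passwd in the session above, the two numbers agree.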

lines = File.foreach(file).count

Kind regards

robert

On Sat, Oct 19, 2013 at 4:02 PM, Michel Demazure <lists@ruby-forum.com> wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
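
For readers wondering why the one-liner above works: File.foreach called without a block returns an Enumerator, so count pulls the lines one at a time and can also take a block to count only some of them. A small sketch under that assumption; the helper names line_count and non_blank_line_count are hypothetical and not from the thread:

```ruby
require 'tempfile'

# Hypothetical helper: total number of lines, read one at a time.
def line_count(path)
  File.foreach(path).count
end

# Hypothetical helper: same iteration, but only lines with visible content.
def non_blank_line_count(path)
  File.foreach(path).count { |line| !line.strip.empty? }
end

Tempfile.create('foreach') do |tmp|
  tmp.write("one\n\ntwo\n")
  tmp.flush                          # make the data visible to a second reader
  p File.foreach(tmp.path).class     # Enumerator
  p line_count(tmp.path)             # 3
  p non_blank_line_count(tmp.path)   # 2
end
```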

Robert Klemme wrote in post #1124923:

lines = File.foreach(file).count

Thanks, Robert, using 'foreach' is cleaner.

FWIW, I benchmarked. The File methods are equivalent and much faster.

  require 'benchmark'
  file = __FILE__
  n = 10000
  Benchmark.bm do |rep|
    rep.report("readlines") { n.times { File.readlines(file).size } }
    rep.report("wc -l ") { n.times { `wc -l #{file}`.to_i } }
    rep.report("foreach ") { n.times { File.foreach(file).count } }
  end

gives

                  user     system      total        real
   readlines  0.219000   0.499000   0.718000 (  0.752043)
   wc -l      2.542000   5.257000   7.799000 ( 83.502776)
   foreach    0.219000   0.531000   0.750000 (  0.761044)

_md


Robert Klemme wrote in post #1124923:

lines = File.foreach(file).count

Thanks, Robert, using 'foreach' is cleaner.

Yes, and it avoids building an Array for the whole file in memory.

FWIW, I benchmarked. The File methods are equivalent and much faster.

Naturally, since they avoid the overhead of forking and IPC.

It would be interesting to see how that works out for a large file. I
would expect the last version (foreach) to be more efficient than the
first one (readlines).

Kind regards

robert

On Sun, Oct 20, 2013 at 10:37 AM, Michel Demazure <lists@ruby-forum.com> wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Robert Klemme wrote in post #1124958:

It would be interesting to see how that works out for a large file. I
would expect the last version (foreach) to be more efficient than the
first one (readlines).

I would guess so. But the benchmark below shows the same pattern:
readlines is a bit faster.

require 'benchmark'

# Build a 3_000-line test file next to this script.
file = File.join(File.dirname(__FILE__), 'test.txt')
File.open(file, 'w') do |f|   # block variable renamed so it no longer shadows `file`
  3000.times { f.puts 'bla' * 10 }
end

n = 10000
Benchmark.bm do |rep|
  rep.report("readlines") { n.times { File.readlines(file).size } }
  rep.report("foreach ") { n.times { File.foreach(file).count } }
end

               user     system      total        real
readlines 11.341000   1.217000  12.558000 ( 12.686726)
foreach   12.433000   1.264000  13.697000 ( 13.871793)


Michel Demazure wrote in post #1124962:

               user     system      total        real
readlines 11.341000   1.217000  12.558000 ( 12.686726)
foreach   12.433000   1.264000  13.697000 ( 13.871793)

With 300_000 lines and 100 iterations, instead of 3_000 lines and
10_000 iterations, one gets the same pattern:

               user     system      total        real
readlines 11.622000   1.060000  12.682000 ( 12.692726)
foreach   12.246000   0.858000  13.104000 ( 13.156753)

but the difference is smaller...

_md


$ ruby x.rb
               user     system      total        real
readlines 56.831000   7.597000  64.428000 ( 64.241000)
foreach   50.357000   5.476000  55.833000 ( 56.153000)
$ cat x.rb

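require 'tempfile'
require 'benchmark'

LINE = 'x' * 99
n = 100

# Tempfile.open takes a basename, not a directory; the file is created in
# Dir.tmpdir, which already honours TMPDIR, TMP and TEMP.
# 'linecount' is an arbitrary basename.
Tempfile.open('linecount') do |tmp|
  1_000_000.times { tmp.puts LINE }
  tmp.flush  # make sure buffered lines are on disk before reading by path

  Benchmark.bm do |rep|
    rep.report("readlines") { n.times { File.readlines(tmp.path).size } }
    rep.report("foreach ") { n.times { File.foreach(tmp.path).count } }
  end

end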

So with even larger files the difference shows. :-)

Kind regards

robert

On Sun, Oct 20, 2013 at 5:14 PM, Michel Demazure <lists@ruby-forum.com> wrote:

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

What about space? That's also a huge consideration here, isn't it? foreach should win that by lots and lots, too.

On Oct 20, 2013, at 4:13 PM, Robert Klemme <shortcutter@googlemail.com> wrote:

tamouse m. wrote in post #1124992:

What about space? That's also a huge consideration here, isn't it?
foreach should win that by lots and lots, too.

Sure.
_md
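
On the space question: readlines keeps every line of the file in one Array until size has been called, whereas foreach only needs the current line. If even one String per line is more allocation than wanted, a different technique (not proposed in the thread) is to read fixed-size chunks into a reused buffer and count newline characters. A rough sketch, assuming a hypothetical helper name count_newlines_chunked and an arbitrary 64 KiB chunk size:

```ruby
require 'tempfile'

# Hypothetical helper, not from the thread: counts newline characters
# (the quantity `wc -l` reports) while holding only one chunk in memory.
def count_newlines_chunked(path, chunk_size = 64 * 1024)
  count = 0
  buffer = String.new
  File.open(path, 'rb') do |f|
    # IO#read(length, outbuf) refills the same buffer and returns nil at EOF.
    while f.read(chunk_size, buffer)
      count += buffer.count("\n")
    end
  end
  count
end

Tempfile.create('chunked') do |tmp|
  1_000.times { tmp.puts 'x' * 99 }
  tmp.flush                                # data must be on disk before reading by path
  puts count_newlines_chunked(tmp.path)    # newline count via fixed-size chunks
  puts File.foreach(tmp.path).count        # line count via foreach
end
```

Like wc -l, this counts newlines, so it reports one less than foreach when the last line has no trailing newline.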
