As in the subject: I just noticed that readlines only accepts a string
as the line separator, and I wonder why it works this way.
Any explanations?
Just a guess: normally it’s not necessary and another reason might be
performance, since the overhead of a regexp might be significant for large
files.
However, you can simulate it if you read a complete file into a string and
then split with a regexp.
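A sketch of that approach: slurp the file into one String, then split on an arbitrary regexp separator (one or more blank lines here; both the separator and the sample text are assumptions for illustration):

```ruby
# Stand-in for File.read('myfile'); any text with a custom record
# separator works the same way.
text = "first record\n\nsecond record\n\n\nthird record"

# Split on one-or-more blank lines instead of a fixed string separator.
records = text.split(/\n{2,}/)
p records  # => ["first record", "second record", "third record"]
```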
BTW, if I want to read a file into an array of 'words' I have to do:
File.new('myfile').gets(nil).split
No better way?
For large files this is more efficient:

words = []
IO.foreach('myfile') do |line|
  words.push(*line.scan(/\w+/oi))
end
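A self-contained version of the same loop, using a temporary file so it can be run directly (the input text is an assumption for illustration):

```ruby
require 'tempfile'

Tempfile.create('myfile') do |f|
  f.write("one two\nthree two one\n")
  f.flush

  # Collect every \w+ run, line by line, without slurping the whole file.
  words = []
  IO.foreach(f.path) do |line|
    words.push(*line.scan(/\w+/))
  end
  p words  # => ["one", "two", "three", "two", "one"]
end
```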
If you have many repeating words you can save even more memory:

cache = Hash.new {|h,k| h[k]=k}
words = []
IO.foreach('myfile') do |line|
  words.push(*(line.scan(/\w+/oi).map {|w| cache[w]}))
end
On a sidenote, what are the efficiency issues related to the use of
IO#each vs IO.foreach vs a simple 'while line = gets ...'?
Try ruby -r profile with each method and see what happens. I'd guess that
there is not much difference.
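One way to check, sketched with the standard Benchmark library (the file contents and sizes are assumptions for illustration):

```ruby
require 'benchmark'
require 'tempfile'

Tempfile.create('bench') do |f|
  f.write("just a line of text\n" * 10_000)
  f.flush

  Benchmark.bm(12) do |bm|
    # Instance method on an open File:
    bm.report('IO#each')    { File.open(f.path) { |io| io.each { |l| l } } }
    # Class method that opens and closes the file itself:
    bm.report('IO.foreach') { IO.foreach(f.path) { |l| l } }
    # The classic explicit loop:
    bm.report('while gets') do
      File.open(f.path) { |io| while (l = io.gets); l; end }
    end
  end
end
```

All three should land close together, since each one reads the file a line at a time.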
Just a guess: normally it’s not necessary and another reason might be
performance, since the overhead of a regexp might be significant for large
files.
Another guess… For what I know of regexps, they are compiled to
a finite state machine. This compilation is more expensive than the actual
use, since an FSM considers each input character only once. The only point
in splitting at newlines is that newlines tend to occur more frequently
than a random regexp match. BTW, what I meant by more expensive is in
terms of complexity, i.e., execution time as a function of the size of the
input. If your file is large enough, the compilation of the regexp will
become negligible, since the regexp is small and the file is big.
A totally different reason might be that Ruby's regexp interface is not
suitable for streaming input. I don't know about the C internals, but from
Ruby you have to provide a String, i.e. a sequence with known length.
This is different from providing an iterator that just hands out a
character at a time.
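A small sketch of why that matters: Regexp matching takes a String, so a naive chunked read misses matches that straddle a chunk boundary unless you buffer. The chunk boundary and pattern below are assumptions for illustration:

```ruby
data    = "aaafoobar"
pattern = /foobar/

# Simulate reading in fixed-size chunks; the match spans the boundary.
chunks = [data[0, 5], data[5..-1]]      # ["aaafo", "obar"]

# Scanning each chunk in isolation finds nothing:
p chunks.any? { |c| c =~ pattern }      # => false

# Buffering the input into one String (as line-based reading does per
# line) finds the match:
p chunks.join =~ pattern                # => 3
```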
words = []
IO.foreach('myfile') do |line|
  words.push(*line.scan(/\w+/oi))
end
The /oi modifiers aren’t necessary.
Granted. I just grew used to putting “o” in there whenever the rx doesn’t
change over time. Kind of a documentation thingy.
If you have many repeating words you can save even more memory:

cache = Hash.new {|h,k| h[k]=k}
words = []
IO.foreach('myfile') do |line|
  words.push(*(line.scan(/\w+/oi).map {|w| cache[w]}))
end
The #map isn’t doing what you think it is doing. To remove repeating
words from the list:
It does exactly what I think it's doing. I don't want to remove
repeated words from the list but replace all identical strings with the
same instance to save memory. map fits the job perfectly. Of course,
you could use collect also…
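A quick demonstration of that interning trick: the cache hands back the same String instance for equal keys, so duplicate words share one object instead of each line's scan allocating a fresh copy:

```ruby
# Default proc stores the first string seen under its own value and
# returns it for every later equal key.
cache = Hash.new { |h, k| h[k] = k }

a = cache["word"]
b = cache["word".dup]   # an equal but distinct String object

# Both names now refer to the instance stored on first access:
p a.equal?(b)  # => true
```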