Why doesn't IO#readlines accept a Regexp?

As in the subject: I just noticed that readlines only accepts a String
as the line separator, and I wonder why it works this way.
Any explanations?

BTW, if I want to read a file into an array of 'words', do I have to do:

File.new('myfile').gets(nil).split

Is there no better way?

On a side note, what are the efficiency issues related to using
IO#each vs IO.foreach vs a simple 'while line = gets ...'?

As in the subject: I just noticed that readlines only accepts a String
as the line separator, and I wonder why it works this way.
Any explanations?

Sorry, none from me ;-)

BTW, if I want to read a file into an array of 'words', do I have to do:

File.new('myfile').gets(nil).split

Is there no better way?

File.read('myfile').split

on a sidenote, what are the efficiency issue related to the use of
IO#each vs IO#foreach(anIO) vs a simple ‘while line=gets…’ ?

All of these read the file one line at a time and present that line to
the user. It’s hard to imagine any performance difference between
them.
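If in doubt, this is easy to check with Benchmark from the standard library. A rough sketch (the sample file is generated on the fly; file name and size are arbitrary):

```ruby
require 'benchmark'
require 'tempfile'

# Generate a throwaway sample file; the size is arbitrary.
tmp = Tempfile.new('lines')
20_000.times { |i| tmp.puts "line #{i}" }
tmp.flush

# IO#each is line-by-line iteration (each_line); compare it with
# IO.foreach and the while-gets idiom over the same file.
results = Benchmark.bm(10) do |b|
  b.report('each_line') { File.open(tmp.path) { |f| f.each_line { } } }
  b.report('foreach')   { IO.foreach(tmp.path) { } }
  b.report('gets')      { File.open(tmp.path) { |f| nil while f.gets } }
end

tmp.close!
```

On typical runs the three timings come out very close, which fits the claim above.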

Gavin


On Thursday, September 18, 2003, 5:53:20 PM, gabriele wrote:

"gabriele renzi" surrender_it@remove.yahoo.it wrote in message
news:uroimv475apt4pn9bqqtp2aa1s7ul4ngui@4ax.com

As in the subject: I just noticed that readlines only accepts a String
as the line separator, and I wonder why it works this way.
Any explanations?

Just a guess: normally it’s not necessary and another reason might be
performance, since the overhead of a regexp might be significant for large
files.

However, you can simulate it if you read a complete file into a string and
then split with a regexp.
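For instance (the file contents and pattern below are made up): slurp the whole file, then split on any Regexp you like, which is exactly what readlines won't do for you.

```ruby
require 'tempfile'

# A throwaway file with two different record terminators.
tmp = Tempfile.new('recs')
tmp.write("alpha;beta\ngamma")
tmp.flush

# Whole-file read, then a Regexp split.
records = File.read(tmp.path).split(/[;\n]/)
tmp.close!

p records  # => ["alpha", "beta", "gamma"]
```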

BTW, if I want to read a file into an array of 'words', do I have to do:

File.new('myfile').gets(nil).split

Is there no better way?

For large files this is more efficient:

words = []
IO.foreach("myfile") do |line|
  words.push(*line.scan(/\w+/oi))
end

If you have many repeating words you can save even more memory:

cache = Hash.new { |h, k| h[k] = k }
words = []

IO.foreach("myfile") do |line|
  words.push(*line.scan(/\w+/oi).map { |w| cache[w] })
end

On a side note, what are the efficiency issues related to using
IO#each vs IO.foreach vs a simple 'while line = gets ...'?

Try ruby -r profile with each method and see what happens. I'd guess that
there is not much difference.

Regards

robert

thanks for all the answers.
BTW, yet another solution that comes to my mind now:

ary = File.new('bf.rb').map { |l| l.scan(/\w+/) }.flatten


On Thu, 18 Sep 2003 07:51:25 GMT, gabriele renzi surrender_it@remove.yahoo.it wrote:

Just a guess: normally it’s not necessary and another reason might be
performance, since the overhead of a regexp might be significant for large
files.

Another guess… From what I know of regexps, they are compiled to
a finite state machine. This compilation is more expensive than the actual
use, since an FSM considers each input character only once. The only point
in splitting at newlines is that newlines tend to occur more frequently
than a random regexp match. BTW, what I meant by 'more expensive' is in
terms of complexity, i.e., execution time as a function of the size of the
input. If your file is large enough, the compilation of the regexp will
become negligible, since the regexp is small and the file is big.
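The compile-once, match-often point can be sketched quickly (the input data here is invented): build the Regexp object a single time and reuse it; each scan is then one left-to-right pass over its input string.

```ruby
# Compile the pattern once...
word_re = Regexp.new('\w+')

lines = ["one two", "three  four", ""]
# ...then reuse the compiled object across many inputs;
# the compilation cost is paid only on the first line above.
words = lines.flat_map { |line| line.scan(word_re) }

p words  # => ["one", "two", "three", "four"]
```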

Peter

For large files this is more efficient:

words = []
IO.foreach("myfile") do |line|
  words.push(*line.scan(/\w+/oi))
end

The /oi modifiers aren’t necessary.

If you have many repeating words you can save even more memory:

cache = Hash.new { |h, k| h[k] = k }
words = []

IO.foreach("myfile") do |line|
  words.push(*line.scan(/\w+/oi).map { |w| cache[w] })
end

The #map isn’t doing what you think it is doing. To remove repeating
words from the list:

saw = Hash.new { |h, k| h[k] = true; false }
words = []

IO.foreach("myfile") do |line|
  words.push(*line.scan(/\w+/).reject { |w| saw[w] })
end

Or if word order isn’t a concern:

cache = {}

IO.foreach("myfile") do |line|
  line.scan(/\w+/).each { |w| cache[w] = 1 }
end

words = cache.keys


“Robert Klemme” bob.news@gmx.net wrote:

"Peter" Peter.Vanbroekhoven@cs.kuleuven.ac.be wrote in message
news:Pine.GSO.4.10.10309181322340.2343-100000@iris.cs.kuleuven.ac.be…

Just a guess: normally it's not necessary and another reason might be
performance, since the overhead of a regexp might be significant for
large files.

Another guess… From what I know of regexps, they are compiled to
a finite state machine. This compilation is more expensive than the actual
use, since an FSM considers each input character only once. The only point
in splitting at newlines is that newlines tend to occur more frequently
than a random regexp match. BTW, what I meant by 'more expensive' is in
terms of complexity, i.e., execution time as a function of the size of the
input. If your file is large enough, the compilation of the regexp will
become negligible, since the regexp is small and the file is big.

A totally different reason might be that Ruby's regexp interface is not
suitable for streaming input. I don't know about the C internals, but from
Ruby you have to provide a String, i.e. a sequence with a known length.
This is different from providing an iterator that just hands out a
character at a time.
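A contrived illustration of that point: when input arrives in chunks, a match that straddles a chunk boundary is lost unless the chunks are first joined into one String of known length.

```ruby
chunks = ["foo ba", "r baz"]   # "bar" is split across the two reads
pattern = /bar/

# Testing each chunk separately misses the straddling match...
found_per_chunk = chunks.any? { |c| c =~ pattern }

# ...while Ruby's actual interface -- one complete String -- finds it.
found_in_whole = !(chunks.join =~ pattern).nil?

p [found_per_chunk, found_in_whole]  # => [false, true]
```

A streaming-aware regexp engine would have to carry match state across reads, which the String-based interface simply has no way to express.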

robert

Or if you want to use the latest and greatest:

require 'set'

words = Set.new
IO.foreach("myfile") do |line|
  line.scan(/\w+/).each { |w| words << w }
end

Gavin


On Friday, September 19, 2003, 10:16:02 AM, Sabby wrote:

Or if word order isn’t a concern:

cache = {}

IO.foreach("myfile") do |line|
  line.scan(/\w+/).each { |w| cache[w] = 1 }
end

words = cache.keys

"Sabby and Tabby" sabbyxtabby@yahoo.com wrote in message
news:f5a79bf2.0309181614.1b288fb7@posting.google.com

For large files this is more efficient:

words = []
IO.foreach("myfile") do |line|
  words.push(*line.scan(/\w+/oi))
end

The /oi modifiers aren’t necessary.

Granted. I just grew used to putting “o” in there whenever the rx doesn’t
change over time. Kind of a documentation thingy.

If you have many repeating words you can save even more memory:

cache = Hash.new { |h, k| h[k] = k }
words = []

IO.foreach("myfile") do |line|
  words.push(*line.scan(/\w+/oi).map { |w| cache[w] })
end

The #map isn’t doing what you think it is doing. To remove repeating
words from the list:

It does exactly what I think it's doing. :-) I don't want to remove
repeated words from the list, but to replace all identical strings with the
same instance to save memory. map fits the job perfectly. Of course,
you could use collect too… :-)
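Robert's trick is essentially string interning, and object identity (not just content equality) shows it working. A small check (String.new below merely forces two distinct objects with the same content):

```ruby
# Self-populating hash: the first lookup of a key stores and returns it.
cache = Hash.new { |h, k| h[k] = k }

first  = cache[String.new("word")]
second = cache[String.new("word")]   # a second, distinct String object

p first == second       # => true  (same content)
p first.equal?(second)  # => true  (same object -- only one copy is kept)
```

Every later occurrence of "word" maps back to the single cached instance, which is where the memory saving comes from.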

Regards

robert

“Robert Klemme” bob.news@gmx.net wrote:

One better:

require 'set'

words = Set.new
IO.foreach("myfile") do |line|
  words.merge(line.scan(/\w+/))
end

Gavin


On Friday, September 19, 2003, 11:20:47 AM, Gavin wrote:

Or if you want to use the latest and greatest:

require 'set'

words = Set.new
IO.foreach("myfile") do |line|
  line.scan(/\w+/).each { |w| words << w }
end