As in the subject: I just noticed that readlines only accepts a string
as the line separator, and I wonder why it works this way.
Any explanations?
Just a guess: normally it’s not necessary and another reason might be
performance, since the overhead of a regexp might be significant for large
files.
However, you can simulate it if you read a complete file into a string and
then split with a regexp.
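A sketch of that approach: slurp the file into one String, then split on an arbitrary regexp separator (one or more blank lines here; both the separator and the sample text are assumptions for illustration):

```ruby
# Stand-in for File.read('myfile'); any text with a custom record
# separator works the same way.
text = "first record\n\nsecond record\n\n\nthird record"

# Split on one-or-more blank lines instead of a fixed string separator.
records = text.split(/\n{2,}/)
p records  # => ["first record", "second record", "third record"]
```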
BTW, if I want to read a file into an array of 'words' I have to do:
File.new('myfile').gets(nil).split
No better way?
For large files this is more efficient:

words = []
IO.foreach('myfile') do |line|
  words.push(*line.scan(/\w+/oi))
end
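A self-contained version of the same loop, using a temporary file so it can be run directly (the input text is an assumption for illustration):

```ruby
require 'tempfile'

Tempfile.create('myfile') do |f|
  f.write("one two\nthree two one\n")
  f.flush

  # Collect every \w+ run, line by line, without slurping the whole file.
  words = []
  IO.foreach(f.path) do |line|
    words.push(*line.scan(/\w+/))
  end
  p words  # => ["one", "two", "three", "two", "one"]
end
```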
If you have many repeating words you can save even more memory:

cache = Hash.new {|h,k| h[k]=k}
words = []
IO.foreach('myfile') do |line|
  words.push(*(line.scan(/\w+/oi).map {|w| cache[w]}))
end
On a sidenote, what are the efficiency issues related to the use of
IO#each vs IO.foreach vs a simple 'while line = gets ...'?
Try ruby -r profile with each method and see what happens. I'd guess that
there is not much difference.
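One way to check, sketched with the standard Benchmark library (the file contents and sizes are assumptions for illustration):

```ruby
require 'benchmark'
require 'tempfile'

Tempfile.create('bench') do |f|
  f.write("just a line of text\n" * 10_000)
  f.flush

  Benchmark.bm(12) do |bm|
    # Instance method on an open File:
    bm.report('IO#each')    { File.open(f.path) { |io| io.each { |l| l } } }
    # Class method that opens and closes the file itself:
    bm.report('IO.foreach') { IO.foreach(f.path) { |l| l } }
    # The classic explicit loop:
    bm.report('while gets') do
      File.open(f.path) { |io| while (l = io.gets); l; end }
    end
  end
end
```

All three should land close together, since each one reads the file a line at a time.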
Just a guess: normally it’s not necessary and another reason might be
performance, since the overhead of a regexp might be significant for large
files.
Another guess… For what I know of regexps, they are compiled to
a finite state machine. This compilation is more expensive than the actual
use, since an FSM considers each input character only once. The only point
in splitting at newlines is that newlines tend to occur more frequently
than a random regexp match. BTW, what I meant by more expensive is in
terms of complexity, i.e., execution time as a function of the size of the
input. If your file is large enough, the compilation of the regexp will
become negligible, since the regexp is small and the file is big.
A totally different reason might be that Ruby's regexp interface is not
suitable for streaming input. I don't know about the C internals, but from
Ruby you have to provide a String, i.e. a sequence with known length.
This is different from providing an iterator that just hands out a
character at a time.
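A small sketch of why that matters: Regexp matching takes a String, so a naive chunked read misses matches that straddle a chunk boundary unless you buffer. The chunk boundary and pattern below are assumptions for illustration:

```ruby
data    = "aaafoobar"
pattern = /foobar/

# Simulate reading in fixed-size chunks; the match spans the boundary.
chunks = [data[0, 5], data[5..-1]]      # ["aaafo", "obar"]

# Scanning each chunk in isolation finds nothing:
p chunks.any? { |c| c =~ pattern }      # => false

# Buffering the input into one String (as line-based reading does per
# line) finds the match:
p chunks.join =~ pattern                # => 3
```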
words = []
IO.foreach('myfile') do |line|
  words.push(*line.scan(/\w+/oi))
end
The /oi modifiers aren’t necessary.
Granted. I just grew used to putting “o” in there whenever the rx doesn’t
change over time. Kind of a documentation thingy.
If you have many repeating words you can save even more memory:

cache = Hash.new {|h,k| h[k]=k}
words = []
IO.foreach('myfile') do |line|
  words.push(*(line.scan(/\w+/oi).map {|w| cache[w]}))
end
The #map isn’t doing what you think it is doing. To remove repeating
words from the list:
It does exactly what I think it's doing. I don't want to remove
repeated words from the list but replace all identical strings with the
same instance to save memory. map fits the job perfectly. Of course,
you could use collect also…
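A quick demonstration of that interning trick: the cache hands back the same String instance for equal keys, so duplicate words share one object instead of each line's scan allocating a fresh copy:

```ruby
# Default proc stores the first string seen under its own value and
# returns it for every later equal key.
cache = Hash.new { |h, k| h[k] = k }

a = cache["word"]
b = cache["word".dup]   # an equal but distinct String object

# Both names now refer to the instance stored on first access:
p a.equal?(b)  # => true
```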