The faster way to read files

Does anybody know which is the fastest way to read a file? Let's say
there are 1,000,000 files with sizes not exceeding 10 KB.

Thanks in advance.

···

--
Posted via http://www.ruby-forum.com/.

What are you going to do with the files? Do you need to read all the
data before you start processing them, or can you do it sequentially?
Can you do it in a distributed way? It really depends.

-Jingjing

···

-----Original Message-----
From: Noé Alejandro [mailto:casanejo@gmail.com]
Sent: Wednesday, November 09, 2011 8:57 AM
To: ruby-talk ML
Subject: The faster way to read files

Does anybody know which is the fastest way to read a file? Let's say
there are 1,000,000 files with sizes not exceeding 10 KB.

Thanks in advance.

Sequentially... The point is that I need to process each pair of files,
so I don't know which of the different ways that Ruby has to read files
is the fastest.

Thank you.

···

Perfect! Thank you Jing.

···

I mean, the final processing is about comparing the (preprocessed) content of
each pair of texts. So, I open a file, remove blanks and so on, and
record the information in a data structure. Then I open another file,
remove blanks and so on, and record this new information in another data
structure. Now I have the preprocessed information for a pair of texts, and
then I apply further processing to it.

I repeat the previous steps for each text.
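
The steps described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names `preprocess` and `compare` are made up, a Set is just one possible data structure, and the "shared words" comparison is a placeholder for whatever the real processing is.

```ruby
require "set"

# Hypothetical helper: read one file line by line, lowercase it, and keep
# only the words (this drops blanks and punctuation marks).
def preprocess(path)
  words = Set.new
  File.foreach(path) do |line|
    line.downcase.scan(/\w+/) { |word| words << word }
  end
  words
end

# Placeholder comparison (an assumption, not the poster's actual processing):
# return the words that both texts have in common.
def compare(path_a, path_b)
  preprocess(path_a) & preprocess(path_b)
end
```

If every file takes part in many pairs, it would probably pay off to keep the result of `preprocess` around instead of reading the same file again for each pair.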

···

Great! Thanks for the advice.

Greetings.

···

For small files, there shouldn't be much of a difference between reading line
by line and reading all the lines in a single sweep. Ruby IO handles the
buffering for you.

If you are really concerned about performance, why not do some
benchmarking?
http://www.ruby-doc.org/stdlib-1.9.2/libdoc/benchmark/rdoc/Benchmark.html
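
A minimal sketch of such a benchmark, assuming a reasonably recent Ruby. The sample files, their count, and the helper names (`make_sample_files`, `read_whole`, `read_by_line`, `read_lines`) are invented for illustration; the numbers will depend on your disk, OS cache, and Ruby version, so run it against your real data.

```ruby
require "benchmark"
require "tmpdir"

# Hypothetical setup: create `count` small text files of `size` bytes each.
def make_sample_files(dir, count: 200, size: 10 * 1024)
  line = "lorem ipsum dolor sit amet\n"
  Array.new(count) do |i|
    path = File.join(dir, "sample_#{i}.txt")
    File.write(path, (line * (size / line.bytesize + 1))[0, size])
    path
  end
end

# Read each file in one call.
def read_whole(paths)
  paths.sum { |path| File.read(path).bytesize }
end

# Read each file line by line without keeping the lines.
def read_by_line(paths)
  paths.sum do |path|
    bytes = 0
    File.foreach(path) { |l| bytes += l.bytesize }
    bytes
  end
end

# Read each file into an array of lines.
def read_lines(paths)
  paths.sum { |path| File.readlines(path).sum(&:bytesize) }
end

Dir.mktmpdir do |dir|
  paths = make_sample_files(dir)
  Benchmark.bm(16) do |bm|
    bm.report("File.read")      { read_whole(paths) }
    bm.report("File.foreach")   { read_by_line(paths) }
    bm.report("File.readlines") { read_lines(paths) }
  end
end
```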

-Jingjing

···

-----Original Message-----
From: Noé Alejandro [mailto:casanejo@gmail.com]
Sent: Wednesday, November 09, 2011 9:38 AM
To: ruby-talk ML
Subject: Re: The fastest way to read files

Sequentially... The point is that I need to process each pair of files,
so I don't know which of the different ways that Ruby has to read files
is the fastest.

Thank you.

What kind of processing do you need to do on those files?

Kind regards

robert

···

2011/11/9 Noé Alejandro <casanejo@gmail.com>:

Sequentially... The point is that I need to process each pair of files,
so I don't know which of the different ways that Ruby has to read files
is the fastest.

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Aha. I assume you do your analysis based on words. In that case
something like this might be efficient:

# ensure every word is only once in memory
words = Hash.new {|h,k| k.freeze; h[k] = k}
...

words_in_file = []

File.foreach a_file_name do |line|
  line.scan(/\w+/) do |word|
    word.downcase!
    words_in_file << words[word]
  end
end
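
For anyone who wants to try the idea, here is a self-contained sketch of the same approach. The names `WORDS`, `words_in`, and `with_sample` and the sample texts are assumptions added for illustration; only the Hash default block and the `File.foreach` / `scan` loop come from the snippet above.

```ruby
require "tempfile"

# One shared pool: looking up a word returns the single frozen copy of it.
WORDS = Hash.new { |h, k| k.freeze; h[k] = k }

# Hypothetical helper: collect the lowercased, pooled words of one file.
def words_in(path, pool = WORDS)
  result = []
  File.foreach(path) do |line|
    line.scan(/\w+/) do |word|
      result << pool[word.downcase]
    end
  end
  result
end

# Hypothetical helper: write `text` to a temporary file and yield its path.
def with_sample(text)
  Tempfile.create("sample") do |f|
    f.write(text)
    f.flush
    yield f.path
  end
end

first  = with_sample("Ruby is fast\n") { |path| words_in(path) }
second = with_sample("ruby IS fun\n")  { |path| words_in(path) }

p first                          # => ["ruby", "is", "fast"]
puts first[0].equal?(second[0])  # => true, both files share one "ruby" object
```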

Kind regards

robert

···

2011/11/11 Noé Alejandro <casanejo@gmail.com>:

I mean, the final processing is about comparing the (preprocessed) content of
each pair of texts. So, I open a file, remove blanks and so on, and
record the information in a data structure. Then I open another file,
remove blanks and so on, and record this new information in another data
structure. Now I have the preprocessed information for a pair of texts, and
then I apply further processing to it.

I repeat the previous steps for each text.

Hi Robert.

Basically, I read their content in order to remove blanks and punctuation
marks and to lowercase the text. I don't need to rewrite the information,
only read the files and then close them.

Regards.

···

AFAIK, Ruby hashes have (almost) always frozen their keys.

irb(main):001:0> h = {}
=> {}
irb(main):002:0> h["blah"] = 42
=> 42
irb(main):003:0> h.keys.map(&:frozen?)
=> [true]

···

On Nov 11, 2011, at 04:50 , Robert Klemme wrote:

# ensure every word is only once in memory
words = Hash.new {|h,k| k.freeze; h[k] = k}

I don't understand: you wrote earlier you need to process pairs of
files but all these operations mentioned above can be done on a single
file. Plus, if you do not write the modified content anywhere what's
the point of the exercise? That would only burn CPU and disk IO for
nothing.

Kind regards

robert

···

2011/11/11 Noé Alejandro <casanejo@gmail.com>:

Basically, I read their content in order to remove blanks and punctuation
marks and to lowercase the text. I don't need to rewrite the information,
only read the files and then close them.

I know. That's the reason why I do the freeze in the block.

Cheers

robert

···

On Fri, Nov 11, 2011 at 10:07 PM, Ryan Davis <ryand-ruby@zenspider.com> wrote:

On Nov 11, 2011, at 04:50 , Robert Klemme wrote:

# ensure every word is only once in memory
words = Hash.new {|h,k| k.freeze; h[k] = k}

AFAIK, Ruby hashes have (almost) always frozen their keys.

PS: It's true for String keys only.

···

On Sat, Nov 12, 2011 at 12:20 AM, Robert Klemme <shortcutter@googlemail.com> wrote:

On Fri, Nov 11, 2011 at 10:07 PM, Ryan Davis <ryand-ruby@zenspider.com> wrote:

On Nov 11, 2011, at 04:50 , Robert Klemme wrote:

# ensure every word is only once in memory
words = Hash.new {|h,k| k.freeze; h[k] = k}

AFAIK, Ruby hashes have (almost) always frozen their keys.

I know. That's the reason why I do the freeze in the block.

I'm confused. If you know that the key is going to be frozen anyways, why freeze it?

···

On Nov 11, 2011, at 15:20 , Robert Klemme wrote:

On Fri, Nov 11, 2011 at 10:07 PM, Ryan Davis <ryand-ruby@zenspider.com> wrote:

On Nov 11, 2011, at 04:50 , Robert Klemme wrote:

# ensure every word is only once in memory
words = Hash.new {|h,k| k.freeze; h[k] = k}

AFAIK, Ruby hashes have (almost) always frozen their keys.

I know. That's the reason why I do the freeze in the block.

The implicit freeze from Hash#[]= duplicates the string and freezes
the duplicate, not the same object given by the user.

Explicitly freezing the key before Hash#[]= prevents MRI[1] from
duplicating the string.

------------------------ freeze_example.rb --------------------
h = {}
frozen = "foo".freeze
h[frozen] = true

# explicitly freezing the key means the String object is stored as-is
p [ :frozen_original_key, frozen.object_id ]
p [ :frozen_key_after_aset, h.keys[0].object_id ]

h = {}
not_frozen = "foo"
h[not_frozen] = true

# Not freezing means the key stored in the hash key is a different
# object than the one provided by the user.
p [ :not_frozen_original_key, not_frozen.object_id ]
p [ :not_frozen_key_after_aset, h.keys[0].object_id ]

------------------------ Output --------------------------------
[:frozen_original_key, 70096844693360]
[:frozen_key_after_aset, 70096844693360]
[:not_frozen_original_key, 70096844693120]
[:not_frozen_key_after_aset, 70096844694120]

[1] - Verified by reading rb_hash_aset() in hash.c which eventually
      calls rb_str_new_frozen() in string.c (ruby/trunk):

rb_str_new_frozen(VALUE orig)
{
    VALUE klass, str;

    if (OBJ_FROZEN(orig)) return orig;

    ...

···

Ryan Davis <ryand-ruby@zenspider.com> wrote:

On Nov 11, 2011, at 15:20 , Robert Klemme wrote:
> On Fri, Nov 11, 2011 at 10:07 PM, Ryan Davis <ryand-ruby@zenspider.com> wrote:
>> On Nov 11, 2011, at 04:50 , Robert Klemme wrote:
>>
>>> # ensure every word is only once in memory
>>> words = Hash.new {|h,k| k.freeze; h[k] = k}
>>
>> AFAIK, Ruby hashes have (almost) always frozen their keys.
>
> I know. That's the reason why I do the freeze in the block.

I'm confused. If you know that the key is going to be frozen anyways,
why freeze it?

Exactly. And in that case we would end up with two objects in memory
where one is sufficient:

irb(main):008:0> s = "foo"
=> "foo"
irb(main):009:0> h = Hash.new {|ha,k| ha[k]=k}
=> {}
irb(main):010:0> h[s]
=> "foo"
irb(main):011:0> h.each {|k,v| puts k.object_id, v.object_id}
137705420
137645970
=> {"foo"=>"foo"}
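
The same comparison can be written as a small script using `equal?` instead of reading object ids by eye. This is a sketch, assuming MRI behaviour as described in this thread; the variable names are made up, and `.dup` is there only to guarantee unfrozen strings regardless of how string literals are treated.

```ruby
# Without the explicit freeze: the hash stores a frozen copy as the key,
# while the value is still the caller's original string.
plain = Hash.new { |h, k| h[k] = k }

# With the explicit freeze: the very same object serves as key and value.
interned = Hash.new { |h, k| k.freeze; h[k] = k }

s1 = "foo".dup
s2 = "foo".dup
plain[s1]
interned[s2]

plain_key, plain_value = plain.first
int_key, int_value     = interned.first

puts plain_key.equal?(plain_value)  # false: two String objects in memory
puts int_key.equal?(int_value)      # true: one String object is enough
```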

Kind regards

robert

···

On Sat, Nov 12, 2011 at 1:26 AM, Eric Wong <normalperson@yhbt.net> wrote:

Ryan Davis <ryand-ruby@zenspider.com> wrote:

On Nov 11, 2011, at 15:20 , Robert Klemme wrote:
> On Fri, Nov 11, 2011 at 10:07 PM, Ryan Davis <ryand-ruby@zenspider.com> wrote:
>> On Nov 11, 2011, at 04:50 , Robert Klemme wrote:
>>
>>> # ensure every word is only once in memory
>>> words = Hash.new {|h,k| k.freeze; h[k] = k}
>>
>> AFAIK, Ruby hashes have (almost) always frozen their keys.
>
> I know. That's the reason why I do the freeze in the block.

I'm confused. If you know that the key is going to be frozen anyways,
why freeze it?

The implicit freeze from Hash#[]= duplicates the string and freezes
the duplicate, not the same object given by the user.

Explicitly freezing the key before Hash#[]= prevents MRI[1] from
duplicating the string.

-----Original Message-----

···

From: Robert Klemme [mailto:shortcutter@googlemail.com]
Sent: Saturday, November 12, 2011 12:35
To: ruby-talk ML
Subject: Re: The fastest way to read files

On Sat, Nov 12, 2011 at 1:26 AM, Eric Wong <normalperson@yhbt.net> wrote:

Ryan Davis <ryand-ruby@zenspider.com> wrote:

On Nov 11, 2011, at 15:20 , Robert Klemme wrote:
> On Fri, Nov 11, 2011 at 10:07 PM, Ryan Davis <ryand-ruby@zenspider.com> wrote:
>> On Nov 11, 2011, at 04:50 , Robert Klemme wrote:
>>
>>> # ensure every word is only once in memory
>>> words = Hash.new {|h,k| k.freeze; h[k] = k}
>>
>> AFAIK, Ruby hashes have (almost) always frozen their keys.
>
> I know. That's the reason why I do the freeze in the block.

I'm confused. If you know that the key is going to be frozen anyways,
why freeze it?

The implicit freeze from Hash#[]= duplicates the string and freezes
the duplicate, not the same object given by the user.

Explicitly freezing the key before Hash#[]= prevents MRI[1] from
duplicating the string.

Exactly. And in that case we would end up with two objects in memory where
one is sufficient:

irb(main):008:0> s = "foo"
=> "foo"
irb(main):009:0> h = Hash.new {|ha,k| ha[k]=k}
=> {}
irb(main):010:0> h[s]
=> "foo"
irb(main):011:0> h.each {|k,v| puts k.object_id, v.object_id}
137705420
137645970
=> {"foo"=>"foo"}

Kind regards

robert
