The faster way to read files

Cassna_Capriet · 9 November 2011 16:56

Does anybody know the fastest way to read a file? Let's say
there are 1,000,000 files, none larger than 10 KB.

Thanks in advance.

--
Posted via http://www.ruby-forum.com/.

Duan_Jingjing · 9 November 2011 17:23

What are you going to do with the files? Do you need to read all the
data before you start processing it, or can you do it sequentially?
Can you do it in a distributed way? It really depends.

-Jingjing

Cassna_Capriet · 9 November 2011 17:38

Sequentially... The point is that I need to process each pair of files,
so I don't know which of the different ways that Ruby has to read files
is the fastest.

Thank you.

Cassna_Capriet · 9 November 2011 18:23

Perfect! Thank you Jing.

Cassna_Capriet · 11 November 2011 12:33

I mean, the final processing compares the (preprocessed) content of
each pair of texts. So I open a file, remove blanks and so on, and
record the information in a data structure. Then I open another file,
remove blanks and so on, and record this new information in another data
structure. Now I have the preprocessed information for a pair of texts,
and I apply further processing to it.

I repeat the previous steps for each text.

Cassna_Capriet · 11 November 2011 15:01

Great! Thanks for the advice.

Greetings.

Duan_Jingjing · 9 November 2011 17:54

For small files, there shouldn't be much of a difference between reading
line by line and reading all the lines in a single sweep. Ruby IO handles
the buffering for you.

If you are really concerned about performance, why not do some
benchmarking?
http://www.ruby-doc.org/stdlib-1.9.2/libdoc/benchmark/rdoc/Benchmark.html
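
A minimal sketch of such a benchmark, assuming a scaled-down workload of 1,000 temporary files of 10 KB each rather than the 1,000,000 files in the question. The file contents, the constants and the helper names (read_whole, read_by_line, read_as_lines) are made up for illustration; only File.read, File.foreach, File.readlines and Benchmark.bm come from Ruby itself. Timings will vary with disk, OS cache and Ruby version, so no results are shown.

```ruby
require "benchmark"
require "tmpdir"

# Assumed workload, scaled down so the sketch runs quickly.
FILE_COUNT = 1_000
FILE_SIZE  = 10 * 1024
TOTALS     = {} # bytes seen by each approach, as a sanity check

# Read the whole file into one String.
def read_whole(path)
  File.read(path).bytesize
end

# Read the file line by line without keeping the lines.
def read_by_line(path)
  bytes = 0
  File.foreach(path) { |line| bytes += line.bytesize }
  bytes
end

# Read the file into an Array of lines.
def read_as_lines(path)
  File.readlines(path).sum(&:bytesize)
end

Dir.mktmpdir do |dir|
  content = ("lorem ipsum dolor sit amet\n" * 400)[0, FILE_SIZE]
  paths = Array.new(FILE_COUNT) do |i|
    path = File.join(dir, "sample_#{i}.txt")
    File.write(path, content)
    path
  end

  Benchmark.bm(14) do |bm|
    bm.report("File.read")      { TOTALS[:whole]   = paths.sum { |path| read_whole(path) } }
    bm.report("File.foreach")   { TOTALS[:by_line] = paths.sum { |path| read_by_line(path) } }
    bm.report("File.readlines") { TOTALS[:lines]   = paths.sum { |path| read_as_lines(path) } }
  end
end
```

All three reports should see the same number of bytes; only the elapsed times differ.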

-Jingjing

Robert_K1 · 10 November 2011 13:16

What kind of processing do you need to do on those files?

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Robert_K1 · 11 November 2011 12:50

Aha. I assume you do your analysis based on words. In that case
something like this might be efficient:

# ensure every word is only once in memory
words = Hash.new {|h,k| k.freeze; h[k] = k}
# ...

words_in_file = []

File.foreach a_file_name do |line|
  line.scan(/\w+/) do |word|
    word.downcase!
    words_in_file << words[word]
  end
end
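
The snippet above can be tried end to end with something like the following sketch. The two temporary files, their contents, the WORDS constant and the words_in helper are assumptions made for illustration; they stand in for one pair of the real texts.

```ruby
require "tempfile"

# Canonical-word table: the first String seen for each word is frozen and
# stored as both key and value, so later lookups return that same object.
WORDS = Hash.new { |h, k| k.freeze; h[k] = k }

# Hypothetical helper: the lowercased words of one file, interned via WORDS.
def words_in(path)
  result = []
  File.foreach(path) do |line|
    line.scan(/\w+/) do |word|
      word.downcase!
      result << WORDS[word]
    end
  end
  result
end

# Two throwaway files stand in for one pair of texts.
first  = Tempfile.new("first").tap  { |f| f.write("The quick fox\n"); f.flush }
second = Tempfile.new("second").tap { |f| f.write("the lazy dog\n");  f.flush }

a = words_in(first.path)
b = words_in(second.path)

# "the" occurs in both files but is held in memory only once.
puts a.first.equal?(b.first)
```

Because both word lists point at the same String objects, comparing a pair of texts does not multiply the memory used for repeated words.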

Kind regards

robert

Cassna_Capriet · 11 November 2011 12:04

Hi Robert.

Basically, I read their content in order to remove blanks and punctuation
marks and to lowercase the text. I don't need to rewrite the information,
only to read the files and then close them.

Regards.
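
The preprocessing described in this post (lowercasing, dropping blanks and punctuation) could be sketched as below. The normalize helper and the sample string are hypothetical, and the POSIX bracket classes used here are only one possible reading of "blanks" and "punctuation marks".

```ruby
# Hypothetical helper: lowercase the text, then delete every run of
# whitespace and punctuation characters.
def normalize(text)
  text.downcase.gsub(/[[:space:][:punct:]]+/, "")
end

puts normalize("Hello, World!  It's 10 kb.")
# prints "helloworldits10kb"
```

Note that deleting blanks outright merges neighbouring words; if word boundaries matter for the later comparison, replacing the matched runs with a single space would keep them.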

Ryan_Davis1 · 11 November 2011 21:07

AFAIK, Ruby hashes have (almost) always frozen their keys.

irb(main):001:0> h = {}
=> {}
irb(main):002:0> h["blah"] = 42
=> 42
irb(main):003:0> h.keys.map(&:frozen?)
=> [true]

Robert_K1 · 11 November 2011 12:15

I don't understand: you wrote earlier you need to process pairs of
files but all these operations mentioned above can be done on a single
file. Plus, if you do not write the modified content anywhere what's
the point of the exercise? That would only burn CPU and disk IO for
nothing.

Kind regards

robert

Robert_K1 · 11 November 2011 23:20

I know. That's the reason why I do the freeze in the block.

Cheers

robert

Robert_K1 · 11 November 2011 23:21

PS: It's true for String keys only.
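
A small sketch of that difference on MRI, using a hypothetical stored_key helper that returns the key object a fresh Hash actually keeps:

```ruby
# Hypothetical helper: store `key` in a new Hash and return the key object
# that the Hash holds afterwards.
def stored_key(key)
  h = {}
  h[key] = true
  h.keys.first
end

string_key = "foo".dup # unfrozen String
array_key  = [1, 2]    # unfrozen Array

# An unfrozen String key is replaced by a frozen copy.
puts stored_key(string_key).equal?(string_key) # false
puts stored_key(string_key).frozen?            # true

# Other key types are stored as-is and left unfrozen.
puts stored_key(array_key).equal?(array_key)   # true
puts stored_key(array_key).frozen?             # false
```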

Ryan_Davis1 · 12 November 2011 00:02

I'm confused. If you know that the key is going to be frozen anyways, why freeze it?

Eric_Wong2 · 12 November 2011 00:26

The implicit freeze from Hash#[]= duplicates the string and freezes
the duplicate, not the same object given by the user.

Explicitly freezing the key before Hash#[]= prevents MRI[1] from
duplicating the string.

------------------------ freeze_example.rb --------------------
h = {}
frozen = "foo".freeze
h[frozen] = true

# explicitly freezing the key means the String object is stored as-is
p [ :frozen_original_key, frozen.object_id ]
p [ :frozen_key_after_aset, h.keys[0].object_id ]

h = {}
not_frozen = "foo"
h[not_frozen] = true

# Not freezing means the key stored in the hash key is a different
# object than the one provided by the user.
p [ :not_frozen_original_key, not_frozen.object_id ]
p [ :not_frozen_key_after_aset, h.keys[0].object_id ]

------------------------ Output --------------------------------
[:frozen_original_key, 70096844693360]
[:frozen_key_after_aset, 70096844693360]
[:not_frozen_original_key, 70096844693120]
[:not_frozen_key_after_aset, 70096844694120]

[1] - Verified by reading rb_hash_aset() in hash.c which eventually
calls rb_str_new_frozen() in string.c (ruby/trunk):

rb_str_new_frozen(VALUE orig)
{
VALUE klass, str;

if (OBJ_FROZEN(orig)) return orig;

...

Robert_K1 · 12 November 2011 11:34

Exactly. And in that case we would end up with two objects in memory
where one is sufficient:

irb(main):008:0> s = "foo"
=> "foo"
irb(main):009:0> h = Hash.new {|ha,k| ha[k]=k}
=> {}
irb(main):010:0> h[s]
=> "foo"
irb(main):011:0> h.each {|k,v| puts k.object_id, v.object_id}
137705420
137645970
=> {"foo"=>"foo"}
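
For comparison, a sketch that puts the two default blocks side by side on MRI. The constant names PLAIN and FREEZING are made up for illustration.

```ruby
# Same default block, without and with the explicit freeze.
PLAIN    = Hash.new { |h, k| h[k] = k }
FREEZING = Hash.new { |h, k| k.freeze; h[k] = k }

PLAIN["foo".dup]    # key becomes a frozen copy, value is the original String
FREEZING["foo".dup] # key and value are the same frozen String

PLAIN.each    { |k, v| puts k.equal?(v) } # false: two objects in memory
FREEZING.each { |k, v| puts k.equal?(v) } # true: one object in memory
```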

Kind regards

robert

Luca_Email · 29 December 2011 06:55
