*Fast* way to process large files line by line

Hi Folks,

  I am using ruby to analyse a huge (around 60G) amount of my networking
experiment data. Let me briefly describe my technique: I have to read
around 40 files (of around 1.5G each) named f1, f2, .... Each file fi
contains traceroutes to lots of destinations at different times, i.e. a
file is basically a list of traceroutes launched from a given src (src =
filename) at different times. I want to get a structure like the
following: (list of all traceroutes from *all* src's at time 1), (list
of all traceroutes from *all* src's at time 2)... and so on.

  For this I am using the following pseudocode:

  outputfile.open
  open all files f1..fn
  while (!(all files have eof))
    (f1..fn).each{|f|
      next if f.eof
      line = f.readline
      parse the line, and get a structure P out of it
      put P into a hashtable: H[P.time] << P

      check for eof conditions on f

      if (H has more than k keys ? (ie has it become very large))
        H.keys.sort.each{|t|
          outputfile << Marshal.dump(H[t])
          H.delete(t)
        }
      end
    }
  end
  close all files

// Btw, I can't use an array instead of a hashtable H, as the P.time
values read across all files needn't be the same.
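
In real Ruby the loop looks roughly like this (the parse is just a
placeholder, and the f* glob for the input file names is only for
illustration):

  # minimal sketch of the pseudocode above
  K = 10_000                                # flush threshold

  def parse_line(line)
    time, rest = line.split(' ', 2)         # placeholder parse
    [time, rest]
  end

  def flush(h, out)
    h.keys.sort.each do |t|
      out << Marshal.dump(h[t])
      h.delete(t)
    end
  end

  File.open('merged.dump', 'wb') do |out|
    files = Dir['f*'].map { |path| File.open(path) }
    h = Hash.new { |hash, k| hash[k] = [] }

    until files.all? { |f| f.eof? }
      files.each do |f|
        next if f.eof?
        time, data = parse_line(f.readline)
        h[time] << data
        flush(h, out) if h.size > K
      end
    end

    flush(h, out)
    files.each { |f| f.close }
  end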

This is performing miserably SLOW. I have the following questions:

  i. How fast is f.readline? I want to use the maximum buffering
possible for the largest speed gains. In Ruby, how do I set the buffer size?
I looked through io.c, and it seems that readline essentially uses getc
(stopping when it gets a newline). How can I set the buffer size for the
underlying libc FILE*? Oh, btw, each line is approx 200-400 bytes.

  ii. Marshal.dump is also very slow. Is there an alternative? YAML is
even worse.

  iii. Is it bad to have around 40-50 files open at the same time?

  iv. The program does use a lot of memory but not so much; around 30-40
percent of a 1G RAM machine is used by it. So I think paging in/out is not a
problem.

  v. Would coding the readline part in C using RubyInline offer me speed
advantages?

  vi. I am thinking of trying the following to reduce the time it takes,
I would very much welcome your comments:

    a. Remove Marshal.dump [I don't need to strictly serialize objects,
only dump the data and read it back] and replace it with some string
form which is more compact. Actually, is it possible to have something
like fixed-length structures as in C? For example, I would want P to be
like this: struct P { char foo[100]; int a[100]; }. This way I think
the IO would be faster, as I could just dump a fixed number of bytes to a
file.

    b. Try to reduce the memory consumption of this by reducing k further
so that the program doesn't page in/out.

    c. Can someone point me to some good sample code for reading a file
line by line in C and then putting it into a ruby hashtable?
    d. How much of the slowness is due to the fact that it is ruby and not
C?

To give you an idea of how slow this actually is: just reading all the
files line by line takes around 8-9 hrs. Whereas the above easily takes
5-6 days!! And I am quite unable to run a profile on my code as it is
just too slow.

I would be very grateful for your comments, and particularly if you have
any suggestions/experience on doing this in a fast way.

--Devesh Agrawal

···

--
Posted via http://www.ruby-forum.com/.

Could you not parallelise the processing of each file? Perhaps using
something like starfish (http://www.rufy.com/starfish/doc/)?

Farrel

···

On 15/11/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:

<snip quoted original post>

<snip quoted problem description and pseudocode>

This is performing miserably SLOW. I have the following questions:

Have you profiled? Where is your time really coming from?

Repost with a profile and then we can give some real suggestions.

  i. How fast is f.readline? I want to use the maximum buffering
possible for the largest speed gains. In Ruby, how do I set the buffer size?
I looked through io.c, and it seems that readline essentially uses getc
(stopping when it gets a newline). How can I set the buffer size for the
underlying libc FILE*? Oh, btw, each line is approx 200-400 bytes.

I seriously doubt that this is your choke-point.

  ii. Marshal.dump is also very slow. Is there an alternative? YAML is
even worse.

Marshal.dump is pretty fast, probably as fast as you're going to get for a serialization format. _why did some benchmarks back in the day and it beat out the other P languages.

That said, why are you even using it? Why not just add raw strings?

  v. Would coding the readline part in C using RubyInline offer me speed
advantages?

No.

(or, very unlikely)

  vi. I am thinking of trying the following to reduce the time it takes,
I would very much welcome your comments:

Profile, profile, profile.

    a. Remove Marshal.dump [I don't need to strictly serialize objects,
only dump the data and read it back] and replace it with some string
form which is more compact. Actually, is it possible to have something
like fixed-length structures as in C? For example, I would want P to be
like this: struct P { char foo[100]; int a[100]; }. This way I think
the IO would be faster, as I could just dump a fixed number of bytes to a
file.

Yes, do this, simpler is better.

Try #pack and #unpack.
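
For example (the field widths here are made up, just to show the shape):

  # 100-byte space-padded name followed by 100 32-bit big-endian ints
  FORMAT      = 'A100N100'
  RECORD_SIZE = 100 + 100 * 4

  record = ['host-42', *(0...100).to_a].pack(FORMAT)
  record.size                    # => 500

  name, *hops = record.unpack(FORMAT)
  name                           # => "host-42"
  hops.first(3)                  # => [0, 1, 2]

Reading it back is then just io.read(RECORD_SIZE) followed by #unpack,
no Marshal involved.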

    b. Try to reduce the memory consumption of this by reducing k further so that the program doesn't page in/out.

You already said it isn't paging...

    c. Can someone point me to some good sample code for reading a file line by line in C and then putting it into a ruby hashtable?

No. Profile, profile, profile.

    d. How much of the slowness is due to the fact that it is ruby and not C?

We can't tell you without a profile. Profile, profile, profile.

To give you an idea of how slow this actually is: just reading all the
files line by line takes around 8-9 hrs. Whereas the above easily takes
5-6 days!! And I am quite unable to run a profile on my code as it is just
too slow.

Lies.

Use a reduced dataset and profile with ruby-prof or zenprofile.

You know nothing without a profile.
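
Something as small as this will tell you where the time goes (assuming
the ruby-prof gem is installed; process_one_file stands in for whatever
your real per-file code is):

  require 'rubygems'
  require 'ruby-prof'

  result = RubyProf.profile do
    process_one_file 'f1.sample'    # run the real code on a small sample
  end

  RubyProf::FlatPrinter.new(result).print(STDOUT)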

I would be very grateful for your comments, and particularly if you have
any suggestions/experience on doing this in a fast way.

Profile it, you can't make sane changes without one.

···

On Nov 15, 2006, at 11:21 AM, Devesh Agrawal wrote:

--
Eric Hodel - drbrain@segment7.net - http://blog.segment7.net
This implementation is HODEL-HASH-9600 compliant

http://trackmap.robotcoop.com

Devesh Agrawal wrote:

  I am using ruby to analyse a huge (around 60G) amount of my networking experiment data. Let me briefly describe my technique: I have to read around 40 files (of around 1.5G each) named f1, f2, .... Each file fi contains traceroutes to lots of destinations at different times, i.e. a file is basically a list of traceroutes launched from a given src (src = filename) at different times. I want to get a structure like the following: (list of all traceroutes from *all* src's at time 1), (list of all traceroutes from *all* src's at time 2)... and so on.

// Btw, I can't use an array instead of a hashtable H, as the P.time values read across all files needn't be the same.

This is performing miserably SLOW. I have the following questions:

First I have a question: why do you read those files in parallel in the first place?

  i. How fast is f.readline? I want to use the maximum buffering possible for the largest speed gains. In Ruby, how do I set the buffer size? I looked through io.c, and it seems that readline essentially uses getc (stopping when it gets a newline). How can I set the buffer size for the underlying libc FILE*? Oh, btw, each line is approx 200-400 bytes.

  ii. Marshal.dump is also very slow. Is there an alternative? YAML is even worse.

No, Marshal is actually pretty fast. It may be due to the other IO you do or because of the data you write.

  iii. Is it bad to have around 40-50 files open at the same time?

No, but reading from all those files in /parallel/ is. It is of course platform dependent how the IO susystem deals with that but chances are that the disk heads have to move back and forth between all the files.

  iv. The program does use a lot of memory but not so much; around 30-40
percent of a 1G RAM machine is used by it. So I think paging in/out is not a problem.

It is better not to believe but to know that paging is not an issue.

  v. Would coding the readline part in C using RubyInline offer me speed advantages?

Very unlikely.

  vi. I am thinking of trying the following to reduce the time it takes, I would very much welcome your comments:

    a. Remove Marshal.dump [I don't need to strictly serialize objects, only dump the data and read it back] and replace it with some string form which is more compact. Actually, is it possible to have something like fixed-length structures as in C? For example, I would want P to be like this: struct P { char foo[100]; int a[100]; }. This way I think the IO would be faster, as I could just dump a fixed number of bytes to a file.

Don't do the writing to a temp file while you are reading. This puts even more burden on your IO subsystem. Btw, what filesystem do you use? You're not happening to be on SuSE with ReiserFS?

    b. Try to reduce the memory consumption of this by reducing k further so that the program doesn't page in/out.

Without /knowing/ that paging is an issue this does not make sense.

    c. Can someone point me to some good sample code for reading a file line by line in C and then putting it into a ruby hashtable?
    d. How much of the slowness is due to the fact that it is ruby and not C?

Here's my assessment: you do not have a programming language problem but a design problem. Reading from multiple large files at the same time is inefficient.

To give you an idea of how slow this actually is: just reading all the files
line by line takes around 8-9 hrs.

You need 8 hours just for reading 60GB? That's 2.1MB/s - this seems unrealistically slow. Is this old hardware? Do you have drive problems? Are there any other disk intensive tasks going on (a busy DB or web server) on that machine?

> Whereas the above easily takes 5-6 days!! And I am quite unable to run a profile on my code as it is just too slow.

Clearly.

I would be very grateful for your comments, and particularly if you have any suggestions/experience on doing this in a fast way.

Here's my advice: rewrite the program to read those files sequentially because that is likely faster on most systems. Remove the temp file writing while files are read. And find out why your IO system is performing so badly.
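
Roughly like this (a sketch only; the parse is a placeholder and the
f* glob stands for the input files):

  # one source file at a time; nothing is written until the file is fully read
  Dir['f*'].each do |path|
    by_time = Hash.new { |h, k| h[k] = [] }

    File.foreach(path) do |line|
      time, data = line.split(' ', 2)          # placeholder parse
      by_time[time] << data
    end

    File.open("#{path}.by_time", 'wb') do |out|
      by_time.keys.sort.each { |t| out << Marshal.dump([t, by_time[t]]) }
    end
  end

Merging the per-source outputs by time is then a separate, sequential pass.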

Kind regards

  robert

<snip commentary>

use threads:

harp:~ > ls -ltarh data/big
-rw-rw-r-- 1 ahoward ahoward 251M Nov 15 16:41 data/big

harp:~ > wc -l data/big
7131456 data/big

harp:~ > head data/big
1163633749.75535 = 0.0877033759844916
1163633749.75565 = 0.913142160532852
1163633749.75569 = 0.604664544000001
1163633749.75571 = 0.233515496811041
1163633749.75574 = 0.221557802561386
1163633749.75577 = 0.241982349948893
1163633749.7558 = 0.190149667316971
1163633749.75583 = 0.0827503931446325
1163633749.75586 = 0.656736160359522
1163633749.75588 = 0.222579171509354

using my threaded program i can process these lines at a rate of about

harp:~ > ruby a.rb data/big 42 42
rate : 0.0 lines/second
rate : 14094.9956719788 lines/second
rate : 21059.6378674178 lines/second
rate : 22742.0758748341 lines/second
rate : 20527.6435560232 lines/second
rate : 16541.8249949497 lines/second
rate : 18295.1759919063 lines/second
rate : 19648.3251130277 lines/second

so, let's say about 20,000 lines per second - so it'd take about 5 minutes to
process my 250mb file. i realize yours is more complex, but lets say you
should be able to do 1gb data on a reasonably fast machine in something like
20-30 minutes.

btw. mmap won't help (much) since you are just doing a sequential read of ALL
of the data anyhow...

here's the code

harp:~ > cat a.rb

class LineProducer
   require 'thread'
   EOF = nil

   # a queue whose #pop returns EOF forever once EOF has been pushed
   class EOFQueue < ::Queue
     def push arg
       if arg == EOF
         class << self; def pop() EOF end; end
       end
     ensure
       return super
     end
   end

   def initialize path, bufsize
     @path = path
     @bufsize = Integer bufsize
     @lines = EOFQueue.new
     @thread = new_reader
   end

   def join() @thread.join end

   # reader thread: read the file in @bufsize chunks, split into complete
   # lines, and carry any trailing partial line over to the next chunk
   def new_reader
     Thread.new do
       Thread.current.abort_on_exception = true
       buf = ''
       open(@path) do |f|
         while((buf2 = f.read @bufsize))
           buf << buf2
           lines = buf.scan %r/^.*$\n?/
           if lines.last =~ %r/$\n/
             buf = ''
           else
             buf = lines.pop
           end
           @lines.push lines
         end
       end
       @lines.push EOF
     end
   end

   def new_consumer &b
     Thread.new{
       Thread.current.abort_on_exception = true
       while((lines = @lines.pop)); b[ lines ]; end
     }
   end
end

# wraps a Hash and serializes every call through a Sync lock
class ThreadSafeHash
   require 'sync'
   instance_methods.each{|m| undef_method m unless m[%r/__/]}
   def initialize
     @hash = {}
     @sync = Sync.new
   end
   def method_missing m, *a, &b
     @sync.synchronize{ @hash.send m, *a, &b }
   end
   def respond_to? m, *a, &b
     @sync.synchronize{ @hash.respond_to? m, *a, &b }
   end
end

path = ARGV.shift or abort 'no path'
n = Integer(ARGV.shift || 42)
hn = Integer(ARGV.shift || 42)
mb = 2 ** 20

lines_producer = LineProducer.new path, mb
tsh = ThreadSafeHash.new

start = Time.now.to_f
elapsed = lambda{|start| Time.now.to_f - start.to_f}
progress = Thread.new{ loop{ puts "rate : #{ tsh.size / elapsed[start] } lines/second"; sleep 2 } }

parse = lambda{|line| line.split(%r/=/).map{|word| word.strip}}

threads = Array.new(n).map{
   lines_producer.new_consumer do |lines|
     h = {} and lines.each do |line|
       k, v = parse[ line ]
       h[k] = v
     end
     tsh.update h
   end
}

lines_producer.join
threads.map{|t| t.join}

regards.

-a

···

On Thu, 16 Nov 2006, Devesh Agrawal wrote:

<snip quoted original post>

--
my religion is very simple. my religion is kindness. -- the dalai lama

Devesh Agrawal wrote:

<snip quoted original post>

Basically what you are describing here looks like a poorly specified database load followed by queries. Rather than trying to process 60 GB of data from a bunch of flat files in and out of data structures in an interpreted dynamic garbage collected language, what I would recommend is the following:

1. Pick up the book "Data Crunching" from Pragmatic Programmers. Learn about databases, tables and queries.
2. Choose a convenient database. I think for something this size, your options are Microsoft SQL Server Express and PostgreSQL, unless you can afford Oracle. This is way too big for MS Access and possibly for MySQL, although I haven't tried anything on this scale with MySQL, just SQL Server and PostgreSQL.
3. Write a *simple* line-by-line Ruby script to just parse your data and insert the rows into a database. I haven't looked at traceroute data recently and I don't quite understand the queries you'd want, so I can't give you hints on database design/table structure other than that you need one for a problem of this magnitude.
4. Once you've got the data loaded into the database, there are quite a few tools you can use to make the data mining and querying easier. Again, since I don't really understand your objectives, I can't pick one for you. If you're in the mood to learn Rails, you can probably do some interesting things that way.

I do a lot of stuff like this. Usually I massage the data with Perl into a CSV format, then load it manually into PostgreSQL. After that, I do my data mining with R using ODBC. But you can do all of it in Ruby. The key is to use an industrial strength database for 60 GB worth of data you're trying to mine.
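
A bare-bones version of step 3 looks something like this (I'm using
SQLite via the sqlite3 gem only to keep the example self-contained; the
same shape works against PostgreSQL. The column layout and the parse are
placeholders, since I don't know the real traceroute format):

  require 'sqlite3'

  db = SQLite3::Database.new('traceroutes.db')
  db.execute <<-SQL
    CREATE TABLE IF NOT EXISTS hops (
      src  TEXT,
      time REAL,
      hop  TEXT
    )
  SQL

  insert = db.prepare('INSERT INTO hops (src, time, hop) VALUES (?, ?, ?)')

  Dir['f*'].each do |path|
    db.transaction do                          # one transaction per file
      File.foreach(path) do |line|
        time, hop = line.split(' ', 2)         # placeholder parse
        insert.execute(path, time.to_f, hop)
      end
    end
  end

Once it's loaded, "all traceroutes at time t" is a single
SELECT ... WHERE time = ? ORDER BY src, instead of a pass over 60 GB of
flat files.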

···

--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.

Hi,
Farrel Lifson wrote:

Could you not parallelise the processing of each file? Perhaps using
something like starfish (http://www.rufy.com/starfish/doc/)?

Did you mean parallelizing across multiple files or parallelizing the
processing of one file?

Yes and no. But this involves me getting deeper into a description of my
problem:

Each file has traceroutes at lots of times t1, t2, .... The objective is to
collect all the traceroutes that happened at a given time into one structure.
Hence I could do something like this: using ruby threads (or whatever),
read each file and store it into one *common* hashtable, and then sync
the hashtable to disk whenever it has grown large enough.
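
Roughly, I mean something like this (the parse is a placeholder, and I
know the threads themselves won't make the disk reads overlap):

  require 'thread'

  common = Hash.new { |h, k| h[k] = [] }
  lock   = Mutex.new

  threads = Dir['f*'].map do |path|
    Thread.new do
      File.foreach(path) do |line|
        time, data = line.split(' ', 2)        # placeholder parse
        lock.synchronize { common[time] << data }
      end
    end
  end

  threads.each { |t| t.join }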

I was hoping more for some kind of fast way to read lines, using, say,
mmap or something like that. I read a few posts about how using mmap
helped someone else. I will look into this starfish; I rejected ruby
threads as, unlike pthreads, they aren't really true threads.

Thanks for replying. Is there something wrong or inherently slow with
the things I am doing? Can it be sped up?

···

On 15/11/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:

Farrel

--
Posted via http://www.ruby-forum.com/.

Hi Everyone, Thanks for replying.

First a couple of stupid things that I am doing, then a couple of more
questions.

Stupid things:
i. I parse each line, but that essentially amounts to splitting it and
such. So maybe I should be better off using scanf kind of stuff and
also compiling regexps once and using them for each line. I heard that
helps.

ii. Going thru my hash H to delete keys is rather stupid. A better way
is to just dump it and trash it.

iii. When I said it takes 8-9 hours, I meant 8-9 hours to read each
line and parse it (with my rather inefficient parse function). And I
did this one file at a time, in sequence.

iv. I will definitely try to profile my code. I did use -rprofile but
that kind of revealed the obvious: that most of my time was spent in
the reading-file loop > marshalling > parsing. Plus it was rather slow for
even 100MB of stuff.

Oh and btw, the disk I am using is an LVM mapped ext3 local disk. The
server runs Centos. And I think it is pretty well managed.

Now my questions:

i. I was earlier doing this: read each file in sequence, and then
write out a similar hash, containing data from one src for each time.
Even with this hash, I used to dump it to temporary files (one per hash
key = time instant), and then, after doing this for every file (src),
I would merge all the data in the temporary files into what I want.
This was slow (but in retrospect not as slow as what I am doing now by
reading in parallel). The reason I thought opening multiple files and
reading from them line by line would be faster is that I wouldn't have
to open/close all those thousands of temporary files.
I still don't see the reason why doing things in parallel would screw
things up? My assumption is that the disk head is pretty wild anyway,
as the system will involve a lot of IO between context switches.

ii. Also, why do you think that writing another file in the midst of
reading (one or more) is a bad idea? The only way I can avoid
this is to chunk up the files into smaller units and then process them,
writing their results onto temp files and then proceeding further
with the other files. Which one do you think is the better thing to do?

I will now try the following (based on your suggestions):
i. Avoid heavyweight regexps in parsing lines
ii. Avoid marshaling/dumping
iii. Focus on doing things a file at a time.
iv. If all else fails, I will try using something like starfish etc.

Btw, will using something like an mmap extension for ruby speed things
up for me?

Thanks a lot to all of you. I am sorry if I sound lame, but I am pretty
new to using ruby.

···

On Nov 15, 4:24 pm, Robert Klemme <shortcut...@googlemail.com> wrote:

<snip quoted reply from Robert Klemme>

I think this

  if (H has more than k keys ? (ie has it become very large))
    H.keys.sort.each{|t|
      outputfile << Marshal.dump(H[t])
      H.delete(t)
    }
  end

can just be changed to

  if (H has more than k keys ?)
    outputfile << Marshal.dump(H)
  end
  H = {}

Farrel

···

On 15/11/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:

<snip quoted reply>

Also try running your code with some sample data through the ruby
profiler (just run 'ruby -rprofiler yourcode.rb') and it should give
you an idea where your program is spending it's time.

Farrel

···

On 15/11/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:

<snip quoted reply>

Maybe it's possible to preprocess the files with grep or something
similar (like splitting into hour-long slices, etc.), using ruby just
for the sorting and merging - that way you could make it parallel, and
besides, grep&co are supposed to be much faster.

···

On 11/15/06, Devesh Agrawal <dagrawal@cs.umass.edu> wrote:

<snip quoted reply>

devesh wrote:

/ ...

I still don't see the reason why doing things in parallel would screw
things up?

If you read from more than one file simultaneously, the drive heads must
constantly slew back and forth from the location of one file to another.
The more files open, the worse this becomes. Files are laid out on a drive
in a logical way, and the data should be read in that same logical way --
one file at a time.

Imagine shopping for groceries and washing your car simultaneously. A lot of
moving back and forth to do them at once, don't you think? Maybe washing
your car, then shopping, would be a better use of your time.

My assumption is that the disk head is pretty wild anyway, as the system
will involve a lot of IO between context switches.

No matter how bad it is, you can always make it worse by opening multiple
files.

ii. Also, why do you think that writing another file in the midst of
reading (one or more) is a bad idea?

Same reason -- disk thrashing.

···

--
Paul Lutus
http://www.arachnoid.com

Hi Everyone, Thanks for replying.

First a couple of stupid things that I am doing, then a couple of more
questions.

Stupid things:
i. I parse each line, but that essentially amounts to splitting it and
such. So maybe I should be better off using scanf kind of stuff and
also compiling regexps once and using them for each line. I heard that
helps.

Using regexps inline is the most efficient way in Ruby. You can find numerous benchmark postings in this group.
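
If you want to check it on your own data, the standard benchmark library
makes that a two-minute experiment (the sample line below just reuses the
format from the data posted earlier in this thread):

  require 'benchmark'

  line = '1163633749.75535 = 0.0877033759844916'
  n = 500_000

  Benchmark.bm(15) do |bm|
    bm.report('inline regexp') { n.times { line =~ /^(\S+) = (\S+)/ and [$1, $2] } }
    bm.report('split')         { n.times { line.split(' = ') } }
  end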

ii. Going thru my hash H to delete keys is rather stupid. A better way
is to just dump it and trash it.

I don't exactly understand what you mean but that's probably because I still do not have a clear picture of the processing you are doing.

iii. When I said it takes 8-9 hours, I meant 8-9 hours to read each
line and parse it (with my rather inefficient parse function). And I
did this one file at a time, in sequence.

iv. I will definitely try to profile my code. I did use -rprofile but
that kind of revealed the obvious: that most of my time was spent in
the reading-file loop > marshalling > parsing. Plus it was rather slow for
even 100MB of stuff.

I don't think you have a CPU issue - this looks rather like an IO issue.

Oh and btw, the disk I am using is an LVM mapped ext3 local disk. The
server runs Centos. And I think it is pretty well managed.

Sounds good.

Now my questions:

i. I was earlier doing this: read each file in sequence, and then
write out a similar hash, containing data from one src for each time.
Even with this hash, I used to dump it to temporary files (one per hash
key = time instant), and then, after doing this for every file (src),
I would merge all the data in the temporary files into what I want.
This was slow (but in retrospect not as slow as what I am doing now by
reading in parallel). The reason I thought opening multiple files and
reading from them line by line would be faster is that I wouldn't have
to open/close all those thousands of temporary files.

Wait, did you say "thousands of temporary files"? Wow, *that* is likely going to kill performance.

I still don't see the reason why doing things in parallel would screw
things up? My assumption is that the disk head is pretty wild anyway,
as the system will involve a lot of IO between context switches.

That entirely depends on the other processing going on on that machine. And there is no point in making it worse by accessing multiple files at the same time.

ii. Also, why do you think that writing another file in the midst of
reading (one or more) is a bad idea? The only way I can avoid
this is to chunk up the files into smaller units and then process them,
writing their results onto temp files and then proceeding further
with the other files. Which one do you think is the better thing to do?

Writing is usually slower than reading. But it entirely depends on what you actually do.

I will now try the following (based on your suggestions):
i. Avoid heavyweight regexps in parsing lines

Not necessarily a good idea - especially it seems your task is IO bound. You can easily verify with "vmstat 5" while you are doing your processing.

ii. Avoid marshaling/dumping

This in itself is not slow. But if you do this into multiple small files you'll have problems - but not because of Marshal being slow but because of the IO.

iii. Focus on doing things a file at a time.
iv. If all else fails, I will try using something like starfish etc.

Btw, will using something like an mmap extension for ruby speed things
up for me?

I don't believe so because your problem seems to be the IO access pattern.

Thanks a lot to all of you. I am sorry if I sound lame, but I am pretty
new to using ruby.

Can you describe in more detail what you are actually doing? Kind of: are you combining data from multiple files? How long do you need temp data? What do you output - things like that.

It may actually turn out that Ed's advice to use a RDBMS is the best solution because that gives you efficient IO. But to confirm or reject that I would have to have a better understanding what business problem you are trying to solve.

Cheers

  robert

···

On 15.11.2006 23:35, devesh wrote:

Btw, will using something like an mmap extension for ruby speed things
up for me?

Can't harm to try. I use it for uploading large files and it helps
there.

Unfortunately Operating systems don't always behave like you think they
ought to when doing something like this. Get to know your profiler and
brush up on your data processing techniques. When you hit a speed
problem, you *have* to start and understand what the machine is doing
with each high level instruction you give it and how they all interact
with each other.

NeilW

Whoops! make that
   if (H has more than k keys ?)
     outputfile << Marshal.dump(H)
     H={}
   end

···

On 15/11/06, Farrel Lifson <farrel.lifson@gmail.com> wrote:

<snip quoted reply>

Btw, will using something like an mmap extension for ruby speed things
up for me?

Can't harm to try. I use it for uploading large files and it helps
there.

I beg to differ: it can actually harm to apply an optimization measure without making sure that it actually solves the problem. As long as you do not know what makes your program slow, trying out measures is pretty much a waste of time.

Unfortunately Operating systems don't always behave like you think they
ought to when doing something like this. Get to know your profiler and
brush up on your data processing techniques. When you hit a speed
problem, you *have* to start and understand what the machine is doing
with each high level instruction you give it and how they all interact
with each other.

Exactly. I'd start with vmstat to find out whether we are IO bound or CPU bound.

Regards

  robert

···

On 16.11.2006 10:51, Neil Wilson wrote:

Robert Klemme wrote:

<snip quoted reply>

Just for the sake of amusement, I did a traceroute to see what the output looks like:

$ traceroute -n rubyforge.org
traceroute to rubyforge.org (205.234.109.18), 30 hops max, 38 byte packets
1 192.168.0.1 4.080 ms 2.845 ms 1.932 ms
2 * * *
3 68.87.218.17 9.540 ms 7.114 ms 7.187 ms
4 68.87.216.29 8.025 ms * 9.980 ms
5 12.116.188.9 21.930 ms 21.446 ms 21.619 ms
6 12.123.12.126 23.060 ms 21.566 ms 21.739 ms
     MPLS Label=31697 CoS=0 TTL=1 S=1
7 12.123.13.185 22.617 ms 22.220 ms 20.848 ms
8 206.24.211.1 21.786 ms 22.516 ms 20.589 ms
9 204.70.194.50 30.330 ms 28.985 ms 204.70.192.113 106.454 ms
     MPLS Label=202112 CoS=0 TTL=1 S=1
10 204.70.192.85 62.240 ms 204.70.192.134 87.983 ms 86.297 ms
     MPLS Label=509440 CoS=0 TTL=1 S=1
11 204.70.192.50 64.302 ms 204.70.192.25 104.547 ms 204.70.192.50 65.496 ms
     MPLS Label=190688 CoS=0 TTL=1 S=1
12 204.70.192.69 81.608 ms 208.173.10.201 105.080 ms 204.70.192.69 82.155 ms
     MPLS Label=131056 CoS=0 TTL=1 S=1
13 208.173.50.154 104.882 ms 136.592 ms 204.70.192.62 95.377 ms
14 205.234.109.18 101.528 ms !<10> 206.24.238.218 100.340 ms 100.945 ms
15 205.234.109.18 100.920 ms !<10> 208.173.50.154 100.550 ms 205.234.109.18 100.802 ms !<10>

So ... we have *one* traceroute from point a (my system) to point b (rubyforge.org). I'm assuming the original poster has some simple shell script that does a loop forever around something like "date; traceroute -n ip.add.re.ss; sleep 300" to get a log of timestamped traceroutes every five minutes. Then he wants to collect hundreds of these lines from many files (60 GB total) into a huge in-core Ruby structure, then make some kind of report from it. It got so big from the naive approach that he started marshalling it out of core in chunks, leading to the code he posted here.

A little arithmetic: that single traceroute above is about 1000 bytes. 60 GB is 60 million times 1000 bytes, which means our collection of 60 GB of traceroute data has about 60 million individual traceroutes, assuming they are as long as my off-the-wall traceroute over the open internet to rubyforge. Obviously they will be much smaller if done on a well-designed corporate intranet, which means there are many more of them.

In any event, it looks to me like he is trying to reinvent some network monitoring wheel for which there are undoubtedly both expensive commercial packages and wonderful full-featured open source tools available. I wrote a lot of code like this ten years ago (in ksh, awk and Perl 4) when such tools weren't available. Now they are, and there are Ruby interfaces to most of them. So in addition to buying and learning from "Data Crunching", I'm going to recommend the original poster head over to

and do some window shopping. IIRC there is a Ruby interface to RRDTool, although I've forgotten whether it's currently actively maintained -- it wasn't a year or so ago when I first started investigating Ruby. And there are definitely tools to migrate a "native" round robin database into an "ordinary" relational database.

···

On 16.11.2006 10:51, Neil Wilson wrote:

--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.

It got so big from the naive approach that he started marshalling it out of core in chunks, leading to the code he posted here.

Which code? I saw only the pseudo code from the first posting. Is there something that hasn't made it through the gateway again?

In any event, it looks to me like he is trying to reinvent some network monitoring wheel for which there are undoubtedly both expensive commercial packages and wonderful full-featured open source tools available. I wrote a lot of code like this ten years ago (in ksh, awk and Perl 4) when such tools weren't available. Now they are, and there are Ruby interfaces to most of them.

This makes perfect sense to me.

Kind regards

  robert

···

On 16.11.2006 16:42, M. Edward (Ed) Borasky wrote:

Robert Klemme wrote:

···

On 16.11.2006 16:42, M. Edward (Ed) Borasky wrote:

It got so big from the naive approach that he started marshalling it out of core in chunks, leading to the code he posted here.

Which code? I saw only the pseudo code from the first posting. Is there something that hasn't made it through the gateway again?

I was referring to the pseudo-code.

--
M. Edward (Ed) Borasky, FBG, AB, PTA, PGS, MS, MNLP, NST, ACMC(P)
http://borasky-research.blogspot.com/

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.

ed. that quote is awesome.

-a

···

On Fri, 17 Nov 2006, M. Edward (Ed) Borasky wrote:

If God had meant for carrots to be eaten cooked, He would have given rabbits fire.

--
my religion is very simple. my religion is kindness. -- the dalai lama