Is lots of files with Threads faster?

Chris_Richards · 7 February 2008 20:21

I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

Thanks
Chris

--
Posted via http://www.ruby-forum.com/.

Tim_Pease · 7 February 2008 20:33

Better to do it sequentially, since (1) Ruby is single-threaded anyway, (2) disk IO is going to be the biggest bottleneck, and (3) you'll most likely run out of file descriptors.

Blessings,
TwP

Gavin_Kistner3 · 7 February 2008 20:35

I suspect it depends on how long the parsing of data takes.

If it's fast, trying to read 50 files simultaneously will likely (I'm
guessing) cause disk thrashing that will slow you down.

If processing each file is much longer than reading the file from
disk, and you have multiple CPUs, and can use native threads, and can
schedule the read of one file to begin after another ends...probably
you can speed things up.

I made all those answers up, but I'm guessing they're correct.

Joel_VanderWerf1 · 7 February 2008 20:39

Is it possible that in the future you will need to do this with sockets in place of files?

--
vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

MenTaLguY1 · 7 February 2008 20:42

There's the same amount of IO bandwidth to go around no matter how many
threads you throw at the problem (and in practice if you add more threads you
start wasting bandwidth due to seeking and other overhead). Given that,
it's almost always best to do things sequentially.

If you are using a native-threaded runtime (e.g. JRuby), and you can prove
that you aren't consuming most of the available IO bandwidth yet (e.g. because
parsing is taking longer than the IO), then _maybe_ consider using multiple
threads, but then you need to be careful to only use enough to consume the
available IO bandwidth and no more. If you want to use your IO bandwidth most
effectively, asynchronous IO (e.g. with libev, etc.) is often a better idea.
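
A minimal sketch of the "only enough threads and no more" idea above, using a fixed-size pool of workers fed from a `Queue`. The names `parse_file` and `parse_with_pool` are hypothetical, the line-counting body is only a placeholder for real parsing, and `Queue#close` assumes Ruby 2.3 or later. The right worker count is something to measure on the target machine, not something this sketch can decide.

```ruby
# Hypothetical stand-in for the real parser: counts the lines in a file.
def parse_file(path)
  File.foreach(path).count
end

# Parse every path using at most worker_count threads.
# Returns a hash of path => parse result.
def parse_with_pool(paths, worker_count)
  queue = Queue.new
  paths.each { |path| queue << path }
  queue.close # once drained, pop returns nil instead of blocking

  results = {}
  lock = Mutex.new

  workers = Array.new(worker_count) do
    Thread.new do
      while (path = queue.pop)
        parsed = parse_file(path)
        lock.synchronize { results[path] = parsed }
      end
    end
  end

  workers.each(&:join)
  results
end
```

With `worker_count` set to 1 this degenerates to the sequential case, which makes it easy to compare the two on the same data.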

-mental

John_Carter · 7 February 2008 20:52

Prefer processes to threads on unix.
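
As a rough illustration of the process-based approach, here is a sketch that forks one child per file and sends each result back over a pipe. It assumes a Unix-like system (`fork` is not available everywhere), and `parse_file` and `parse_in_processes` are hypothetical names with a placeholder parser. One process per file is only reasonable for a modest number of files.

```ruby
# Hypothetical stand-in for the real parser: counts the lines in a file.
def parse_file(path)
  File.foreach(path).count
end

# Fork one child per path; each child marshals its result into a pipe.
# Returns a hash of path => parse result.
def parse_in_processes(paths)
  jobs = paths.map do |path|
    reader, writer = IO.pipe
    pid = fork do
      reader.close
      writer.write(Marshal.dump(parse_file(path)))
      writer.close
      exit!(true) # skip at_exit handlers inherited from the parent
    end
    writer.close # the parent only reads
    [path, pid, reader]
  end

  jobs.map do |path, pid, reader|
    result = Marshal.load(reader.read) # read returns at EOF, when the child closes its end
    reader.close
    Process.wait(pid)
    [path, result]
  end.to_h
end
```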

Depends on whether you have multiple cores.

Depends on what the file devices are. I have one small app where the
fds are sockets to machines that may or may not have a certain other
application up (the app finds out).

I spin one thread per machine, and open all connections in
parallel. The time to completion is the time for a single connect
fail, which is about N times faster than testing each connection in
series.
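
A sketch of that one-thread-per-machine pattern, not the actual app: `port_open?` and `probe_all` are hypothetical names, and the port and timeout are whatever the caller supplies. Each thread blocks on its own connect, so the total wait is roughly the slowest single attempt rather than the sum of all of them.

```ruby
require "socket"

# True if a TCP connection to host:port succeeds within the timeout.
def port_open?(host, port, timeout_seconds)
  Socket.tcp(host, port, connect_timeout: timeout_seconds) { true }
rescue SystemCallError, SocketError, IOError
  # Refused, unreachable, unresolvable, or timed out.
  false
end

# Probe every host in parallel, one thread per host.
# Returns a hash of host => true/false.
def probe_all(hosts, port, timeout_seconds = 2)
  threads = hosts.map do |host|
    Thread.new { [host, port_open?(host, port, timeout_seconds)] }
  end
  # Thread#value joins the thread and returns the block's result.
  threads.map(&:value).to_h
end
```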

It also depends on data locality. Cache is many times faster than
RAM. If you can live in cache, you go much faster. If multiple threads
mean you spend less time in cache, you go much slower.

John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : john.carter@tait.co.nz
New Zealand

James_Tucker · 9 February 2008 17:19

Take a look at the Wide Finder implementations on Tim Bray's blog.

It's quite interesting to see how little of a bottleneck IO was over there (contrary to the claim that has been repeated a number of times here).

Whilst the test environment is probably drastically different from your own, it might be worth looking at how some of those solutions solved the problem; they also make good reading on the topic.

Robert_K1 · 8 February 2008 10:17

> Im required to open 50+ files and parse the data in them. WOuld using
> multiple threads give me the best performance? or is it best just to do
> it sequentially?

> There's the same amount of IO bandwidth to go around no matter how many
> threads you throw at the problem (and in practice if you add more threads you
> start wasting bandwidth due to seeking and other overhead). Given that,
> it's almost always best to do things sequentially.

... unless all the files reside on different IO devices, in which case
parallel reading *can* be faster than sequential reading. If they are on
the same filesystem, I'd certainly prefer to read them sequentially.
There might be a slight performance gain from decoupling reading,
parsing (and probably output) into different threads, but that mostly
depends on IO speed and processing complexity, and the slowest stage
determines throughput no matter what.
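
A minimal sketch of that decoupling, assuming one reader thread and one parser thread joined by a bounded `SizedQueue`. The names `parse` and `read_then_parse` are hypothetical and the word-counting parser is a placeholder. The files are still read strictly one after another; only the parsing overlaps with the next read.

```ruby
# Hypothetical stand-in for the real parser: counts words in a string.
def parse(text)
  text.split.size
end

# Read files sequentially in one thread while a second thread parses them.
# Returns a hash of path => parse result.
def read_then_parse(paths, buffer_size = 4)
  # The reader blocks when the buffer is full, which bounds memory use.
  buffer = SizedQueue.new(buffer_size)

  reader = Thread.new do
    begin
      paths.each { |path| buffer << [path, File.read(path)] }
    ensure
      buffer.close # lets the parser finish even if a read fails
    end
  end

  results = {}
  parser = Thread.new do
    while (item = buffer.pop) # nil once the queue is closed and empty
      path, text = item
      results[path] = parse(text)
    end
  end

  [reader, parser].each(&:join)
  results
end
```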

> If you are using a native-threaded runtime (e.g. JRuby), and you can prove
> that you aren't consuming most of the available IO bandwidth yet (e.g. because
> parsing is taking longer than the IO), then _maybe_ consider using multiple
> threads, but then you need to be careful to only use enough to consume the
> available IO bandwidth and no more. If you want to use your IO bandwidth most
> effectively, asynchronous IO (e.g. with libev, etc.) is often a better idea.

Good points.

Cheers

robert

2008/2/7, MenTaLguY <mental@rydia.net>:

On Fri, 8 Feb 2008 05:21:51 +0900, Chris Richards <evilgeenius@gmail.com> wrote:

--
use.inject do |as, often| as.you_can - without end

Francis_Cianfrocca · 10 February 2008 12:19

I basically gave up on optimizing hard-disk I/O long ago. (In
Ruby/EventMachine, I started adding an event-driven interface for disk
files, and will probably complete it someday, but initial profiling showed
relatively little benefit.)

A big part of the problem is that different machines have different
controller hardware, with a wide variance not only in raw performance, but
also in caching strategies and in the way they schedule the physical seeks.
Multispindle systems change the behavior yet again. You can develop on one
machine hoping to get some level of performance improvement, and find a
totally different behavior when you go to production.

a11 · 10 February 2008 17:02

good advice. i've had quite a bit of experience optimizing large scale processing (really large) and seen that there is always an optimal io/cpu usage pattern (two processes per cpu in dual-cpu machines with dual disk controllers, etc) but also that it is *always* specific to the exact hardware setup. i agree that it's mostly impossible to try to come up with a generic solution.

cheers.

a @ http://codeforpeople.com/

--
share your knowledge. it's a way to achieve immortality.
h.h. the 14th dalai lama

Chris_Richards · 11 February 2008 00:08

Wow... I just tried JRuby 1.1 on my script that opens a thousand files and
processes them.

Ruby: 11 seconds
JRuby, first run: 3.3 seconds
JRuby, second run: 1.1 seconds

Very nice, darlin'!
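
For anyone wanting to take similar measurements, here is a small sketch using the standard `Benchmark` module. It is not the script above: `parse_file` and `time_runs` are hypothetical names, and the parser is a placeholder. It repeats the run inside one process, so differences between the first and later timings also reflect warm caches, which is worth keeping in mind when comparing numbers.

```ruby
require "benchmark"

# Hypothetical stand-in for the real parser: counts the lines in a file.
def parse_file(path)
  File.foreach(path).count
end

# Time `runs` sequential passes over the files.
# Returns an array of wall-clock durations in seconds, one per pass.
def time_runs(paths, runs = 2)
  Array.new(runs) do
    Benchmark.realtime { paths.each { |path| parse_file(path) } }
  end
end
```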
