Process or Thread?

I have a parent application (which I think of as a test harness) that
wants to invoke a fairly intensive image processing application against
a directory full of image files. Each image is processed independently.

So, to get performance, I wanted to get the work happening on each of
those images in parallel. So I could divide the files in the directory
into two sets, and submit one set for processing in one process/thread
and the other set in another process/thread. Note that the
sub-process/threads are almost totally separate from the parent app, so
relatively little information needs to go back and forth.

Here is what I've learned so far from reading two books and lots of
googling:

One point is that there's no process support on Windows, which isn't a
deal killer for me.

Another point is the operation on multi-core CPUs: processes will use
multiple cores and threads will not. This too is fairly "don't care"
for me.

I am interested in ease of implementation and debugging. And I am also
very interested in getting the cpu and disk active at the same time as
there is a fairly large amount of data to be read from the disk.

What are your recommendations?

···

--
Posted via http://www.ruby-forum.com/.

Pito Salas wrote:

I have a parent application (which I think of as a test harness) that
wants to invoke a fairly intensive image processing application against
a directory full of image files. Each image is processed independently.

It doesn't sound like your situation will result in improved performance
with threads. Things don't actually get done at the same time with
threads--that's an illusion. What happens is that there is very fast
switching between different tasks. However, if your tasks do not have
dead time during the processing, then using threads won't improve
performance. For instance, suppose you have two tasks that each take 3
minutes to complete. The processing might happen in this order with
threads:

task1: 1 minute
task2: 1 minute
task1: 1 minute
task2: 1 minute
task1: 1 minute
task2: 1 minute

--------------
total = 6 minutes

But if you just ran each task sequentially without using threads, the
total time would also be 6 minutes. Using threads will only speed up
processing time if your tasks have idle time when they are doing
nothing. During that down time, if you switch to another task in
another thread, then total processing time will be lower.
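A minimal Ruby sketch of that point, assuming `sleep` is a fair stand-in for a task's idle time (for example, time spent blocked on disk reads). The helper names `waiting_task`, `elapsed`, `run_sequentially` and `run_in_threads` are made up for illustration:

```ruby
# Simulated task that spends all of its time idle (standing in for blocking I/O).
def waiting_task(seconds)
  sleep(seconds)
end

# Wall-clock time taken by the given block, in seconds.
def elapsed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Run the tasks one after another.
def run_sequentially(count, seconds)
  elapsed { count.times { waiting_task(seconds) } }
end

# Run each task in its own thread and wait for all of them.
def run_in_threads(count, seconds)
  elapsed do
    Array.new(count) { Thread.new { waiting_task(seconds) } }.each(&:join)
  end
end

if __FILE__ == $PROGRAM_NAME
  printf("sequential: %.2fs\n", run_sequentially(2, 0.5))
  printf("threaded:   %.2fs\n", run_in_threads(2, 0.5))
end
```

Because the tasks here do nothing but wait, the threaded run should finish in roughly the time of one task. A task that is purely CPU-bound would not show the same gain with threads on MRI.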


I have a parent application (which I think of as a test harness) that
wants to invoke a fairly intensive image processing application against
a directory full of image files. Each image is processed independently.

So, to get performance, I wanted to get the work happening on each of
those images in parallel. So I could divide the files in the directory
into two sets, and submit one set for processing in one process/thread
and the other set in another process/thread. Note that the
sub-process/threads are almost totally separate from the parent app, so
relatively little information needs to go back and forth.

Here is what I've learned so far from reading two books and lots of
googling:

One point is that there's no process support on Windows, which isn't a
deal killer for me.

Not quite. Look in Task Manager: there is a list of running processes.
What Windows lacks is fork(), the Unix way of creating processes. It
does, however, have CreateProcess (I think that's what it's called),
which behaves roughly like fork+exec.

If you split the "controller" process and the "worker" process into
two different programs, it won't be a problem. If you insist on
having them as one program, you'll need to do a bit more work
(add a command line argument telling the new process that it's a
worker process).
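A rough sketch of the single-program variant in Ruby, assuming the controller re-invokes the same script with a `--worker` flag through `Process.spawn` (which does not rely on fork()). The flag name and the helpers `process_image`, `partition`, `worker_command`, `run_worker` and `run_controller` are hypothetical, and `process_image` only stands in for the real image processing:

```ruby
require "rbconfig"

# Hypothetical stand-in for the real image processing step.
def process_image(path)
  puts "#{Process.pid}: #{path} (#{File.size(path)} bytes)"
end

# Split the paths round-robin into at most `count` non-empty sets.
def partition(paths, count)
  slices = Array.new(count) { [] }
  paths.each_with_index { |path, i| slices[i % count] << path }
  slices.reject(&:empty?)
end

# Command line that starts this same script as a worker for one set.
def worker_command(slice)
  [RbConfig.ruby, File.expand_path(__FILE__), "--worker", *slice]
end

# Worker side: process every path handed over on the command line.
def run_worker(paths)
  paths.each { |path| process_image(path) }
end

# Controller side: start one worker per set, then wait for all of them.
def run_controller(paths, worker_count = 2)
  pids = partition(paths, worker_count).map do |slice|
    Process.spawn(*worker_command(slice))
  end
  pids.map { |pid| Process.wait2(pid).last.success? }.all?
end

if ARGV.first == "--worker"
  run_worker(ARGV.drop(1))
  exit
elsif __FILE__ == $PROGRAM_NAME
  run_controller(ARGV)
end
```

Saved as, say, `harness.rb`, it could be run as `ruby harness.rb a.png b.png c.png`, where the file names are placeholders. The only information crossing the process boundary is the list of paths and each worker's exit status, which fits the "relatively little information" case described in the question.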

Another point is the operation on multi-core CPUs: processes will use
multiple cores and threads will not. This too is fairly "don't care"
for me.

Native threads will, Ruby green threads won't.

I am interested in ease of implementation and debugging.

Debugging is lots easier with processes, as one process cannot
accidentally overwrite data of another (shared memory is possible,
but needs to be allocated explicitly).

That may not be as big a problem with Ruby green threads, as the
runtime knows what each thread is up to.

And I am also
very interested in getting the cpu and disk active at the same time as
there is a fairly large amount of data to be read from the disk.

What are your recommendations?

I would go for processes. But that's coming from C, where there is no
runtime keeping track of what each thread is doing. With processes,
the OS will prevent one process from overwriting the data of another.

/Kent

···

On Sat, 22 Aug 2009 09:30:36 -0500, Pito Salas wrote:
--
"The Brothers are History"

Pito Salas wrote:

It doesn't sound like your situation will result in improved performance
with threads. Things don't actually get done at the same time with
threads--that's an illusion. What happens is that there is very fast
switching between different tasks. However, if your tasks do not have
dead time during the processing, then using threads won't improve
performance. For instance, suppose you have two tasks that each take 3
minutes to complete. The processing might happen in this order with
threads:

task1: 1 minute
task2: 1 minute
task1: 1 minute
task2: 1 minute
task1: 1 minute
task2: 1 minute
--------------
total = 6 minutes

That's true if you're running MRI, since it uses "green" threads (i.e., you
really have a single OS-level thread that gets task-switched by Ruby
itself). However, if you run on JRuby, the Ruby Thread support gets mapped
onto the Java Thread support, which *does* map to OS-level threads and
therefore will take advantage of multiple cores if you have them. In that
case you *would* get faster processing.

Hope this helps,
-Mario.

···

On Sat, Aug 22, 2009 at 16:47, 7stud -- <bbxx789_05ss@yahoo.com> wrote:

IMHO a multitude of processes does not necessarily ease debugging. If you need to find out which process is running berserk or exhibiting a bug that may be more difficult than debugging of a single interpreter process. Also, if there are communication issues between two processes that may be difficult to debug as well.

Having said that, both approaches are pretty easy to implement, given that DRb is a full fledged remote object call feature (similar to RMI and CORBA).
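For illustration, a small DRb sketch, assuming a worker object exposed on a loopback URI. The `ImageWorker` class and the helpers `start_worker_service` and `remote_worker` are hypothetical names, and the `process` method is only a placeholder for real work:

```ruby
require "drb/drb"

# Hypothetical worker object whose methods can be called remotely.
class ImageWorker
  def process(path)
    "processed #{File.basename(path)}"
  end
end

# Worker side: expose an ImageWorker and return the URI it is served on.
# Port 0 asks the OS for any free port.
def start_worker_service(uri = "druby://127.0.0.1:0")
  DRb.start_service(uri, ImageWorker.new)
  DRb.uri
end

# Controller side: proxy object that forwards method calls to the worker.
def remote_worker(uri)
  DRbObject.new_with_uri(uri)
end

if __FILE__ == $PROGRAM_NAME
  uri = start_worker_service
  puts "worker listening on #{uri}"
  puts remote_worker(uri).process("/tmp/example.png")
  DRb.stop_service
end
```

In a real setup the worker would run in its own process (ending with `DRb.thread.join` to stay alive) and the controller would connect using a URI agreed on in advance. The demo above keeps both sides in one process only to stay self-contained.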

Kind regards

  robert

···

On 23.08.2009 01:28, Kent Friis wrote:

I am interested in ease of implementation and debugging.

Debugging is lots easier with processes, as one process cannot
accidentally overwrite data of another (shared memory is possible,
but needs to be allocated explicitly).

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

For a CPU-intensive task (image processing), I doubt that two OS
threads running on two cores are going to be any more efficient
than two processes running on two cores. Multi-threading introduces
complications that are neatly avoided by using multiple processes.
I'd much rather deal with a multi-process architecture than a
multi-threaded architecture.

Gary Wright

···

On Aug 22, 2009, at 1:43 PM, Mario Camou wrote:

That's true if you're running MRI, since it uses "green" threads (i.e., you
really have a single OS-level thread that gets task-switched by Ruby
itself). However, if you run on JRuby, the Ruby Thread support gets mapped
onto the Java Thread support, which *does* map to OS-level threads and
therefore will take advantage of multiple cores if you have them. In that
case you *would* get faster processing.

Gary Wright wrote:

For a CPU-intensive task (image processing), I doubt that two OS
threads running on two cores are going to be any more efficient
than two processes running on two cores. Multi-threading introduces
complications that are neatly avoided by using multiple processes.
I'd much rather deal with a multi-process architecture than a
multi-threaded architecture.

Gary Wright

Thanks all for your responses.

A note: the files being processed are quite large and numerous. So
there's also plenty of file IO that has to happen. In the vanilla 'green
thread' case, would you expect performance improvements, because while
one thread was blocked for IO the other one could run?

Thanks again,

Pito

···


Whether you use threads or processes your CPU-bound tasks will run while
your IO-bound tasks are waiting for the disk.
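With threads, one way this overlap might look is a reader thread that loads files from disk while the main thread processes the ones already read. This is only a sketch: `process` and `process_files` are hypothetical names, and the SHA-256 digest merely stands in for the CPU-heavy image work:

```ruby
require "digest"

# Hypothetical stand-in for the CPU-bound image processing step.
def process(data)
  Digest::SHA256.hexdigest(data)
end

# Read files in a background thread while the caller processes them.
def process_files(paths)
  # Bounded queue, so the reader stays at most two files ahead of processing.
  queue = SizedQueue.new(2)

  reader = Thread.new do
    paths.each { |path| queue << [path, File.binread(path)] }
    queue << :done # marker telling the consumer there is nothing left
  end

  results = {}
  while (item = queue.pop) != :done
    path, data = item
    results[path] = process(data)
  end
  reader.join
  results
end

if __FILE__ == $PROGRAM_NAME
  process_files(ARGV).each { |path, digest| puts "#{digest}  #{path}" }
end
```

While the reader thread is blocked waiting on the disk, the main thread is free to work on the previous file. How much that helps in practice depends on how the time splits between reading and processing.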

Gary Wright

···

On Aug 22, 2009, at 5:02 PM, Pito Salas wrote:

A note: the files being processed are quite large and numerous. So
there's also plenty of file IO that has to happen. In the vanilla 'green
thread' case, would you expect performance improvements, because while
one thread was blocked for IO the other one could run?