Inter-Process Messaging

Daniel DeLorme wrote:

Thanks for all the answers. I now have:

primitives:
- TCPServer/TCPSocket
- UNIXServer/UNIXSocket
- IO.pipe
- named pipes
- shared memory (can't find low-level lib?)

possibly mmap: http://moulon.inra.fr/ruby/mmap.html
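Of those primitives, IO.pipe is the cheapest to experiment with. A minimal fork-based round trip (Unix only; the message text is arbitrary) might look like:

```ruby
# Parent/child communication over IO.pipe: the parent sends a request on
# one pipe and the forked child replies on a second one.
req_r, req_w = IO.pipe
res_r, res_w = IO.pipe

pid = fork do
  req_w.close; res_r.close       # close the ends this process doesn't use
  msg = req_r.gets.chomp
  res_w.puts "echo: #{msg}"
end

req_r.close; res_w.close
req_w.puts "hello"
reply = res_r.gets
puts reply
Process.wait(pid)
```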

libraries: (after some digging on rubyforge)
- DRb
- Event Machine
- ActiveMessaging
- Slave
- BackgroundDRb
- System V IPC
- POSIXIPC (not released)
- reliable-msg
- MPI Ruby
- stomp / stompserver / stompmessage
- AP4R (Asynchronous Processing for Ruby)
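Of the libraries above, DRb ships with the standard library, so it's the quickest to try. A minimal sketch (the Greeter class and greeting text are just placeholders):

```ruby
require 'drb/drb'

# Serve a plain Ruby object over DRb, then call it through a proxy.
class Greeter
  def hello(name)
    "hello, #{name}"
  end
end

DRb.start_service("druby://127.0.0.1:0", Greeter.new)  # :0 picks a free port
remote = DRbObject.new_with_uri(DRb.uri)
greeting = remote.hello("world")
puts greeting
DRb.stop_service
```

The same client code works unchanged if the service runs in another process or on another host; only the druby:// URI changes.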

Frankly I'm still not entirely clear on the advantages of each, both in terms of features and in terms of performance. I guess I'll just have to write some code to find out...

Daniel

1. You missed Rinda, a Linda derivative layered on top of DRb.
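Rinda's tuplespace model is small enough to demo in a single process (the :sum tuple here is an arbitrary example):

```ruby
require 'rinda/tuplespace'

ts = Rinda::TupleSpace.new
ts.write([:sum, 2, 3])            # a producer posts a tuple
tuple = ts.take([:sum, nil, nil]) # a consumer pattern-matches and removes it
puts tuple[1] + tuple[2]
```

Hung off a DRb service, the same TupleSpace becomes a coordination point for multiple processes.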

2. There are basically two ways to do concurrency/parallelism: shared memory and message passing. I'm not sure what the *real* tradeoffs are -- I've pretty much had my brain hammered with "shared memory bad -- message passing good", but there obviously must be *some* counter-arguments, or shared memory wouldn't exist. :-)

The way I look at it, Re: trade-offs, message passing is nice
because the API doesn't change whether the receiver is on localhost
or over the wire.

Shared memory is nice for very high bandwidth between processes;
say process A decodes a 300MB TIFF image into memory and wants
to make the bitmap available to process B.

3. System V IPC has three components -- message queues, semaphores, and shared memory segments. So ... if you've found a System V IPC library, you've found a shared memory library.

Regards,

Bill

···

From: "M. Edward (Ed) Borasky" <znmeb@cesmail.net>

Francis Cianfrocca wrote:

What no one has yet asked you is: what kind of data do you have to pass
between this fork-parent and child, and by what protocol?

Are they simple commands and responses (as in HTTP)? Is there a
state-machine (as in SMTP)? Or is the data-transfer full-duplex without a
protocol?

Keeping in mind that this project is mainly intended as a learning experience, my specific idea is an HTTP server architecture with generalist and specialist worker processes. The dispatcher would partition requests to generalists (as in HTTP) who in turn might dispatch to specialists (as in SMTP) for particular sub-tasks.

Simple example: 100 requests for a thumbnail come at the same time, are split among N generalist workers, each of which asks the thumbnail specialist process to generate the thumbnail. The specialist catches the 100 simultaneous requests, generates the thumbnail *once* and sends the result back to the N generalists, who render it.
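That "generate once, answer everyone" behavior can be sketched in-process with a mutex and condition variable. The Coalescer class below is a hypothetical illustration, not taken from any of the libraries discussed:

```ruby
require 'thread'

# Concurrent requests for the same key are coalesced: the first caller runs
# the expensive block, everyone else waits and receives the same result.
class Coalescer
  def initialize
    @lock    = Mutex.new
    @pending = {}   # key => { cv:, done:, value: }
  end

  def fetch(key)
    entry = nil
    @lock.synchronize do
      if (entry = @pending[key])                 # work already in flight
        entry[:cv].wait(@lock) until entry[:done]
        return entry[:value]
      end
      @pending[key] = entry =
        { cv: ConditionVariable.new, done: false, value: nil }
    end
    entry[:value] = yield                        # run the expensive work once
    @lock.synchronize do
      entry[:done] = true
      @pending.delete(key)
      entry[:cv].broadcast                       # wake every coalesced waiter
    end
    entry[:value]
  end
end

calls  = 0
cache  = Coalescer.new
first  = Thread.new { cache.fetch(:thumb) { sleep 0.5; calls += 1; "PNG" } }
sleep 0.2                    # stragglers arrive while generation is running
others = 5.times.map { Thread.new { cache.fetch(:thumb) { calls += 1; "PNG" } } }
results = ([first] + others).map(&:value)
puts "generated #{calls} time(s) for #{results.size} requests"
```

The real version would live in a separate specialist process with requests arriving over a pipe or socket, but the coalescing logic is the same.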

Are the data-flows extremely large? Exactly what are your performance
requirements?

Good questions, but ultimately my true purpose is to educate myself about parallel processing; that's the only requirement I have. So while I'd say data-flows are unlikely to be large in this case, I'd still like to know how to handle large data-flows.

Hmmm, any good books to recommend?

(Putting on my fake beard) In the old days...

Shared memory is (probably) always the fastest solution; in fact, on some
OS's, local message passing is implemented as a layer on top of shared
memory.

But, of course, if you implement concurrency in terms of shared memory, you
have to worry about lock contention, queue starvation, and all the other
things that generally get handled for you if you use a higher-level
messaging protocol. And your software is now stuck with assuming that the
sender and receiver are on the same machine; most other messaging libraries
will work equally well on the same or different machines.

Back when machines were much slower, I had an application that already used
shared memory for caching, but was bogging down on system-level message
passing calls. There were three worker servers that would make requests to
the cache manager to load a record from the database into the shared-memory
cache for the worker to use. (This serialized database access and reduced
the number of open file handles.)

So I changed them to stop using the kernel-level message-queuing routines;
instead, they'd store their requests in a linked list that was kept in a
different shared memory region. The cache manager would unlink the first
request from the list, process it, and link that same request structure
back onto the "reply" list with a return code. The requests/replies were
very small, stayed in processor cache, etc., and there was much less
context-switching in and out of kernel mode since the queuing was now all
userland. This also saved a lot of memory allocate/frees, another
expensive operation at the time; most message-passing involves at least one
full copy of the message.

Occasionally, the request list would empty out, in which case we had to use
the kernel to notify the cache manager to wake up and check its list, but
that was a rare occasion, and "notify events" on that system were still
much cheaper than a queue message.

I would doubt that any of this type of optimization applies to Ruby on
modern OS's, however.

···

On Fri, 12 Oct 2007 15:18:26 +0900, M. Edward (Ed) Borasky wrote:

2. There are basically two ways to do concurrency/parallelism: shared
memory and message passing. I'm not sure what the *real* tradeoffs are
-- I've pretty much had my brain hammered with "shared memory bad --
message passing good", but there obviously must be *some*
counter-arguments, or shared memory wouldn't exist. :-)

--
Jay Levitt |
Boston, MA | My character doesn't like it when they
Faster: jay at jay dot fm | cry or shout or hit.
http://www.jay.fm | - Kristoffer

I got the joke... :-)

I assumed it was a takeoff on "Process.kill (impossible to send data)".

···

On Fri, 12 Oct 2007 00:17:11 +0900, Gary Wright wrote:

I guess I should have added some smiley faces. :-)

There is a whole area of computer security, though (covert
signaling methods), where low-bandwidth, unreliable communication
paths are still of interest.

--
Jay Levitt |
Boston, MA | My character doesn't like it when they
Faster: jay at jay dot fm | cry or shout or hit.
http://www.jay.fm | - Kristoffer

Gary Wright wrote:

I guess I should have added some smiley faces. :-)

: )

There is a whole area of computer security, though (covert
signaling methods), where low-bandwidth, unreliable communication
paths are still of interest.

http://www.ietf.org/rfc/rfc1149.txt

in case anyone hasn't seen that yet.

···

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

You got me. I am PWNED!

:-)

···

On 10/11/07, Gary Wright <gwtmp01@mac.com> wrote:

On Oct 10, 2007, at 6:04 PM, Francis Cianfrocca wrote:
> On 10/10/07, Gary Wright <gwtmp01@mac.com> wrote:
>>
>>
>> Daniel DeLorme wrote:
>>> Right now I can think of some messaging primitives:
>>> - TCPServer/TCPSocket
>>> - UNIXServer/UNIXSocket (unstable?)
>>> - IO.pipe (doesn't need port#)
>>> - Process.kill (impossible to send data)
>>
>> Not that it would be useful but I guess that if you used
>> two different signals (say SIGUSR1 and SIGUSR2) then you've
>> got a method of communicating a sequence of binary digits
>> to another process. If you've got a reasonably accurate
>> way of timing things you could just manage this with a single
>> signal because the absence of the signal could be detected.
>
> There's some cleverness to this idea but I would avoid it. Signals
> interact
> very badly with threads and other system facilities, they're not
> deterministic, they have serious platform dependencies among different
> Unixes, they're heavyweight (resource intensive), and they have
> difficult
> APIs.

I guess I should have added some smiley faces. :-)

There is a whole area of computer security, though (covert
signaling methods), where low-bandwidth, unreliable communication
paths are still of interest.
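For the record, the two-signal trick really can be made to limp along in Ruby (Unix only; the bit pattern and pacing below are arbitrary, and none of the caveats above go away):

```ruby
# Covert-channel toy: SIGUSR1 transmits a 0 bit, SIGUSR2 a 1 bit.
bits = []
reader, writer = IO.pipe   # used only to report the received bits back

pid = fork do
  reader.close
  Signal.trap("USR1") { bits << 0 }
  Signal.trap("USR2") { bits << 1 }
  Signal.trap("TERM") { writer.puts bits.join; writer.close; exit }
  loop { sleep 0.1 }       # idle; each signal interrupts one sleep
end

writer.close
sleep 0.3                  # give the child time to install its handlers
[1, 0, 1].each do |bit|
  Process.kill(bit == 1 ? "USR2" : "USR1", pid)
  sleep 0.1                # signals aren't queued; pace them out
end
Process.kill("TERM", pid)
received = reader.gets.chomp
puts received
Process.wait(pid)
```

The pacing is the giveaway: send two identical signals too quickly and the kernel coalesces them, silently dropping a bit.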

Bill Kelly wrote:

From: "M. Edward (Ed) Borasky" <znmeb@cesmail.net>

...

2. There are basically two ways to do concurrency/parallelism: shared memory and message passing. I'm not sure what the *real* tradeoffs are -- I've pretty much had my brain hammered with "shared memory bad -- message passing good", but there obviously must be *some* counter-arguments, or shared memory wouldn't exist. :-)

The way I look at it, Re: trade-offs, message passing is nice
because the API doesn't change whether the receiver is on localhost
or over the wire.

Shared memory is nice for very high bandwidth between processes;
say process A decodes a 300MB TIFF image into memory and wants
to make the bitmap available to process B.

And shared memory might also be nice if you don't want queueing, you just want the latest data. Sensors in a real-time system, for example.

···

--
       vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407

Sounds like you've already read all the books you need to read. You have the
standard lingo down pat!

I'm probably the wrong person to ask, because I've been doing
high-performance parallel processing for many, many years, and my advice
(being experience-based) will assuredly fly in the face of orthodoxy.

But here goes: you picked the wrong project to demonstrate parallel
processing. Fast handling of network I/O is best done in an event-driven
way, and not in parallel. The parallelism that this problem exhibits arises
from the inherent nondeterminacy of having many independent clients
operating simultaneously. This pattern does expose capturable intramachine
latencies, but they're due to timing differentials, not to processing
inter-dependencies.

It's intuitively attractive to structure a network server as a set of
parallel processes or threads, but it doesn't add anything in terms of
performance or scalability. As regards multicore architectures, they add
little to a network server because the size of the incoming network pipe
typically dominates processor bandwidth in such applications.

You may rejoin: "but how about an HTTP server that does a massive amount of
local processing to fulfill each request?" Now that's more interesting. Just
get rid of the HTTP part and concentrate on how to parallelize the
processing. That's a huge and well-studied problem in itself, and the net is
full of good resources on it.

···

On 10/12/07, Daniel DeLorme <dan-ml@dan42.com> wrote:

Francis Cianfrocca wrote:
> What no one has yet asked you is: what kind of data do you have to pass
> between this fork-parent and child, and by what protocol?
>
> Are they simple commands and responses (as in HTTP)? Is there a
state-machine (as in SMTP)? Or is the data-transfer full-duplex without a
protocol?

Keeping in mind that this project is mainly intended as a learning
experience, my specific idea is an HTTP server architecture with
generalist and specialist worker processes. The dispatcher would
partition requests to generalists (as in HTTP) who in turn might
dispatch to specialists (as in SMTP) for particular sub-tasks.

Simple example: 100 requests for a thumbnail come at the same time, are
split among N generalist workers, each of which asks the thumbnail
specialist process to generate the thumbnail. The specialist catches the
100 simultaneous requests, generates the thumbnail *once* and sends the
result back to the N generalists, who render it.

> Are the data-flows extremely large? Exactly what are your performance
> requirements?

Good questions, but ultimately my true purpose is to educate myself
about parallel processing; that's the only requirement I have. So while
I'd say data-flows are unlikely to be large in this case, I'd still like
to know how to handle large data-flows.

Hmmm, any good books to recommend?

At the risk of starting a threadjack, I think Ruby is today not as
well-served as some other development products in the way of native
message-passing systems. There is Assaf Arkin's very nice reliable-msg
library, which defines a good API for messaging, and has some support for
persistence. And of course there are several libraries which support Stomp,
making it easier to work with products like the Java-based ActiveMQ.

I'd like to see a full-featured, high-performance MQ system for Ruby,
however. By "for Ruby" I don't necessarily mean "in Ruby," but rather
tightly integrated and easy/intuitive to use. Doing such a thing was the
original motivation for creating EventMachine, by the way.

···

On 10/12/07, Jay Levitt <jay+news@jay.fm> wrote:

On Fri, 12 Oct 2007 15:18:26 +0900, M. Edward (Ed) Borasky wrote:

So I changed them to stop using the kernel-level message-queuing routines;
instead, they'd store their requests in a linked list that was kept in a
different shared memory region. The cache manager would unlink the first
request from the list, process it, and link that same request structure
back onto the "reply" list with a return code. The requests/replies were
very small, stayed in processor cache, etc., and there was much less
context-switching in and out of kernel mode since the queuing was now all
userland. This also saved a lot of memory allocate/frees, another
expensive operation at the time; most message-passing involves at least one
full copy of the message.

Occasionally, the request list would empty out, in which case we had to use
the kernel to notify the cache manager to wake up and check its list, but
that was a rare occasion, and "notify events" on that system were still
much cheaper than a queue message.

I would doubt that any of this type of optimization applies to Ruby on
modern OS's, however.

Bill Kelly wrote:

From: "M. Edward (Ed) Borasky" <znmeb@cesmail.net>

...

2. There are basically two ways to do concurrency/parallelism: shared memory and message passing. I'm not sure what the *real* tradeoffs are -- I've pretty much had my brain hammered with "shared memory bad -- message passing good", but there obviously must be *some* counter-arguments, or shared memory wouldn't exist. :-)

The way I look at it, Re: trade-offs, message passing is nice
because the API doesn't change whether the receiver is on localhost
or over the wire.

(Just wanted to correct myself and say the API *need not* change,
rather than does not change... certainly some message passing
APIs only work on the local machine, but others are network-
transparent. :-) )

Shared memory is nice for very high bandwidth between processes;
say process A decodes a 300MB TIFF image into memory and wants
to make the bitmap available to process B.

And shared memory might also be nice if you don't want queueing, you just want the latest data. Sensors in a real-time system, for example.

Ah, yeah. Makes sense.

Indeed, this can also apply over the wire; for example, dealing with
a client->server->client remote mouse click-drag-release operation.
(Where, say, a client drags a scrollbar thumb in a UI that is hosted
on the server and may be rendered to multiple clients.) One might
transmit the initial click coordinate over TCP, then the subsequent
real-time drag telemetry over UDP, and finally the terminating
click-release coordinate again over TCP. (That is, TCP is used for
reliable messages, UDP for unreliable, best-effort streaming of
real-time telemetry where one only wants the latest data.)
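The "only the latest datagram matters" half of that can be sketched with Ruby's UDPSocket: the reader drains whatever is queued and keeps just the newest sample (loopback address and payloads are arbitrary):

```ruby
require 'socket'

receiver = UDPSocket.new
receiver.bind("127.0.0.1", 0)          # kernel picks a free port
port = receiver.addr[1]

# Simulate a burst of drag telemetry; only the last position matters.
sender = UDPSocket.new
3.times { |i| sender.send("drag x=#{i}", 0, "127.0.0.1", port) }
sleep 0.2                              # let the datagrams land

latest = nil
begin
  loop { latest = receiver.recvfrom_nonblock(64).first }
rescue IO::WaitReadable                # queue drained; keep the newest sample
end
puts latest
```

Over loopback every datagram arrives; over a real network the drain-and-keep-latest loop is exactly what makes dropped packets harmless.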

Regards,

Bill

···

From: "Joel VanderWerf" <vjoel@path.berkeley.edu>

Francis Cianfrocca wrote:

Sounds like you've already read all the books you need to read. You have the
standard lingo down pat!

I'm afraid that must be a freak accident ;-)

But here goes: you picked the wrong project to demonstrate parallel
processing. Fast handling of network I/O is best done in an event-driven
way, and not in parallel. The parallelism that this problem exhibits arises
from the inherent nondeterminacy of having many independent clients
operating simultaneously. This pattern does expose capturable intramachine
latencies, but they're due to timing differentials, not to processing
inter-dependencies.

Could you elaborate what you mean by "timing differentials" and "processing inter-dependencies"? For regular webapps, time spent querying the database most certainly exposes capturable intramachine latencies. Event-driven sounds good, but doesn't that require that *all* I/O be non-blocking? If you have blocking I/O in, say, a third-party lib, you're toast.

It's intuitively attractive to structure a network server as a set of
parallel processes or threads, but it doesn't add anything in terms of
performance or scalability. As regards multicore architectures, they add
little to a network server because the size of the incoming network pipe
typically dominates processor bandwidth in such applications.

You mean to say it's the network that is usually the bottleneck, not the CPU? Well, in my experience the database is usually the bottleneck, but let's not forget that Ruby is particularly demanding on the CPU.

You may rejoin: "but how about an HTTP server that does a massive amount of
local processing to fulfill each request?" Now that's more interesting. Just
get rid of the HTTP part and concentrate on how to parallelize the
processing. That's a huge and well-studied problem in itself, and the net is
full of good resources on it.

While optimizing for CPU speed is fine, I'm also interested in process isolation. If you have a monster lib that takes 1 minute to initialize and requires 1 GB of resident memory but is used only occasionally, do you really want to load it in all of your worker processes?

Francis Cianfrocca wrote:

At the risk of starting a threadjack, I think Ruby is today not as
well-served as some other development products in the way of native
message-passing systems. There is Assaf Arkin's very nice reliable-msg
library, which defines a good API for messaging, and has some support for
persistence. And of course there are several libraries which support Stomp,
making it easier to work with products like the Java-based ActiveMQ.

I'd like to see a full-featured, high-performance MQ system for Ruby,
however. By "for Ruby" I don't necessarily mean "in Ruby," but rather
tightly integrated and easy/intuitive to use. Doing such a thing was the
original motivation for creating EventMachine, by the way.

"If you build it, they will come." :-)

Could you elaborate what you mean by "timing differentials" and
"processing inter-dependencies"? For regular webapps, time spent
querying the database most certainly exposes capturable intramachine
latencies. Event-driven sounds good, but doesn't that require that
*all* I/O be non-blocking? If you have blocking I/O in, say, a
third-party lib, you're toast.

Event-driven application frameworks often deal with library calls that
necessarily involve blocking network or intramachine I/O (DBMS calls being
the only really common example) by using a thread pool. For example,
EventMachine provides the #defer method to do this. It manages an internal
thread pool that runs outside of the main reactor loop.
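EventMachine's #defer is the real thing; the MiniReactor below is only a hypothetical sketch of the same shape, to show the moving parts:

```ruby
require 'thread'

# Defer pattern: blocking work runs on pool threads, while completion
# callbacks are handed back to the single "reactor" thread.
class MiniReactor
  def initialize(pool_size = 2)
    @jobs      = Queue.new    # [operation, callback] pairs for the pool
    @callbacks = Queue.new    # finished results for the reactor thread
    pool_size.times do
      Thread.new do
        loop do
          op, cb = @jobs.pop
          @callbacks << [cb, op.call]   # blocking work happens off-reactor
        end
      end
    end
  end

  def defer(op, cb)
    @jobs << [op, cb]
  end

  # One turn of the reactor loop: deliver a completed result.
  def run_once
    cb, result = @callbacks.pop
    cb.call(result)
  end
end

reactor = MiniReactor.new
reactor.defer(-> { 2 + 2 },                # pretend this is a slow DBMS call
              ->(v) { puts "result: #{v}" })
reactor.run_once
```

The key property is that callbacks always run on the reactor thread, so application code never needs its own locking.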

In general, the problem of architecting a high-performance web server that
includes external dependencies like databases, legacy applications, SOAP,
message-queueing systems, etc etc, is a very big problem with no simple
answer. It's also been intensely studied, so there are resources for you all
over the web.

You mean to say it's the network that is usually the bottleneck, not the
CPU? Well, in my experience the database is usually the bottleneck, but
let's not forget that ruby is particularly demanding on the CPU.

I made that remark in relation to efforts to make network servers run faster
by hosting them on multiprocessor or multicore machines. You'll usually find
that a single computer with one big network pipe attached to it won't be
able to process the I/O fast enough to keep all the cores busy. You might
then be tempted to host the DBMS on the same machine, but that's rarely a
good idea. Simpler is better.

While optimizing for CPU speed is fine, I'm also interested in process
isolation. If you have a monster lib that takes 1 minute to initialize
and requires 1 GB of resident memory but is used only occasionally, do
you really want to load it in all of your worker processes?

You're assuming that worker processes are a good idea in the first place,
which I haven't granted :-). But seriously, if this is the road you want to
go down, then I'd recommend you look at message-queueing technologies. I'm
not in favor of distributed objects for a lot of reasons, but something like
DRb will give you the ability to avoid marshalling and unmarshalling data.
SOAP is another alternative, but having been there and done that, I'd rather
chew glass.

···

On 10/12/07, Daniel DeLorme <dan-ml@dan42.com> wrote:

Francis Cianfrocca wrote:

In general, the problem of architecting a high-performance web server that
includes external dependencies like databases, legacy applications, SOAP,
message-queueing systems, etc etc, is a very big problem with no simple
answer. It's also been intensely studied, so there are resources for you all
over the web.

In other words, Robert Heinlein's TANSTAAFL principle holds up for this domain, like many others: "There Ain't No Such Thing As A Free Lunch!"

Ironically, I was invited to a seminar a couple of weeks ago about concurrency titled, "The Free Lunch Is Over". What was ironic about it was that I couldn't attend because I had a prior commitment -- a service anniversary cruise with my employer at which I received a free lunch. :-)

I made that remark in relation to efforts to make network servers run faster
by hosting them on multiprocessor or multicore machines. You'll usually find
that a single computer with one big network pipe attached to it won't be
able to process the I/O fast enough to keep all the cores busy. You might
then be tempted to host the DBMS on the same machine, but that's rarely a
good idea. Simpler is better.

Up to a point, yes, simpler is better. But the goal of system performance engineering of this type is to have, as much as possible, a balanced system -- network, disk and processor utilizations approximately equal and none of them saturated. That's the "sweet spot" where you get the highest throughput for the lowest cost.

If your workload is well-behaved, you can sometimes get there "on the average over a workday". But web server workloads are anything but well-behaved, even in the absence of deliberate denial of service attacks. :-)