I've been working on a stack of gems (async, async-io and async-http)
as a proof of concept that Ruby can do fast async networking similar
to Node - and perhaps even better in some cases.
I have finally got async-http to the point where, as a
proof-of-concept, I think it's validated at least part of the above
statement.
https://github.com/socketry/async
https://github.com/socketry/async-io
https://github.com/socketry/async-http
On my desktop, I can get a nominal throughput of between 30,000 req/s
and 100,000 req/s with 4 cores/8 processes. The first number is for
discrete connections while the second is for keep-alive connections.
So, we can see the overhead for discrete connections is about 3x that
of connections which use keep-alive. I'd like to see higher
performance but I'm still trying to figure out how to benchmark it. I
broke RubyProf trying to do this.
Cool.
Fortunately, keep-alive is well supported by nginx which is what you'd
definitely want to use as a front-end anyway.
cmogstored could be a hair faster than nginx, even for small
responses (*). It was mainly designed to avoid pathological
cases in nginx when serving static files off multiple
filesystems/devices (as opposed to all-out speed); but I've
found it a decent benchmarking tool, too:
https://bogomips.org/cmogstored/README
https://bogomips.org/cmogstored/INSTALL
If you use FreeBSD, it's also in the ports tree (but not yet in
any GNU/Linux distros). It understands HTTP/1.1 and you can get
started without installing it, just building it and doing:
./cmogstored --docroot=/path/to/static/
That listens on all addresses on port 7500 by default, so to get
"FOO" in /path/to/static, you hit: http://127.0.0.1:7500/FOO
You can add "-W $NPROC" to use multiple worker processes in case FD
allocation in the kernel becomes a problem. "-W" is undocumented
for MogileFS users, since I don't want to break compatibility for
people using the original Perl mogstored, and I haven't found FD
allocation contention to be a problem in the real world.
(*) I'm also completely slacking off by using Ragel to parse
and snprintf(!) to generate response headers.
As well as performance, each individual request is run within a fiber.
This has both benefits and drawbacks. The drawback is that you don't
want to do computationally intensive work while generating a response,
as you'll increase the latency of all other requests being handled by
that server process. The benefits are that computationally intensive
work can easily be farmed off to a thread pool, another process, etc.,
and that I/O-blocking workloads (e.g. HTTP RPC) work transparently
(provided you use async-capable IO instances) without blocking the
server.
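A rough, framework-agnostic sketch of that offloading idea (a
hypothetical handler, not async-http's actual API):

  require "digest"

  # Small thread pool for CPU-heavy work, so the fiber handling a request
  # never hogs the event loop.
  JOBS = Queue.new
  4.times do
    Thread.new do
      loop do
        body, reply = JOBS.pop
        reply << Digest::SHA256.hexdigest(body)  # the expensive part runs off-loop
      end
    end
  end

  # Called from within a request fiber (hypothetical):
  def handle(body)
    reply = Queue.new
    JOBS << [body, reply]
    reply.pop  # a real async server would wait here without blocking the reactor
  end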
Anyway, I thought others might be interested in this. There is a long
way to go to a 1.0 release but I think this is useful.
Cool. Thanks for sharing this, even if there's stuff below
I completely disagree with.
I've been thinking about what would make this kind of model faster -
in terms of how Ruby 3.0 might be changed to support this design. Here
is a complete brain dump of the various things I've been thinking
about:
- IO objects are inherently pretty heavy, calling read_nonblock and
write_nonblock is a mess. Internally, different systems do different
things (e.g. Net::HTTP manually implements timeouts by calling
read_nonblock and then wait_readable/wait_writable). The entire IO
system of Ruby is geared towards threads, but threads perform very
very poorly in MRI. By far the best design is multi-process for a ton
of different reasons.
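The pattern being described looks roughly like this (a sketch of the
general idiom, not the actual Net::HTTP code):

  require "io/wait"

  # Emulate a read timeout by looping over read_nonblock and waiting by hand.
  def read_with_timeout(io, maxlen, timeout)
    io.read_nonblock(maxlen)
  rescue IO::WaitReadable
    raise "read timed out" unless io.wait_readable(timeout)
    retry
  rescue IO::WaitWritable  # e.g. an SSL socket renegotiating mid-read
    raise "read timed out" unless io.wait_writable(timeout)
    retry
  end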
Yes, IO objects are annoyingly big :<
Threads actually perform great for high throughput situations;
but yes, they're too big for dealing with network latency.
- IO objects expose a lot of behaviour which is irrelevant to most
use-cases (io/console, io/nonblock which doesn't seem to work at all).
This makes it hard to provide a clean high-level interface.
I'm not sure what you mean by "doesn't seem to work at all"
- All IO operations should be non-blocking with a super fast/simple API.
APIs which take complex lists of arguments in the hot path should be
avoided (the `exception: false` keyword argument, for example). Having
separate functions for blocking and non-blocking IO is a huge cop-out.
NAK. I find value in using blocking accept/accept4 syscalls
(not emulating blocking with green threads/fibers + epoll/kqueue;
not even with EPOLLEXCLUSIVE)
TL;DR: I have studied the Linux kernel a bit and know
how to take advantage of it.
This is because some blocking syscalls can take advantage of
"wake one" behavior in the Linux kernel to avoid thundering
herds. EPOLLEXCLUSIVE was added a few years ago to Linux to
appease some epoll users; but it's still worse for load
distribution at high accept rates. I'd rather embrace the fact
that epoll (and kqueue) themselves are (and must be) MT-friendly.
Similarly to accept, UNIXSocket#recv_io has the same behavior
with blocking recvmsg when the receiving socket is shared
between multiple processes.
cmogstored takes advantage of this "wake one" behavior by having
dedicated accept threads, so for users using the undocumented
"-W" option, it gives nearly perfect load balancing of
connections between workers.
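A minimal sketch of that "wake one" load balancing in Ruby (cmogstored
itself is C; the port and worker count here are arbitrary):

  require "socket"

  server = TCPServer.new("127.0.0.1", 7500)

  4.times do
    fork do
      loop do
        client = server.accept  # blocking accept; the kernel wakes only one worker
        client.write("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi")
        client.close
      end
    end
  end

  Process.waitall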
Furthermore, non-blocking I/O on regular files and directories
does not exist in any portable or complete way on *nix
platforms. Threads (and processes)[2] are the only reasonable
ways to handle regular files and directories; even on NFS and other
network filesystems.
[2] inside Linux, they're both "tasks" with different levels of
sharing; the clone(2) manpage might be helpful to understand
this.
- TCPServer/TCPSocket/UDPSocket/UNIXServer/UNIXSocket are all broken
by design. This also somewhat includes the Addrinfo class. It's hard to
provide non-blocking behaviour because so many things will turn a
string hostname into an IP address, calling `getaddrinfo`.
Perhaps resolv-replace.rb in the stdlib can make auto-Fiber
more useful (see below). And ruby-core could probably use some
help maintaining it; it's been largely forgotten since the 1.8 days
when it was useful with green Threads. But yeah, getaddrinfo(3)
(along with all the other standardized name resolution APIs
before it) in the C standard library is a disaster for scalability.
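For reference, resolv-replace is just a require away; it monkey-patches
the socket classes to resolve names with the pure-Ruby resolver instead
of blocking inside getaddrinfo(3):

  require "resolv-replace"

  # TCPSocket and friends now resolve names via the pure-Ruby Resolv
  # library, so a fiber/green-thread scheduler can keep running while
  # the lookup is in flight.
  sock = TCPSocket.new("example.com", 80)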
- The `send` method for `IO` confusingly breaks `Object#send`. It
should just be `sendmsg` and `recvmsg` for datagrams and
`read`/`write` for streams. `UDPSocket#recv` and `Socket#recv` do
different things which is confusing.
`send` for streams is useful if you want to specify flags like
MSG_MORE and/or MSG_DONTWAIT. These flags are superior to
changing socket state via fcntl and setsockopt since they
require fewer syscalls and avoid races if multiple threads
operate on the same socket.
Instead, we should (and I think: have already been) downplaying
Object#send and encouraging Object#__send__ instead.
Also, I wish MSG_DONTWAIT were available for all files:
https://cr.yp.to/unix/nonblock.html
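A sketch of per-call flags versus mutating socket state (MSG_MORE is
Linux-only, and this assumes something is listening on the port from
the earlier example):

  require "socket"

  sock = TCPSocket.new("127.0.0.1", 7500)
  # Hint to the kernel that more data follows; no fcntl/setsockopt needed,
  # and nothing about the socket's state changes for other callers.
  sock.send("GET /FOO HTTP/1.1\r\n", Socket::MSG_MORE)
  sock.send("Host: 127.0.0.1\r\n\r\n", 0)  # end of request, flush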
- Fibers are fast, but I think they need to be *the* first class
concurrency construct in Ruby and made as fast as possible. I heard
that calling resume on a fiber does a syscall (if this is the case it
should be removed if possible).
We're working on auto-scheduling Fibers for 2.5:
https://bugs.ruby-lang.org/issues/13618
(but API design is hard and not my department)
As far as syscalls, it should be possible to recycle Fiber stacks
(like we do with Thread stacks) to avoid mmap/mprotect/munmap.
Maybe ko1 is working on that, too...
We shouldn't need to save+restore signal masks since Ruby
doesn't change them at normal runtime; but I think that's being
done behind our backs by the *context library calls. We may
need to use setjmp/longjmp directly instead of
(make/get/swap)context.
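To make the cost concrete: a scheduler built on plain fibers pays a
resume/yield round trip for every partial read. A minimal sketch:

  require "socket"

  r, w = UNIXSocket.pair

  reader = Fiber.new do
    loop do
      case chunk = r.read_nonblock(4096, exception: false)
      when :wait_readable then Fiber.yield  # hand control back to the scheduler
      when nil            then break        # EOF
      else puts "got #{chunk.bytesize} bytes"
      end
    end
  end

  w.write("hello")
  reader.resume  # reads "hello", then yields on :wait_readable
  w.close
  reader.resume  # sees EOF and finishes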
- Threads as they are currently implemented should be removed from
Ruby 3.0 - they actually make for a very poor concurrency concept,
considering the GIL. They make all other operations more complex with
no real benefit given how they are currently implemented. Reasoning
about threads is bloody hard. It's even worse that the GIL hides a lot
of broken behaviour. What are threads useful for? IO concurrency? Yes,
but it performs poorly. Computational concurrency? Not unless you use
JRuby or Rubinius, and even then my experience with those platforms
has generally been sub-par.
Again, native threads are useful for filesystem I/O, despite the GVL.
I wish threads could be more useful by releasing GVL for readdir and
stat operations; but releasing+acquiring GVL is expensive :<
Short term, I might complete my attempts to make GVL faster for 2.5.
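A small illustration of where threads already help: MRI drops the GVL
around the actual read(2) calls on regular files, just not (as noted)
around readdir/stat:

  # Read a batch of files concurrently; the GVL is released during each read(2).
  paths = Dir.glob("/var/log/*.log")  # hypothetical file set; glob itself holds the GVL

  threads = paths.map do |path|
    Thread.new { [path, File.read(path).bytesize] }
  end

  threads.each { |t| p t.value }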
- It's hard to reason about strings because the encoding may change at
any time. Currently, if you have a string with binary encoding and
append a UTF-8 string, its encoding will change, and this breaks all
future operations (if you are assuming it's still a binary string),
leading to hacks like
https://github.com/socketry/async-io/blob/master/lib/async/io/binary_string.rb
*shrug* I make all my strings binary as soon as I get them.
But I'm just a simple *nix plumber; everything is a bunch of
bytes to me; even processes/threads/fibers.
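For reference, the surprise being described looks like this in
current MRI (a minimal illustration):

  buf = "abc".b    # => encoding is ASCII-8BIT (binary)
  buf << "résumé"  # append UTF-8 containing non-ASCII characters
  buf.encoding     # => #<Encoding:UTF-8>, the buffer changed out from under you

  raw = "\xFF".b   # binary data that is not valid UTF-8
  raw << "résumé"  # raises Encoding::CompatibilityError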
I think that Ruby 3.0 should
- either remove the GIL, or remove Thread.
The former would be nice.
As has been mentioned by others, doing it without hurting
single-thread performance is the hard part.
I'm still hopeful we can take advantage of liburcu and steal
more ideas from the Linux kernel (unfortunately, Ruby did not go
to GPL-2+ back in the day), but liburcu is LGPL-2.1+ and we
already use libgmp optionally.
- simplify IO classes and allow permanent non-blocking mode (e.g.
io.nonblocking = true; io.read gives data or :wait_readable).
That's backwards-incompatible and I'd rather we keep using
*_nonblock. In Ruby 2.5, *_nonblock will take advantage of
MSG_DONTWAIT and avoid unnecessary fcntl for sockets under
Linux: https://bugs.ruby-lang.org/issues/13362
- ensure that Fiber is as fast as possible (creation, scheduling, etc).
AFAIK, ko1 is working on it for 2.5
- remove broken-by-design IO related classes (move to gem for
backwards compatibility?)
Make them skinnier, yes. I'm not sure how we can remove them,
and dividing up core functionality makes it more difficult to
maintain.
- read/write should be able to append to a byte string efficiently. A
byte buffer designed for fast append, index and fast slice! is a must
for most high-level protocols.
Perhaps offsets to IO read/write operations can do this:
https://bugs.ruby-lang.org/issues/11484
For those familiar with Perl5, Perl has had sysread and syswrite
functions which are capable of taking offsets to avoid unnecessary
copying.
I'm not sure how the API would be done in Ruby, though;
and kwargs is deficient in our current C API:
https://bugs.ruby-lang.org/issues/13434
(I consider 13434 higher priority)
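For contrast, the current idiom reuses a scratch string for each read
and then appends, which costs one extra copy per read that an
offset/append-capable API could avoid (a sketch using StringIO so it
runs standalone):

  require "stringio"

  io      = StringIO.new("GET / HTTP/1.1\r\nHost: example\r\n\r\n")
  scratch = String.new(capacity: 16_384)            # reused syscall buffer
  buffer  = String.new(encoding: Encoding::BINARY)  # protocol buffer we append to

  begin
    loop { buffer << io.readpartial(16_384, scratch) }
  rescue EOFError
  end

  buffer.bytesize  # => 33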
- perhaps support readv and writev under the hood - this would allow
you to write an array of buffers without needing to concatenate them
which saves on syscalls/memcpy.
Agreed for writev. No idea how a readv API would even work for Ruby...
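For writev, the call shape could look like the multi-argument IO#write
that (if I remember right) was accepted for 2.5; the pieces need not
be concatenated in Ruby before reaching the kernel:

  rd, wr = IO.pipe

  status  = "HTTP/1.1 200 OK\r\n"
  headers = "Content-Length: 2\r\n\r\n"
  body    = "hi"

  wr.write(status, headers, body)  # may go out as a single writev(2)
  wr.close
  p rd.read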
Anyways, thank you for telling us your concerns. We at ruby-core will
try our best to improve Ruby without breaking existing code.