Async-http vs puma

I've got a local rack app which is a simple disk/file based wiki with
no caching whatsoever. I did a simple test:

puma -w 8

% wrk -c 32 -t 32 -d 10 http://localhost:9292/wiki/index
Running 10s test @ http://localhost:9292/wiki/index
  32 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    40.90ms   25.46ms 228.79ms   68.09%
    Req/Sec     25.60     17.39   131.00     96.08%
  8146 requests in 10.10s, 27.17MB read
Requests/sec: 806.54
Transfer/sec: 2.69MB

async-http with 8 processes

% wrk -c 32 -t 32 -d 10 http://localhost:9292/wiki/index
Running 10s test @ http://localhost:9292/wiki/index
  32 threads and 32 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.96ms    6.08ms 119.85ms   98.30%
    Req/Sec     99.50     28.28   130.00     90.18%
  8626 requests in 10.10s, 28.89MB read
  Socket errors: connect 0, read 0, write 0, timeout 15
Requests/sec: 854.35
Transfer/sec: 2.86MB

It's an interesting comparison for a number of reasons:

- async-http, despite being pure Ruby, performs pretty similarly to
puma, which uses C extensions for parsing requests.
- puma has an average latency of 40ms which increases as contention
goes up, while async-http's latency stays around 10ms with a much
tighter standard deviation.
- both HTTP servers max out all 8 virtual cores.

I'm not proclaiming "puma is slow" or "async-http is fast" - they
simply have hugely different performance profiles depending on the
workload. async-http is not optimised at all - it's pure Ruby.

Puma will handle traditional blocking IO much better than async-http,
as it runs each request in a separate thread. On the other hand, since
async-http runs each request in a Fiber and supports cooperative IO
scheduling, all it would take is one slow upstream request, e.g.
RestClient.get "otherserver.com/resource", to completely saturate the
puma thread pool, while async-http in theory would continue to
service requests.

> I've got a local rack app which is a simple disk/file based wiki with
> no caching whatsoever. I did a simple test:

Cool. Care to try yahns out, too?

   https://yhbt.net/yahns-public/20170323040749.GA16820@dcvr/

Maybe start with this example config:
   https://yhbt.net/yahns/examples/yahns_rack_basic.conf.rb

I will never publish benchmarks for it myself; but I'll be happy
to help with tuning it.

> I'm not proclaiming "puma is slow" or "async-http is fast" - they
> simply have hugely different performance profiles depending on the
> workload. async-http is not optimised at all - it's pure Ruby.

yahns uses a Ragel HTTP parser derived from mongrel, similar to
what puma has.

> Puma will handle traditional blocking IO much better than async-http
> as it runs each request in a separate thread.

Yep, I expect yahns to do blocking IO well, too.

> On the other hand since
> async-http runs each request in a Fiber, and supports cooperative IO
> scheduling - all it would take is one upstream request, e.g.
> RestClient.get "otherserver.com/resource" to completely saturate the
> puma thread pools, while async-http in theory would continue to
> service requests.

yahns doesn't use Fibers at the moment, and there's no API for
entering its event loop; so yeah, yahns will be vulnerable to
that (unless one hooks into its internal API + event loop,
like lib/yahns/proxy_pass.rb does, but that's undocumented
and I'm not remotely committed to supporting
yet-another-public-API).

However, what yahns (and async-http) should handle well is a bunch
of really slow clients trickling requests and reading responses
slowly.

The event loop it uses is designed around kernel event queues
(epoll/kqueue), native threads, nonblocking socket I/O,
blocking accept, and optionally any number of worker processes.

design notes here: https://yhbt.net/yahns/design_notes.txt

(and if you can't access yhbt.net, it's because yahns is a
broken piece of crap :)

···

Samuel Williams <space.ship.traveller@gmail.com> wrote:

Okay, I tried the default config.

In my testing, I found that yahns was about 3-4x slower than both puma
and async-http.

I'll do some more testing.

···


Okay, I thought something was wrong: looking at htop, it wasn't using
all cores. I increased the number of workers to 8.

yahns completed 10000 requests in 12.225 seconds
puma completed 10000 requests in 12.871 seconds
async-http completed 10000 requests in 15.202 seconds

This is using ab for testing, which is particularly hard on
accept/socket handling, since each request is an HTTP/1.0
non-keep-alive connection.

···


wrk was similar, but it stresses non-connection handling - e.g. pure
read/write throughput:

puma

koyoko% wrk -c 16 -t 16 -d 10 http://localhost:9292/wiki/index
Running 10s test @ http://localhost:9292/wiki/index
  16 threads and 16 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    19.90ms   13.60ms  99.64ms   84.08%
    Req/Sec     53.69     29.92   140.00     69.01%
  8618 requests in 10.10s, 28.74MB read
Requests/sec: 853.27
Transfer/sec: 2.85MB

async-http

koyoko% wrk -c 16 -t 16 -d 10 http://localhost:9292/wiki/index
Running 10s test @ http://localhost:9292/wiki/index
  16 threads and 16 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.18ms    3.23ms  86.46ms   98.51%
    Req/Sec    109.47     17.70   121.00     95.49%
  8954 requests in 10.02s, 29.99MB read
  Socket errors: connect 0, read 0, write 0, timeout 4
Requests/sec: 893.68
Transfer/sec: 2.99MB

yahns

koyoko% wrk -c 16 -t 16 -d 10 http://localhost:9292/wiki/index
Running 10s test @ http://localhost:9292/wiki/index
  16 threads and 16 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    20.51ms   16.04ms 190.57ms   85.79%
    Req/Sec     54.43     32.59   191.00     65.56%
  8702 requests in 10.10s, 29.68MB read
Requests/sec: 861.61
Transfer/sec: 2.94MB

async-http is still leading on latency, but perhaps I should switch
off yahns buffering?

koyoko% wrk -c 16 -t 16 -d 10 http://localhost:9292/wiki/index
Running 10s test @ http://localhost:9292/wiki/index
  16 threads and 16 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    19.69ms   12.00ms 118.05ms   82.00%
    Req/Sec     52.97     21.02   131.00     74.30%
  8470 requests in 10.03s, 28.89MB read
Requests/sec: 844.29
Transfer/sec: 2.88MB

nope, didn't make much difference.

···


> wrk was similar, but it stresses non-connection handling - e.g. pure
> read/write throughput:

Yes, yahns is designed around persistent connections (it persists
them to infinity until it hits a connection threshold), but all the
defaults aim to avoid head-of-line blocking situations, even at the
expense of throughput and latency.

> puma
>
> koyoko% wrk -c 16 -t 16 -d 10 http://localhost:9292/wiki/index
> Running 10s test @ http://localhost:9292/wiki/index
>   16 threads and 16 connections
>   Thread Stats   Avg      Stdev     Max   +/- Stdev
>     Latency    19.90ms   13.60ms  99.64ms   84.08%
>     Req/Sec     53.69     29.92   140.00     69.01%
>   8618 requests in 10.10s, 28.74MB read
> Requests/sec: 853.27
> Transfer/sec: 2.85MB
>
> async-http
>
> koyoko% wrk -c 16 -t 16 -d 10 http://localhost:9292/wiki/index
> Running 10s test @ http://localhost:9292/wiki/index
>   16 threads and 16 connections
>   Thread Stats   Avg      Stdev     Max   +/- Stdev
>     Latency     9.18ms    3.23ms  86.46ms   98.51%
>     Req/Sec    109.47     17.70   121.00     95.49%
>   8954 requests in 10.02s, 29.99MB read
>   Socket errors: connect 0, read 0, write 0, timeout 4

Those timeouts with async-http seem most worrying. Your original
run had them, too, but I didn't notice.

I'm most curious what happens in overload situations. At only
16 connections, all of these servers probably end up idling
and entering short sleep states.

Also, you used -c32 and -t32 in your initial benchmark...

Maybe increase by 10x, 100x, ...

> koyoko% wrk -c 16 -t 16 -d 10 http://localhost:9292/wiki/index
> Running 10s test @ http://localhost:9292/wiki/index
>   16 threads and 16 connections
>   Thread Stats   Avg      Stdev     Max   +/- Stdev
>     Latency    20.51ms   16.04ms 190.57ms   85.79%
>     Req/Sec     54.43     32.59   191.00     65.56%
>   8702 requests in 10.10s, 29.68MB read
> Requests/sec: 861.61
> Transfer/sec: 2.94MB
>
> async-http still leading in terms of lowest latency, but perhaps I
> should switch off yahns buffering?

Unlikely, it doesn't look like your responses are big enough to
require buffering. yahns only buffers lazily (after it hits
EAGAIN). Which kernel are you using?

Beyond worker_processes, the "queue" directive might be
important, at least worker_threads:

    queue(:default) do
      worker_threads 7 # this is the default value, highly app-dependent
      max_events 1 # fairest, best in all multi-threaded cases
    end

In a synthetic benchmark, using max_events >1 probably makes it
look best, but it's not recommended for production: it will
introduce head-of-line-blocking when encountering any stalled
resources (e.g. slow DB queries and such).
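Putting that together with the worker count change from earlier in the
thread, a config sketch along these lines might apply (hedged: it uses
only the directives mentioned in this thread, not the full yahns
configuration reference):

```ruby
# Hypothetical yahns config sketch; directive names from this thread only.
worker_processes 8        # one per virtual core

queue(:default) do
  worker_threads 7        # default value; highly app-dependent
  max_events 1            # fairest; avoids head-of-line blocking
end
```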

What yahns was designed to do (with max_events=1) is if 6 out of
7 threads get stuck on some stalled resource (which releases the
Ruby GVL), the 7th thread will be able to handle requests for
all other connected clients (as long as the 7th thread is not
trying to use the same stalled resource as the other 6 threads)

So, if 6 threads are stuck waiting on slow queries using mysql2
(or anything which releases the GVL when waiting); the 7th
thread will still be able to handle fast responses for other
clients (if those responses can be handled using memcached or
similar).

Clients using persistent connections migrate freely across
worker threads in yahns: this hurts yahns in both average
latency AND throughput! But I've always considered avoiding
head-of-line blocking cases to be more critical.

···

Samuel Williams <space.ship.traveller@gmail.com> wrote:

> yahns uses a Ragel HTTP parser derived from mongrel, similar to
> what puma has.

Is this a separate gem?

I wish there was a gem for just a pure, simple, http parser.

I didn't notice the timeouts either; I don't know what they mean in
the context of wrk, so I'll have to check further. Given that I wrote
the HTTP parser/response generation from scratch, there might be a
bug, although HTTP/1.x is pretty trivial.

I'm running latest linux kernel: 4.11.3-1-ARCH

Eric Wong wrote:
> I'm most curious what happens in overload situations

I've been wondering about this too.

If accept can't service the incoming connections fast enough, they
will fill up the backlog, and it seems like they should then fail. I
guess in theory the chance of that happening is pretty low, but it's
still a possibility. Puma had a problem where it would accept more
connections than it could handle, and apparently that caused some
issues; I don't know specifically what was going on, though.

Not yet, exactly. yahns is being reworked to use kcar (still
Ragel+C), but there's still work to be done.

  https://bogomips.org/kcar-public/20170419001349.GA3474@starla/

···


Yeah, the actual kernel backlog is a bit bigger than the one
requested with listen(2); I forget the details offhand and they
varied between OSes, too.

yahns just requests a :backlog of 1024 by default (as did
Mongrel), but Linux net.core.somaxconn defaults to 128, so
it ends up capped at 128 for most Linux users.

It's probably not too important to yahns if the app is releasing
the GVL for slow ops, because yahns uses dedicated acceptor
threads.

I'm not sure what puma is doing nowadays, but yahns will accept
connections until it's out-of-FDs (limited by RLIMIT_NOFILE /
ulimit -n) and will process them in FIFO order (this ordering is
provided by kqueue/epoll).

yahns' FD handling behaves like GC: if it detects it's reaching a
(configurable) client_expire_threshold, it'll start disconnecting
persistent connections. Otherwise, persistent connections live
forever. In the 3 years I've run yahns facing real traffic, it hasn't
hit that threshold, since I have a highish RLIMIT_NOFILE and not a
lot of traffic (https://raindrops-demo.bogomips.org/).

Graceful shutdown is too conservative and takes too long because
yahns is reluctant to shut down idle connections. Not a huge
deal for me, but I guess it affects some folks with higher
memory usage during process replacement upgrades.
