nim-lang/Nim

sleepAsync() massive slow down

cheatfate opened this issue · 13 comments

I'm trying to modify asynchttpserver.nim to make it more secure and reliable (adding timeouts and limits).
But I found a real problem: using sleepAsync() as a timeout timer causes a massive slowdown in request processing.

These benchmarks were made on asynchttpserver.nim without any modifications:

MaTps 1495.00, Tps 1495.00, Resp Time  0.003, Err   0%, Count  1495
MaTps 1501.50, Tps 1560.00, Resp Time  0.003, Err   0%, Count  3055
MaTps 1503.85, Tps 1525.00, Resp Time  0.003, Err   0%, Count  4580
MaTps 1509.56, Tps 1561.00, Resp Time  0.003, Err   0%, Count  6141
MaTps 1512.01, Tps 1534.00, Resp Time  0.003, Err   0%, Count  7675
MaTps 1515.01, Tps 1542.00, Resp Time  0.003, Err   0%, Count  9217
MaTps 1517.82, Tps 1543.11, Resp Time  0.003, Err   0%, Count 10774
MaTps 1520.64, Tps 1546.00, Resp Time  0.003, Err   0%, Count 12320
MaTps 1523.87, Tps 1553.00, Resp Time  0.003, Err   0%, Count 13873
MaTps 1525.19, Tps 1537.00, Resp Time  0.003, Err   0%, Count 15410
MaTps 1526.87, Tps 1542.00, Resp Time  0.003, Err   0%, Count 16952
MaTps 1527.98, Tps 1538.00, Resp Time  0.003, Err   0%, Count 18490
MaTps 1524.48, Tps 1493.00, Resp Time  0.003, Err   0%, Count 19983
MaTps 1526.53, Tps 1545.00, Resp Time  0.003, Err   0%, Count 21528

As you can see, we have a very stable response time of 0.003s.
Now I add sleepAsync(1000) at line 178 of asynchttpserver.nim:

Before:
await client.recvLineInto(lineFut) # TODO: Timeouts.
After:
await client.recvLineInto(lineFut) or sleepAsync(1000) # TODO: Timeouts.
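
In isolation, the pattern looks roughly like the sketch below (a minimal illustration with a hypothetical readLineOrTimeout proc, not the actual asynchttpserver code). Note that the or combinator does not cancel the losing future, so every request registers a one-second timer that stays in the dispatcher until it expires.

import asyncdispatch, asyncnet

# Minimal sketch (hypothetical proc, not the real server code): race the line
# read against a one-second timer. The or combinator does not cancel the
# losing future, so the timer stays registered until it expires.
proc readLineOrTimeout(client: AsyncSocket, lineFut: FutureVar[string]) {.async.} =
  let readFut = client.recvLineInto(lineFut)
  await readFut or sleepAsync(1000)
  if not readFut.finished:
    client.close()   # timed out: drop the connection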

Benchmark results for modified asynchttpserver.nim:

MaTps 788.21, Tps 788.21, Resp Time  0.006, Err   0%, Count   789
MaTps 769.33, Tps 599.40, Resp Time  0.008, Err   0%, Count  1389
MaTps 749.84, Tps 574.43, Resp Time  0.008, Err   0%, Count  1964
MaTps 741.86, Tps 670.00, Resp Time  0.008, Err   0%, Count  2634
MaTps 725.27, Tps 576.00, Resp Time  0.009, Err   0%, Count  3210
MaTps 717.78, Tps 650.35, Resp Time  0.008, Err   0%, Count  3861
MaTps 705.20, Tps 592.00, Resp Time  0.008, Err   0%, Count  4453
MaTps 691.28, Tps 566.00, Resp Time  0.009, Err   0%, Count  5019
MaTps 688.69, Tps 665.33, Resp Time  0.008, Err   0%, Count  5685
MaTps 678.32, Tps 585.00, Resp Time  0.009, Err   0%, Count  6270
MaTps 674.62, Tps 641.36, Resp Time  0.008, Err   0%, Count  6912
MaTps 666.96, Tps 598.00, Resp Time  0.008, Err   0%, Count  7510
MaTps 657.51, Tps 572.43, Resp Time  0.009, Err   0%, Count  8083
MaTps 657.39, Tps 656.34, Resp Time  0.008, Err   0%, Count  8740
MaTps 650.59, Tps 589.41, Resp Time  0.008, Err   0%, Count  9330
MaTps 642.48, Tps 569.43, Resp Time  0.009, Err   0%, Count  9900
MaTps 644.83, Tps 666.00, Resp Time  0.008, Err   0%, Count 10566
MaTps 638.39, Tps 580.42, Resp Time  0.009, Err   0%, Count 11147

As you can see, it's more than 2x slower than the version without sleepAsync().
It also consumes much more memory: about 60 MB with sleepAsync() and only about 8 MB without it.

P.S. The benchmark tests were performed on Windows 7 x64 with the openload application. Compiled using x86_64-w64-mingw32 (gcc (GCC) 4.8.3).

Some more information: I have modified asyncdispatch.nim to use a DoubleLinkedList from lists.nim instead of a seq for working with timers, and I got some boost in request processing (a sketch of the change follows the numbers below).

MaTps 961.23, Tps 961.23, Resp Time  0.005, Err   0%, Count   967
MaTps 949.21, Tps 841.00, Resp Time  0.006, Err   0%, Count  1808
MaTps 931.09, Tps 768.00, Resp Time  0.007, Err   0%, Count  2576
MaTps 916.12, Tps 781.40, Resp Time  0.006, Err   0%, Count  3416
MaTps 926.61, Tps 1021.00, Resp Time  0.006, Err   0%, Count  4437
MaTps 913.95, Tps 800.00, Resp Time  0.006, Err   0%, Count  5237
MaTps 918.35, Tps 958.00, Resp Time  0.005, Err   0%, Count  6195
MaTps 906.12, Tps 796.00, Resp Time  0.006, Err   0%, Count  6991
MaTps 897.20, Tps 817.00, Resp Time  0.006, Err   0%, Count  7808
MaTps 911.38, Tps 1039.00, Resp Time  0.005, Err   0%, Count  8847
MaTps 902.15, Tps 819.00, Resp Time  0.006, Err   0%, Count  9666
MaTps 914.23, Tps 1023.00, Resp Time  0.005, Err   0%, Count 10689
MaTps 903.31, Tps 805.00, Resp Time  0.006, Err   0%, Count 11494
MaTps 894.10, Tps 811.25, Resp Time  0.005, Err   0%, Count 12345
MaTps 904.49, Tps 998.00, Resp Time  0.006, Err   0%, Count 13343
MaTps 887.94, Tps 739.00, Resp Time  0.007, Err   0%, Count 14082
MaTps 899.45, Tps 1003.00, Resp Time  0.005, Err   0%, Count 15085
MaTps 892.90, Tps 834.00, Resp Time  0.006, Err   0%, Count 15919
MaTps 884.21, Tps 806.01, Resp Time  0.005, Err   0%, Count 16750
MaTps 896.49, Tps 1007.00, Resp Time  0.006, Err   0%, Count 17757
MaTps 885.04, Tps 782.00, Resp Time  0.006, Err   0%, Count 18539
MaTps 898.64, Tps 1021.00, Resp Time  0.005, Err   0%, Count 19560
MaTps 890.98, Tps 822.00, Resp Time  0.006, Err   0%, Count 20382

But I think it's not enough.
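
For reference, a minimal sketch of the list-based timer storage mentioned above (hypothetical names; the tuple of deadline and future only approximates what asyncdispatch stores, and the stdlib type is spelled DoublyLinkedList):

import lists, asyncdispatch

# Hypothetical sketch: keep pending timers in lists.nim's DoublyLinkedList so
# an expired entry can be unlinked in O(1) instead of rebuilding a seq.
type TimerEntry = tuple[finishAt: float, fut: Future[void]]

var timers = initDoublyLinkedList[TimerEntry]()

proc addTimer(finishAt: float, fut: Future[void]) =
  timers.append((finishAt: finishAt, fut: fut))   # O(1) insert at the tail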

dom96 commented

sleepAsync itself needs to be optimised, and by that I mean that it needs to be integrated with epoll/kqueue and whatever else. Currently it is separate.

I don't have time to investigate the slow down currently, would be brilliant if you could look into it.
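
One way to read that suggestion, for illustration only: instead of waking the poll loop on a fixed interval, derive the epoll/kqueue wait timeout from the earliest pending timer deadline (hypothetical helper, not the actual asyncdispatch internals):

import times

# Hypothetical helper: given the earliest pending timer deadline (epochTime()
# based), compute how long the poll call may sleep without missing a timer.
proc pollTimeout(earliestDeadline: float, defaultMs = 500): int =
  let msLeft = int((earliestDeadline - epochTime()) * 1000)
  result = max(0, min(msLeft, defaultMs))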

Just as a reminder, not calling epochTime() for every timer also gives us some performance (a sketch of that change follows the numbers below).

MaTps 1108.00, Tps 1108.00, Resp Time  0.004, Err   0%, Count  1108
MaTps 1108.20, Tps 1110.00, Resp Time  0.005, Err   0%, Count  2218
MaTps 1127.68, Tps 1303.00, Resp Time  0.004, Err   0%, Count  3521
MaTps 1118.41, Tps 1035.00, Resp Time  0.005, Err   0%, Count  4556
MaTps 1109.05, Tps 1024.82, Resp Time  0.004, Err   0%, Count  5671
MaTps 1132.05, Tps 1339.00, Resp Time  0.004, Err   0%, Count  7010
MaTps 1113.54, Tps 947.00, Resp Time  0.005, Err   0%, Count  7957
MaTps 1129.59, Tps 1274.00, Resp Time  0.004, Err   0%, Count  9231
MaTps 1111.73, Tps 951.00, Resp Time  0.005, Err   0%, Count 10182
MaTps 1128.96, Tps 1284.00, Resp Time  0.004, Err   0%, Count 11466
MaTps 1107.56, Tps 915.00, Resp Time  0.005, Err   0%, Count 12381
MaTps 1091.09, Tps 942.83, Resp Time  0.004, Err   0%, Count 13453
MaTps 1096.88, Tps 1149.00, Resp Time  0.005, Err   0%, Count 14602
MaTps 1080.39, Tps 932.00, Resp Time  0.005, Err   0%, Count 15534
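
A rough sketch of that epochTime() change, building on the list-based timer storage above (illustrative only, not the actual asyncdispatch code):

import lists, times, asyncdispatch

# Hypothetical processTimers pass: epochTime() is read once and compared
# against every deadline, instead of being called again for each timer.
type TimerEntry = tuple[finishAt: float, fut: Future[void]]

proc processTimers(timers: var DoublyLinkedList[TimerEntry]) =
  let now = epochTime()          # single clock read per pass
  var node = timers.head
  while node != nil:
    let next = node.next         # remember the successor before unlinking
    if now >= node.value.finishAt:
      node.value.fut.complete()
      timers.remove(node)
    node = next
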
dom96 commented

Any chance you could also run these benchmarks on Linux?

All tests were performed in a VM (Linux phantom 4.4.1-2-ARCH #1 SMP PREEMPT Wed Feb 3 13:12:33 UTC 2016 x86_64 GNU/Linux).

Compile options: nim c

asynchttpserver.nim without modifications

$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.33ms    2.65ms  18.00ms   92.01%
    Req/Sec     1.26k    77.61     1.52k    70.50%
  25121 requests in 10.04s, 3.07MB read
Requests/sec:   2501.04
Transfer/sec:    312.63KB

$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.34ms    2.63ms  17.87ms   91.91%
    Req/Sec     1.25k    71.14     1.41k    68.50%
  24985 requests in 10.02s, 3.05MB read
Requests/sec:   2492.59
Transfer/sec:    311.57KB

asynchttpserver.nim with sleepAsync(1000)

$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.93ms    7.87ms  67.07ms   95.25%
    Req/Sec   750.77    126.30     1.18k    82.50%
  14990 requests in 10.03s, 1.83MB read
Requests/sec:   1494.65
Transfer/sec:    186.83KB

$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     8.70ms   11.71ms 104.46ms   95.22%
    Req/Sec   749.78    155.38     1.21k    87.50%
  14972 requests in 10.04s, 1.83MB read
Requests/sec:   1491.59
Transfer/sec:    186.45KB

asynchttpserver.nim with sleepAsync(1000)
timers' seq[tuple] replaced with DoubleLinkedList[tuple]

$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.36ms   11.20ms  82.74ms   94.19%
    Req/Sec     0.98k   182.66     1.25k    86.00%
  19570 requests in 10.02s, 2.39MB read
Requests/sec:   1953.59
Transfer/sec:    244.20KB

$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.81ms   12.31ms  90.08ms   93.62%
    Req/Sec     0.97k   219.84     1.22k    86.50%
  19450 requests in 10.03s, 2.37MB read
Requests/sec:   1939.64
Transfer/sec:    242.45KB

asynchttpserver.nim with sleepAsync(1000)
timers' seq[tuple] replaced with DoubleLinkedList[tuple]
epochTime() call only once in processTimers

$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.37ms   11.43ms  84.06ms   94.12%
    Req/Sec     1.00k   203.15     1.16k    87.50%
  19858 requests in 10.03s, 2.42MB read
Requests/sec:   1980.45
Transfer/sec:    247.56KB

$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.70ms   12.49ms  93.67ms   93.65%
    Req/Sec     1.00k   223.59     1.16k    87.00%
  20046 requests in 10.03s, 2.45MB read
Requests/sec:   1998.31
Transfer/sec:    249.79KB

It looks like the DoubleLinkedList and epochTime changes don't give as much of a performance boost on Linux as they do on Windows.

dom96 commented

Are you compiling with -d:release?

All tests were made without -d:release.

dom96 commented

can you try it with -d:release?

All tests were performed in a VM (Linux phantom 4.4.1-2-ARCH #1 SMP PREEMPT Wed Feb 3 13:12:33 UTC 2016 x86_64 GNU/Linux).

Compile options: nim c -d:release

asynchttpserver.nim without modifications

$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   534.45us  818.53us   6.34ms   93.52%
    Req/Sec    14.04k   434.97    14.84k    73.50%
  279579 requests in 10.01s, 34.13MB read
Requests/sec:  27941.37
Transfer/sec:      3.41MB

$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   533.20us  818.79us   6.60ms   93.51%
    Req/Sec    14.12k   427.61    14.86k    78.00%
  281196 requests in 10.01s, 34.33MB read
Requests/sec:  28100.21
Transfer/sec:      3.43MB

asynchttpserver.nim with sleepAsync(1000)

$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.64ms  728.00us   8.73ms   92.51%
    Req/Sec     3.17k   609.85     7.41k    95.50%
  63221 requests in 10.01s, 7.72MB read
Requests/sec:   6315.75
Transfer/sec:    789.47KB

$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.67ms  816.26us   9.30ms   92.85%
    Req/Sec     3.15k   607.34     7.69k    97.00%
  62890 requests in 10.03s, 7.68MB read
Requests/sec:   6272.39
Transfer/sec:    784.05KB

asynchttpserver.nim with sleepAsync(1000)
timers' seq[tuple] replaced with DoubleLinkedList[tuple]

$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.32ms    1.67ms  15.50ms   95.70%
    Req/Sec     4.81k   709.39     9.28k    95.00%
  95678 requests in 10.00s, 11.68MB read
Requests/sec:   9565.12
Transfer/sec:      1.17MB

$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.32ms    1.63ms  15.02ms   95.73%
    Req/Sec     4.79k   679.59     8.87k    92.00%
  95290 requests in 10.00s, 11.63MB read
Requests/sec:   9527.84
Transfer/sec:      1.16MB

asynchttpserver.nim with sleepAsync(1000)
timers' seq[tuple] replaced with DoubleLinkedList[tuple]
epochTime() call only once in processTimers

$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.36ms    2.63ms  21.53ms   94.30%
    Req/Sec     6.22k   841.87     9.85k    80.00%
  123903 requests in 10.01s, 15.12MB read
Requests/sec:  12379.82
Transfer/sec:      1.51MB

$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.39ms    2.67ms  21.40ms   94.22%
    Req/Sec     6.11k   788.67     9.91k    75.50%
  121520 requests in 10.00s, 14.83MB read
Requests/sec:  12148.23
Transfer/sec:      1.48MB

Ok, I have some results.
Currently only for Linux systems. I have implemented timerfd handles and created a new
proc withTimeout*[T](fut: Future[T], timeout: int): Future[void] in asyncdispatch.nim. So now I can control closing timerfd handles without implementing future.cancel() :)
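
For illustration, a simplified sketch of such a combinator (not the timerfd-based implementation described above; this variant races the future against sleepAsync and returns Future[bool] so the caller can tell whether the operation finished in time):

import asyncdispatch

# Hypothetical withTimeoutSketch: completes with true if fut finishes before
# the timer, false if the timer fires first. Neither future is cancelled.
proc withTimeoutSketch[T](fut: Future[T], timeout: int): Future[bool] =
  var retFuture = newFuture[bool]("withTimeoutSketch")
  let timer = sleepAsync(timeout)

  proc onTimer() =
    if not retFuture.finished:
      retFuture.complete(false)   # the timer fired first: timed out

  proc onDone() =
    if not retFuture.finished:
      retFuture.complete(true)    # fut finished before the timer

  timer.addCallback(onTimer)
  fut.addCallback(onDone)
  return retFuture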

Benchmark results in a VM (Linux phantom 4.4.1-2-ARCH #1 SMP PREEMPT Wed Feb 3 13:12:33 UTC 2016 x86_64 GNU/Linux)

Compile options: nim c

$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.99ms    1.80ms  27.92ms   93.33%
    Req/Sec     1.32k    39.82     1.44k    88.00%
  26293 requests in 10.01s, 3.21MB read
Requests/sec:   2625.45
Transfer/sec:    328.18KB

$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.99ms    1.81ms  27.59ms   93.24%
    Req/Sec     1.32k    38.53     1.43k    89.50%
  26284 requests in 10.02s, 3.21MB read
Requests/sec:   2623.27
Transfer/sec:    327.91KB

Compile options: nim c -d:release

$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   616.43us  843.42us   6.60ms   93.96%
    Req/Sec    11.23k   448.22    11.72k    85.50%
  223473 requests in 10.00s, 27.28MB read
Requests/sec:  22338.98
Transfer/sec:      2.73MB

$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   619.19us  845.30us   6.72ms   93.99%
    Req/Sec    11.15k   484.71    11.64k    95.00%
  222033 requests in 10.00s, 27.10MB read
Requests/sec:  22195.14
Transfer/sec:      2.71MB

dom96 commented

This is now fixed I think.

This is not fixed actually.