sleepAsync() massive slow down
cheatfate opened this issue · 13 comments
I'm trying to modify asynchttpserver.nim to make it more secure and reliable (add timeouts and limits).
But I found a real problem: using sleepAsync() as a timeout timer gives a massive slowdown in request processing.
These benchmarks were made on asynchttpserver.nim without any modifications:
MaTps 1495.00, Tps 1495.00, Resp Time 0.003, Err 0%, Count 1495
MaTps 1501.50, Tps 1560.00, Resp Time 0.003, Err 0%, Count 3055
MaTps 1503.85, Tps 1525.00, Resp Time 0.003, Err 0%, Count 4580
MaTps 1509.56, Tps 1561.00, Resp Time 0.003, Err 0%, Count 6141
MaTps 1512.01, Tps 1534.00, Resp Time 0.003, Err 0%, Count 7675
MaTps 1515.01, Tps 1542.00, Resp Time 0.003, Err 0%, Count 9217
MaTps 1517.82, Tps 1543.11, Resp Time 0.003, Err 0%, Count 10774
MaTps 1520.64, Tps 1546.00, Resp Time 0.003, Err 0%, Count 12320
MaTps 1523.87, Tps 1553.00, Resp Time 0.003, Err 0%, Count 13873
MaTps 1525.19, Tps 1537.00, Resp Time 0.003, Err 0%, Count 15410
MaTps 1526.87, Tps 1542.00, Resp Time 0.003, Err 0%, Count 16952
MaTps 1527.98, Tps 1538.00, Resp Time 0.003, Err 0%, Count 18490
MaTps 1524.48, Tps 1493.00, Resp Time 0.003, Err 0%, Count 19983
MaTps 1526.53, Tps 1545.00, Resp Time 0.003, Err 0%, Count 21528
As you can see, we have a very stable response time of 0.003s.
Now I add sleepAsync(1000) to asynchttpserver.nim at line 178.
Before:
await client.recvLineInto(lineFut) # TODO: Timeouts.
After:
await client.recvLineInto(lineFut) or sleepAsync(1000) # TODO: Timeouts.
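For context, here is the same timeout pattern as a standalone helper (a sketch only, using recvLine and a hypothetical readWithTimeout proc instead of the server's recvLineInto/FutureVar plumbing):

```nim
import asyncdispatch, asyncnet

proc readWithTimeout(client: AsyncSocket): Future[string] {.async.} =
  let lineFut = client.recvLine()     # starts the read immediately
  await lineFut or sleepAsync(1000)   # resumes when either side completes
  if lineFut.finished:
    result = lineFut.read()           # the line arrived before the deadline
  else:
    result = ""                       # deadline hit; caller should drop the client
  # Note: the sleepAsync(1000) future is not cancelled here; it stays in the
  # dispatcher's timer list until the full second elapses, even if the read
  # finished long before that.
```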
Benchmark results for modified asynchttpserver.nim:
MaTps 788.21, Tps 788.21, Resp Time 0.006, Err 0%, Count 789
MaTps 769.33, Tps 599.40, Resp Time 0.008, Err 0%, Count 1389
MaTps 749.84, Tps 574.43, Resp Time 0.008, Err 0%, Count 1964
MaTps 741.86, Tps 670.00, Resp Time 0.008, Err 0%, Count 2634
MaTps 725.27, Tps 576.00, Resp Time 0.009, Err 0%, Count 3210
MaTps 717.78, Tps 650.35, Resp Time 0.008, Err 0%, Count 3861
MaTps 705.20, Tps 592.00, Resp Time 0.008, Err 0%, Count 4453
MaTps 691.28, Tps 566.00, Resp Time 0.009, Err 0%, Count 5019
MaTps 688.69, Tps 665.33, Resp Time 0.008, Err 0%, Count 5685
MaTps 678.32, Tps 585.00, Resp Time 0.009, Err 0%, Count 6270
MaTps 674.62, Tps 641.36, Resp Time 0.008, Err 0%, Count 6912
MaTps 666.96, Tps 598.00, Resp Time 0.008, Err 0%, Count 7510
MaTps 657.51, Tps 572.43, Resp Time 0.009, Err 0%, Count 8083
MaTps 657.39, Tps 656.34, Resp Time 0.008, Err 0%, Count 8740
MaTps 650.59, Tps 589.41, Resp Time 0.008, Err 0%, Count 9330
MaTps 642.48, Tps 569.43, Resp Time 0.009, Err 0%, Count 9900
MaTps 644.83, Tps 666.00, Resp Time 0.008, Err 0%, Count 10566
MaTps 638.39, Tps 580.42, Resp Time 0.009, Err 0%, Count 11147
As you can see, it's more than 2x slower than the version without sleepAsync().
It also consumes much more memory: about 60 MB with sleepAsync() versus only about 8 MB without it.
P.S. The benchmarks were performed on Windows 7 x64 with the openload tool. Compiled using x86_64-w64-mingw32 (gcc (GCC) 4.8.3).
Some more information: I have modified asyncdispatch.nim to use DoubleLinkedList from lists.nim instead of a seq for storing timers, and I got some boost in request processing (a rough sketch of the idea is below, followed by the new numbers).
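A minimal sketch of the idea, not the actual patch (the stdlib spells the type DoublyLinkedList, and TimerEntry/addTimer are illustrative names):

```nim
import lists, times, asyncdispatch

type TimerEntry = tuple[finishAt: float, fut: Future[void]]

var timers = initDoublyLinkedList[TimerEntry]()

proc addTimer(timeout: int, fut: Future[void]) =
  ## register a timer that should fire `timeout` milliseconds from now
  timers.append((finishAt: epochTime() + timeout / 1000, fut: fut))

proc processTimers() =
  ## complete and unlink every expired timer; unlinking a node is O(1),
  ## unlike deleting from the middle of a seq
  var node = timers.head
  while node != nil:
    let next = node.next              # remember the successor before unlinking
    if epochTime() >= node.value.finishAt:
      node.value.fut.complete()
      timers.remove(node)
    node = next
```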
MaTps 961.23, Tps 961.23, Resp Time 0.005, Err 0%, Count 967
MaTps 949.21, Tps 841.00, Resp Time 0.006, Err 0%, Count 1808
MaTps 931.09, Tps 768.00, Resp Time 0.007, Err 0%, Count 2576
MaTps 916.12, Tps 781.40, Resp Time 0.006, Err 0%, Count 3416
MaTps 926.61, Tps 1021.00, Resp Time 0.006, Err 0%, Count 4437
MaTps 913.95, Tps 800.00, Resp Time 0.006, Err 0%, Count 5237
MaTps 918.35, Tps 958.00, Resp Time 0.005, Err 0%, Count 6195
MaTps 906.12, Tps 796.00, Resp Time 0.006, Err 0%, Count 6991
MaTps 897.20, Tps 817.00, Resp Time 0.006, Err 0%, Count 7808
MaTps 911.38, Tps 1039.00, Resp Time 0.005, Err 0%, Count 8847
MaTps 902.15, Tps 819.00, Resp Time 0.006, Err 0%, Count 9666
MaTps 914.23, Tps 1023.00, Resp Time 0.005, Err 0%, Count 10689
MaTps 903.31, Tps 805.00, Resp Time 0.006, Err 0%, Count 11494
MaTps 894.10, Tps 811.25, Resp Time 0.005, Err 0%, Count 12345
MaTps 904.49, Tps 998.00, Resp Time 0.006, Err 0%, Count 13343
MaTps 887.94, Tps 739.00, Resp Time 0.007, Err 0%, Count 14082
MaTps 899.45, Tps 1003.00, Resp Time 0.005, Err 0%, Count 15085
MaTps 892.90, Tps 834.00, Resp Time 0.006, Err 0%, Count 15919
MaTps 884.21, Tps 806.01, Resp Time 0.005, Err 0%, Count 16750
MaTps 896.49, Tps 1007.00, Resp Time 0.006, Err 0%, Count 17757
MaTps 885.04, Tps 782.00, Resp Time 0.006, Err 0%, Count 18539
MaTps 898.64, Tps 1021.00, Resp Time 0.005, Err 0%, Count 19560
MaTps 890.98, Tps 822.00, Resp Time 0.006, Err 0%, Count 20382
But I think it's not enough.
sleepAsync itself needs to be optimised, and by that I mean that it needs to be integrated with epoll/kqueue and whatever else. Currently it is separate.
I don't have time to investigate the slow down currently, would be brilliant if you could look into it.
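(For illustration only, not the asyncdispatch API: one way to read "integrated with epoll/kqueue" is to make the poll call sleep exactly until the nearest pending deadline, instead of waking on a fixed interval and scanning the timer list separately. A rough sketch with a hypothetical nextTimeoutMs helper:)

```nim
import times

type Timer = tuple[finishAt: float]   # illustrative stand-in for the real timer record

proc nextTimeoutMs(timers: seq[Timer], defaultMs = 500): int =
  ## timeout to hand to epoll_wait/kevent: time until the earliest deadline,
  ## clamped to zero, or the default when no timer is pending
  result = defaultMs
  let now = epochTime()
  for t in timers:
    let remaining = int((t.finishAt - now) * 1000)
    result = min(result, max(remaining, 0))
```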
Just as a reminder, not calling epochTime() for every timer also gives us some performance (sketch and numbers below).
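Continuing the list-based sketch above (same illustrative timers list), the sweep reads the clock once per pass instead of once per pending timer:

```nim
proc processTimers() =
  ## same sweep as before, but with a single epochTime() call per pass
  let now = epochTime()
  var node = timers.head
  while node != nil:
    let next = node.next
    if now >= node.value.finishAt:
      node.value.fut.complete()
      timers.remove(node)
    node = next
```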
MaTps 1108.00, Tps 1108.00, Resp Time 0.004, Err 0%, Count 1108
MaTps 1108.20, Tps 1110.00, Resp Time 0.005, Err 0%, Count 2218
MaTps 1127.68, Tps 1303.00, Resp Time 0.004, Err 0%, Count 3521
MaTps 1118.41, Tps 1035.00, Resp Time 0.005, Err 0%, Count 4556
MaTps 1109.05, Tps 1024.82, Resp Time 0.004, Err 0%, Count 5671
MaTps 1132.05, Tps 1339.00, Resp Time 0.004, Err 0%, Count 7010
MaTps 1113.54, Tps 947.00, Resp Time 0.005, Err 0%, Count 7957
MaTps 1129.59, Tps 1274.00, Resp Time 0.004, Err 0%, Count 9231
MaTps 1111.73, Tps 951.00, Resp Time 0.005, Err 0%, Count 10182
MaTps 1128.96, Tps 1284.00, Resp Time 0.004, Err 0%, Count 11466
MaTps 1107.56, Tps 915.00, Resp Time 0.005, Err 0%, Count 12381
MaTps 1091.09, Tps 942.83, Resp Time 0.004, Err 0%, Count 13453
MaTps 1096.88, Tps 1149.00, Resp Time 0.005, Err 0%, Count 14602
MaTps 1080.39, Tps 932.00, Resp Time 0.005, Err 0%, Count 15534
Any chance you could also run these benchmarks on Linux?
All tests performed in VM (Linux phantom 4.4.1-2-ARCH #1 SMP PREEMPT Wed Feb 3 13:12:33 UTC 2016 x86_64 GNU/Linux)
Compile options: nim c
asynchttpserver.nim without modifications
$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 4.33ms 2.65ms 18.00ms 92.01%
Req/Sec 1.26k 77.61 1.52k 70.50%
25121 requests in 10.04s, 3.07MB read
Requests/sec: 2501.04
Transfer/sec: 312.63KB
$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 4.34ms 2.63ms 17.87ms 91.91%
Req/Sec 1.25k 71.14 1.41k 68.50%
24985 requests in 10.02s, 3.05MB read
Requests/sec: 2492.59
Transfer/sec: 311.57KB
asynchttpserver.nim with sleepAsync(1000)
$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.93ms 7.87ms 67.07ms 95.25%
Req/Sec 750.77 126.30 1.18k 82.50%
14990 requests in 10.03s, 1.83MB read
Requests/sec: 1494.65
Transfer/sec: 186.83KB
$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 8.70ms 11.71ms 104.46ms 95.22%
Req/Sec 749.78 155.38 1.21k 87.50%
14972 requests in 10.04s, 1.83MB read
Requests/sec: 1491.59
Transfer/sec: 186.45KB
asynchttpserver.nim with sleepAsync(1000)
timers' seq[tuple] replaced with DoubleLinkedList[tuple]
$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.36ms 11.20ms 82.74ms 94.19%
Req/Sec 0.98k 182.66 1.25k 86.00%
19570 requests in 10.02s, 2.39MB read
Requests/sec: 1953.59
Transfer/sec: 244.20KB
$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.81ms 12.31ms 90.08ms 93.62%
Req/Sec 0.97k 219.84 1.22k 86.50%
19450 requests in 10.03s, 2.37MB read
Requests/sec: 1939.64
Transfer/sec: 242.45KB
asynchttpserver.nim with sleepAsync(1000)
timers' seq[tuple] replaced with DoubleLinkedList[tuple]
epochTime() call only once in processTimers
$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.37ms 11.43ms 84.06ms 94.12%
Req/Sec 1.00k 203.15 1.16k 87.50%
19858 requests in 10.03s, 2.42MB read
Requests/sec: 1980.45
Transfer/sec: 247.56KB
$ ./wrk -d 10s http://127.0.0.1:5555
Running 10s test @ http://127.0.0.1:5555
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.70ms 12.49ms 93.67ms 93.65%
Req/Sec 1.00k 223.59 1.16k 87.00%
20046 requests in 10.03s, 2.45MB read
Requests/sec: 1998.31
Transfer/sec: 249.79KB
It looks like the DoubleLinkedList and epochTime changes don't give as much of a performance gain on Linux as they do on Windows.
Are you compiling with -d:release?
All tests were made without -d:release.
Can you try it with -d:release?
All tests performed in VM (Linux phantom 4.4.1-2-ARCH #1 SMP PREEMPT Wed Feb 3 13:12:33 UTC 2016 x86_64 GNU/Linux)
Compile options: nim c -d:release
asynchttpserver.nim without modifications
$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 534.45us 818.53us 6.34ms 93.52%
Req/Sec 14.04k 434.97 14.84k 73.50%
279579 requests in 10.01s, 34.13MB read
Requests/sec: 27941.37
Transfer/sec: 3.41MB
$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 533.20us 818.79us 6.60ms 93.51%
Req/Sec 14.12k 427.61 14.86k 78.00%
281196 requests in 10.01s, 34.33MB read
Requests/sec: 28100.21
Transfer/sec: 3.43MB
asynchttpserver.nim with sleepAsync(1000)
$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.64ms 728.00us 8.73ms 92.51%
Req/Sec 3.17k 609.85 7.41k 95.50%
63221 requests in 10.01s, 7.72MB read
Requests/sec: 6315.75
Transfer/sec: 789.47KB
$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.67ms 816.26us 9.30ms 92.85%
Req/Sec 3.15k 607.34 7.69k 97.00%
62890 requests in 10.03s, 7.68MB read
Requests/sec: 6272.39
Transfer/sec: 784.05KB
asynchttpserver.nim with sleepAsync(1000)
timers' seq[tuple] replaced with DoubleLinkedList[tuple]
$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.32ms 1.67ms 15.50ms 95.70%
Req/Sec 4.81k 709.39 9.28k 95.00%
95678 requests in 10.00s, 11.68MB read
Requests/sec: 9565.12
Transfer/sec: 1.17MB
$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.32ms 1.63ms 15.02ms 95.73%
Req/Sec 4.79k 679.59 8.87k 92.00%
95290 requests in 10.00s, 11.63MB read
Requests/sec: 9527.84
Transfer/sec: 1.16MB
asynchttpserver.nim with sleepAsync(1000)
timers' seq[tuple] replaced with DoubleLinkedList[tuple]
epochTime() call only once in processTimers
$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.36ms 2.63ms 21.53ms 94.30%
Req/Sec 6.22k 841.87 9.85k 80.00%
123903 requests in 10.01s, 15.12MB read
Requests/sec: 12379.82
Transfer/sec: 1.51MB
$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.39ms 2.67ms 21.40ms 94.22%
Req/Sec 6.11k 788.67 9.91k 75.50%
121520 requests in 10.00s, 14.83MB read
Requests/sec: 12148.23
Transfer/sec: 1.48MB
OK, I have some results, currently only for Linux systems. I have implemented timerfd handles and created a new proc in asyncdispatch.nim:
proc withTimeout*[T](fut: Future[T], timeout: int): Future[void]
So now I can control closing the timerfd handles without implementing future.cancel() :)
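For comparison, here is a portable sketch of what a withTimeout-style helper can look like when built on sleepAsync callbacks rather than timerfd (returning Future[bool] so the caller can tell which side finished first; this is not the timerfd-based implementation described above):

```nim
import asyncdispatch

proc withTimeout[T](fut: Future[T], timeout: int): Future[bool] =
  ## completes with true if `fut` finishes within `timeout` ms, false otherwise
  var retFuture = newFuture[bool]("withTimeout")
  let timeoutFuture = sleepAsync(timeout)
  fut.callback =
    proc () =
      if not retFuture.finished:
        if fut.failed:
          retFuture.fail(fut.error)   # propagate the error from the wrapped future
        else:
          retFuture.complete(true)    # the wrapped future won the race
  timeoutFuture.callback =
    proc () =
      if not retFuture.finished:
        retFuture.complete(false)     # the timer fired first
  return retFuture
```

A caller keeps the original future around to read the value, e.g. start lineFut = client.recvLine(), then check if await lineFut.withTimeout(1000) before calling lineFut.read().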
Benchmark results in VM (Linux phantom 4.4.1-2-ARCH #1 SMP PREEMPT Wed Feb 3 13:12:33 UTC 2016 x86_64 GNU/Linux)
Compile options: nim c
$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.99ms 1.80ms 27.92ms 93.33%
Req/Sec 1.32k 39.82 1.44k 88.00%
26293 requests in 10.01s, 3.21MB read
Requests/sec: 2625.45
Transfer/sec: 328.18KB
$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.99ms 1.81ms 27.59ms 93.24%
Req/Sec 1.32k 38.53 1.43k 89.50%
26284 requests in 10.02s, 3.21MB read
Requests/sec: 2623.27
Transfer/sec: 327.91KB
Compile options: nim c -d:release
$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 616.43us 843.42us 6.60ms 93.96%
Req/Sec 11.23k 448.22 11.72k 85.50%
223473 requests in 10.00s, 27.28MB read
Requests/sec: 22338.98
Transfer/sec: 2.73MB
$ ./wrk -d 10 http://127.0.0.1:6666
Running 10s test @ http://127.0.0.1:6666
2 threads and 10 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 619.19us 845.30us 6.72ms 93.99%
Req/Sec 11.15k 484.71 11.64k 95.00%
222033 requests in 10.00s, 27.10MB read
Requests/sec: 22195.14
Transfer/sec: 2.71MB
This is now fixed I think.
This is not fixed actually.