Requires Python 3.7. Enter prototype directory and python3 compare_algorithms.py
Everything exists in the prototype
directory and no additional downloads needed.
I'm comparing three different algorithms in three different configurations.
- The three algorithms are:
- TCP_NODELAY disabled (Nagle)
- TCP_NODELAY enabled
- TCP_NODELAY enabled with manual coalescing
- The three configurations are:
- Simple - spinning in a for loop until we get through all messages
- Threaded - using threads with a queue to share messages
- Async - using asyncio instead of threads
I test the time it takes to send 1 million messages to a (local) server, and then calculate throughput and latencies.
The algorithm I use for coalescing is to collect messages until 10 microseconds have elapsed. The non-coalescing versions send as soon as the data is received.
From an arbitrary run:
Run time (s): async coalescing tester : 4.4608571529 async no_delay tester : 6.3047845364 async nagle tester : 5.3005564213 threaded coalescing tester : 8.2327237129 threaded no_delay tester : 17.2464625835 threaded_nagle_tester : 17.6398150921 simple coalescing : 3.5081682205 simple no_delay : 4.9324331284 simple nagle : 3.4172878265 Throughput (ops/s): async coalescing tester : 224172.1637154853 async no_delay tester : 158609.7025572694 async nagle tester : 188659.4388440702 threaded coalescing tester : 121466.4836171429 threaded no_delay tester : 57982.9049091082 threaded_nagle_tester : 56689.9366450048 simple coalescing : 285049.0447267574 simple no_delay : 202739.6974225406 simple nagle : 292629.7258996351 Min (s): async coalescing tester : 0.0002143383 async no_delay tester : 0.0004105568 async nagle tester : 0.0003552437 threaded coalescing tester : 0.0006275177 threaded no_delay tester : 0.0006711483 threaded_nagle_tester : 0.0006721020 simple coalescing : 0.0000052452 simple no_delay : 0.0000040531 simple nagle : 0.0000047684 Max (s): async coalescing tester : 0.0975129604 async no_delay tester : 0.0977141857 async nagle tester : 0.1080279350 threaded coalescing tester : 0.1073188782 threaded no_delay tester : 0.0969240665 threaded_nagle_tester : 0.0984201431 simple coalescing : 0.1063480377 simple no_delay : 0.1006393433 simple nagle : 0.1041200161 Avg (s): async coalescing tester : 0.0005091909 async no_delay tester : 0.0010126891 async nagle tester : 0.0007697209 threaded coalescing tester : 0.0016748953 threaded no_delay tester : 0.0041312786 threaded_nagle_tester : 0.0041998828 simple coalescing : 0.0000999472 simple no_delay : 0.0012616609 simple nagle : 0.0005747096 P99 (s): async coalescing tester : 0.0010488033 async no_delay tester : 0.0179173946 async nagle tester : 0.0059802532 threaded coalescing tester : 0.0022656918 threaded no_delay tester : 0.0347113609 threaded_nagle_tester : 0.0355284214 simple coalescing : 0.0005433559 simple no_delay : 0.0383641720 simple nagle : 0.0207934380 P9999 (s): async coalescing tester : 0.0770001411 async no_delay tester : 0.0970644951 async nagle tester : 0.1075153351 threaded coalescing tester : 0.1071014404 threaded no_delay tester : 0.0965712070 threaded_nagle_tester : 0.0980396271 simple coalescing : 0.0188133717 simple no_delay : 0.0619621277 simple nagle : 0.0408947468
Based on the test runs, it appears we can get a modest boost to throughput without significantly affecting latency. Manually coalescing messages won for all three configurations. While the simple configuration affords us the greatest throughput, it's also not very practical as in the real world, most requests don't come from spinning through a single loop. Asyncio pretty clearly beats threads which makes sense given that there's a greater overhead to context switching threads without much parallelism benefit due to the GIL.
I had a hard time getting much more than 200k ops/s, even for a simple use case and while waiting for a much longer period of time. It seems there's a lot more overhead to using python in general, including just sending data on a socket. It'd be interesting to further see what kind of performance boost could be gained cythonizing parts of this code.
All of these tests were run locally with the server and client on the same developer laptop with unrelated processes running in the background. Ideally, I would have had client and server on different machines so one isn't affecting the performance of the other. It'd be best if neither of those machines was my laptop so the test can be repeatable and benchmarkable. Client and server on a laptop is also especially bad for simulating Nagle's algorithm, because the algorithm involves waiting for ACKs. Since there's no network latency in localhost, we'll get the ACK back almost immediately which isn't realistic.
I played with some other algorithms and configurations, but since the results weren't notable or comparable, and I didn't want the test runs to to take forever, I didn't keep them.
Among the failed attempts are:
- Setting TCP_CORK
- I attempted setting TCP_CORK and then disabling it immediately after sending all of the messages. This yielded better results than TCP_NODELAY on its own, but worse than manually coalescing. I'm assuming this is because the socket send call is the bottleneck.
- Using Multiprocessing
- Threading is bad because of the GIL, but IPC is worse.
- Too many timers
- My initial stab at asyncio involved creating a timer for every batch of messages and then sending the batch once the timer expired. The constant creating and destroying of timers took way too much time.