DynamoDB latency
davidporter-id-au opened this issue · 15 comments
Both in production and in a simplified test we're seeing typical latency for DynamoDB requests of around ~60 milliseconds.
This is certainly higher than what DynamoDB is reporting in its graphs and appears higher than most other naive implementations.
Python with the boto library took about ~12ms, roughly five times faster on average. Are we doing something wildly wrong?
Testing info:
region: ap-southeast-2
read IOPS were well above the usage (only one request per second).
node version: 5.6 (also tested with 0.12.x, 4.x)
AWS sdk version: 2.2.35
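For reference, the simplified test is essentially a timing loop like the one below (a rough sketch of the methodology rather than the exact script; the table name and key are placeholders):

var AWS = require('aws-sdk');

var dynamo = new AWS.DynamoDB({ region: 'ap-southeast-2' });

// Placeholder table and key; substitute whatever the real test reads.
var params = {
  TableName: 'latency-test',
  Key: { id: { S: 'test-item' } }
};

function measureOnce() {
  var start = process.hrtime();
  dynamo.getItem(params, function(err, data) {
    var diff = process.hrtime(start);
    var ms = diff[0] * 1e3 + diff[1] / 1e6;
    if (err) return console.error(err);
    console.log('getItem took ' + ms.toFixed(1) + ' ms');
  });
}

// Roughly one request per second, matching the provisioned read load.
setInterval(measureOnce, 1000);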
@davidporter-id-au
By default, keepAlive is not enabled in node. Enabling this should speed things up as existing sockets will be reused. Do you mind testing this option out? You can do this by instantiating your DynamoDB client with the following options:

var https = require('https');
var AWS = require('aws-sdk');

var dynamo = new AWS.DynamoDB({
  region: "ap-southeast-2",
  httpOptions: {
    agent: new https.Agent({
      rejectUnauthorized: true,
      keepAlive: true
    })
  }
});
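If it's easier, the same agent can also be set once on the global configuration so every client created afterwards picks it up (a sketch using AWS.config.update; adjust as needed):

var AWS = require('aws-sdk');
var https = require('https');

// Every service client created after this call will reuse the
// keep-alive agent instead of opening a new TLS connection per request.
AWS.config.update({
  httpOptions: {
    agent: new https.Agent({ keepAlive: true })
  }
});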
So, I've learned a couple of things:
Versions matter:
This is a comparison of the versions: pretty picture and data. (Incidentally, the number of requests differs between versions because node keeps crashing with the keepalive change, particularly on 0.12.9.)
For all tests here I've added the change you've suggested.
- I was rather wrong when I said it affected 0.12.x to 5.x; it appears to be version specific. Node 0.12.9 exhibits the ~60 millisecond behaviour. We're running node 0.12.9 in production, which is why I was originally investigating that specific version.
- The performance for node 5.6.0, 4.3.0 and 4.2.0 appears better. Anecdotally it averages around ~15 milliseconds, still not great, but significantly better.
- Node 5.6.0 has very high variance in latency, more than any other version I can see with my tiny sample.
- Node 4.x seems the best behaved of all the versions, but it still lags behind Python's performance even with the keepalive changes.
Keepalive doesn't seem to matter, and breaks occasionally
I added the changes you suggested with keepalive. It appears to make no difference to performance on any of the node versions I've checked. It does, however, now throw the following error intermittently:
$ node dynamo.js > results/node-4.2.0.log
events.js:141
throw er; // Unhandled 'error' event
^
Error: read ECONNRESET
at exports._errnoException (util.js:874:11)
at TLSWrap.onread (net.js:544:26)
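For what it's worth, here is a sketch of the same request path with retries and an explicit error handler (placeholder table and key; I'm assuming the reset comes from an idle keep-alive socket being closed at the other end, and that ECONNRESET inside a request is treated as a retryable networking error by the SDK):

var AWS = require('aws-sdk');
var https = require('https');

var dynamo = new AWS.DynamoDB({
  region: 'ap-southeast-2',
  maxRetries: 3,
  httpOptions: {
    agent: new https.Agent({ keepAlive: true })
  }
});

dynamo.getItem({
  TableName: 'latency-test',        // placeholder table
  Key: { id: { S: 'test-item' } }   // placeholder key
}, function(err, data) {
  if (err) {
    // Only reached after the SDK has exhausted its retries.
    console.error('request failed:', err.code);
  }
});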
All tests were run on an EC2 instance under the same conditions as the first test, except that I took Node.js outside of Docker and ran it on the bare EC2. This appeared to improve performance by about ~5ms when I compared the same version of node (5.6.0) both inside and outside a container.
@chrisradek So I ran the while loop again, both at a 'high' rate of once every 10 ms and a 'low' rate of once every second, and made a flame graph of what's going on*.
Graphs for Node 5.7.0: high freq, low freq
All the results across node versions look similar. The thing that immediately stands out, to my fairly ignorant perspective, is that a hell of a lot of CPU time is being spent on credentials. The call stack is huge and the share of CPU time spent there is far higher than I'd expect. Just my 2c, but I'd be slightly concerned that something in the way credentials are being validated, rotated and expired is hurting performance badly.
It's worth noting I'm running this on an Ubuntu EC2 instance (standard AMI) with an instance role that has permission to access DynamoDB, so it is obviously reaching out to the metadata endpoint at some point in the lifecycle to get creds.
* I took care to wait 5 seconds into the loop to account for the CPU time spent on require() calls, metadata-endpoint credential fetches and any other one-time operations at startup which would skew the CPU utilization. This is the shell script used to make the graph; note the 5-second sleep.
@davidporter-id-au
Please correct me if I'm misreading the graph, but I don't think it is necessarily showing that a lot of time is spent on retrieving credentials.
For each operation, the SDK will validate credentials to determine whether they should be refreshed, so we would expect a large number of credential-related calls. The SDK also gets credentials when signing the request (this may be something that can be optimized), but again, it checks whether credentials need to be refreshed before actually retrieving them. The graph doesn't seem to show how much CPU time is actually spent in each method call, but rather how many times each method was called.
How long were these tests run for? Since you're grabbing credentials using the EC2 metadata service, we should see the lib/credentials/ec2_metadata_credentials.js file referenced in the graph if credentials were actually being retrieved, which I don't see at all in this graph. The SDK does cache credentials, so they shouldn't be re-fetched from the metadata service on every request.
Based on this graph alone, I can't say that the SDK is not spending a lot of CPU time on getting credentials; I just don't think it tells us much beyond how many times those methods are called.
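One way to take the metadata endpoint out of the equation during the measurement window (a sketch, assuming instance-role credentials and the SDK's global configuration) is to resolve credentials once before the timing loop starts:

var AWS = require('aws-sdk');

// Resolve credentials once up front. After this, per-request credential
// handling is just an in-memory expiry check until the cached
// credentials approach expiration.
AWS.config.getCredentials(function(err) {
  if (err) return console.error('could not resolve credentials:', err);
  console.log('credentials cached, starting test');
  // kick off the request loop here
});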
Did you create flame graphs for other versions of Node.js as well? Specifically those that take ~60ms vs ~15ms?
@chrisradek
I must admit to being fairly new to interpreting the data as well, but to quote the guy who wrote the graphing library:
"The width of the box shows the total time it was on-CPU or part of an ancestry that was on-CPU (based on sample count). Functions with wide boxes may consume more CPU per execution than those with narrow boxes, or, they may simply be called more often. The call count is not shown (or known via sampling)."
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
Regarding the node version issue, I did node versions 0.12.9, 4.3.0 and 5.7.0. To me they all look similar in basic structure, and in that the credential calls dominate CPU time. The repo has all the test cases.
Ah, sorry, I forgot to mention: the tests ran for 30 seconds.
Thinking about your point about the frequency of the calls versus the CPU time consumed per call, you're right, I don't think I can make that determination. Nonetheless, would you expect the total CPU time spent on those calls, whether it comes from many cheap calls or a few expensive ones, to be distributed like that?
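One way to settle that outside of the flame graph (a sketch using the SDK's config.getCredentials helper; with warm, cached instance-role credentials I'd expect this to be well under a millisecond) is to time the credential check directly:

var AWS = require('aws-sdk');

// Time a single credential check. Once credentials are cached this is
// just an in-memory expiry check; a slow result here would point at
// credential handling rather than the network.
var start = process.hrtime();
AWS.config.getCredentials(function(err) {
  var diff = process.hrtime(start);
  var ms = diff[0] * 1e3 + diff[1] / 1e6;
  if (err) return console.error('credential resolution failed:', err);
  console.log('credential check took ' + ms.toFixed(3) + ' ms');
});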
I hope this is not too off-topic: I have a similar issue with the Lambda client. A test lambda calls a noop lambda with an overhead of ~ 50 ms.
The logs of the noop lambda show a duration of <1 ms per call (warmed). My measurements in the test lambda show >45 ms.
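The measurement is roughly the following (a sketch; 'noop', the region and the payload are placeholders for my setup):

var AWS = require('aws-sdk');

var lambda = new AWS.Lambda({ region: 'eu-west-1' }); // placeholder region

var start = process.hrtime();
lambda.invoke({
  FunctionName: 'noop', // placeholder: a warmed function that returns immediately
  Payload: '{}'
}, function(err, data) {
  var diff = process.hrtime(start);
  var ms = diff[0] * 1e3 + diff[1] / 1e6;
  if (err) return console.error(err);
  // The noop's own logs report a Duration of <1 ms; the difference between
  // that and this number is the client-side overhead in question.
  console.log('invoke round trip: ' + ms.toFixed(1) + ' ms');
});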
I too am experiencing this issue.
To see if it helps, I hardcoded some credentials and ran the tests again, and I managed to get a 5-10% increase in throughput.
Clearly I don't want to run any production service with hardcoded credentials, so I'll have to find a way around this.
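For reference, the hardcoding experiment amounts to something like this (placeholder key values only, obviously not something to ship):

var AWS = require('aws-sdk');

// Placeholder values for illustration only; never commit real keys.
AWS.config.credentials = new AWS.Credentials(
  'AKIA_PLACEHOLDER',
  'SECRET_PLACEHOLDER'
);

// With static credentials the SDK never has to talk to the instance
// metadata service, which is presumably where the 5-10% difference comes from.
var dynamo = new AWS.DynamoDB({ region: 'ap-southeast-2' });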
Whoops, working better now!
With SDK version 2.6.3 and keep-alive enabled, the overhead dropped to <10 ms.
Closing old issue. Looks like this has been resolved.
@chrisradek do you know how long the keep-alive time is? It looks like the connection is only kept open for about 5 seconds.
@wjrjerome
It will depend on how many concurrent connections are being made against a host. There is a maxSockets parameter that can also be passed to the agent, which controls the maximum number of sockets that can be opened to a host at once. By default, node sets this value to Infinity, so each request can end up using a different socket. You can set a lower value to ensure a socket gets reused. You can find more info on maxSockets and maxFreeSockets in the node documentation.
I don't know if you can change how long keep-alive is set for, but node might be closing the connection after requests complete if maxSockets is Infinity.
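Something like the following is what I'd try (a sketch; the numbers are illustrative, not recommendations):

var AWS = require('aws-sdk');
var https = require('https');

var agent = new https.Agent({
  keepAlive: true,
  maxSockets: 50,     // cap concurrent sockets per host instead of Infinity
  maxFreeSockets: 10  // idle keep-alive sockets retained for reuse
});

var dynamo = new AWS.DynamoDB({
  region: 'ap-southeast-2',
  httpOptions: { agent: agent }
});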