akka/akka-persistence-dynamodb

Performance tests


  • Would the shopping cart sample work for performance tests?
  • Should we add http routes to the sample to make it easier to test?
  • K6 as load client?

A bunch of performance testing has been completed now. Everything looks ok. Latencies are predictable, though not as tight as we see with RDS Postgres and the R2DBC plugin. The main limitation is probably that the client is still HTTP/1 based. It would be interesting to retest when an HTTP/2 client is supported for DynamoDB, or to try a pipelining client.

Will add some results and screenshots to this issue.

Testing with the inventory sample, also used for some earlier testing. Load test at ~7k rps, with peaks of ~14k rps. The increased latency at higher throughput can be seen in the graphs.

[Screenshots: load-test-01-gatling-summary, load-test-01-gatling-graphs, load-test-01-response-time, load-test-01-persistence-time]

Same load test with more resources: before 9 x 4 CPU nodes, now 16 x 4 CPU nodes.

[Screenshots: load-test-02-gatling-summary, load-test-02-gatling-graphs, load-test-02-response-time, load-test-02-persistence-time]

Soak test at reasonable throughput. The latency spikes after ~10 minutes appear in all of the initial tests, when throughput is close to capacity. Expect this is DynamoDB splitting partitions.

[Screenshots: soak-test-01-gatling-summary, soak-test-01-gatling-graphs, soak-test-01-response-time, soak-test-01-persistence-time]

A second soak test, running on the same tables and deployment, now shows regular latencies:

[Screenshots: soak-test-02-gatling-summary, soak-test-02-gatling-graphs, soak-test-02-response-time, soak-test-02-persistence-time]

Under-provisioned test, where the throughput is higher than the provisioned capacity on the table.

At first DynamoDB will allow the higher throughput, using burst capacity, before throttling the writes.

Throttled write errors are retried in the DynamoDB client. In this test, enough progress is made that the journal circuit breaker is not tripped, but requests are timing out on the ask timeout of 5 seconds.

[Screenshots: under-provisioned-01-write-usage, under-provisioned-01-throttling, under-provisioned-01-gatling-summary, under-provisioned-01-gatling-graphs, under-provisioned-01-response-time, under-provisioned-01-persistence-time]
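To show where these ask timeouts come from, here is a minimal sketch of the request path, with assumed names rather than the inventory sample's actual code: a service asks a sharded event-sourced entity with a 5 second ask timeout, and if persisting the event is delayed by throttled-write retries inside the DynamoDB client, the ask fails with a timeout even though the journal circuit breaker never trips.

```scala
import scala.concurrent.Future
import scala.concurrent.duration._

import akka.Done
import akka.actor.typed.{ ActorRef, ActorSystem }
import akka.cluster.sharding.typed.scaladsl.{ ClusterSharding, EntityTypeKey }
import akka.util.Timeout

// Hypothetical command protocol, standing in for the sample's entity.
object Inventory {
  sealed trait Command
  final case class AddItem(itemId: String, quantity: Int, replyTo: ActorRef[Done]) extends Command

  val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("Inventory")
}

class InventoryService(system: ActorSystem[_]) {
  private val sharding = ClusterSharding(system)

  // Same ask timeout as in the test: if the persist is delayed by throttled-write
  // retries in the DynamoDB client, this ask fails with a timeout.
  private implicit val askTimeout: Timeout = 5.seconds

  def addItem(inventoryId: String, itemId: String, quantity: Int): Future[Done] =
    sharding
      .entityRefFor(Inventory.TypeKey, inventoryId)
      .ask[Done](replyTo => Inventory.AddItem(itemId, quantity, replyTo))
}
```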

Another under-provisioned test, with a longer ask timeout. This is an open-model test, so the number of virtual users increases when the system can't keep up. But most of the errors are connections being refused by the load balancer in Kubernetes, which will be preventing overload on the application itself.

[Screenshots: under-provisioned-02-gatling-summary, under-provisioned-02-gatling-graphs, under-provisioned-02-response-time, under-provisioned-02-persistence-time]

Test with projections. Provided there are enough resources on the application side, things keep up fine. If the deployment is under-provisioned, it can persist events faster than projections can keep up, since projections do more database work given the additional backtracking queries. Consumer lag (wait time in the Cinnamon metrics) will increase, and eventually projections will start failing once it gets too far behind the backtracking window.

In this test run, there are enough resources to maintain the projections. Publishing events is disabled at this throughput. The projection envelope source is distinguished in the metrics (query, pubsub, or backtracking). The latency spikes part way through should be partition splitting in DynamoDB.

[Screenshots: projection-test-01-gatling-graphs, projection-test-01-overview-akka-http, projection-test-01-overview-akka-persistence, projection-test-01-overview-akka-projections, projection-test-01-overview-akka-projections-tooltip]

For a hotspot test, a single entity instance hotspot will likely not be throttled, given its lower WCU usage. Given an average write latency of 5 ms, and with events for a single entity persisted sequentially, the throughput of a single entity instance can only be around 200/s. With larger payloads (consuming more WCUs), or multiple entity hotspots that are mapped to the same partition by DynamoDB, the partition throughput limit could then also cause throttling. Throttling errors are retryable, so they will be retried in the client. This shouldn't necessarily cause client errors, but it will increase latencies and further limit the throughput.
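As a back-of-the-envelope check on those numbers (illustrative only, assuming DynamoDB's documented per-partition write limit of about 1000 WCU/s and roughly 1 WCU per 1 KB written):

```scala
// Rough hotspot arithmetic, with illustrative values.
// Events for one entity are persisted one after another, so the average write
// latency puts a ceiling on per-entity throughput.
val avgWriteLatencyMs = 5.0
val maxWritesPerSecondPerEntity = 1000.0 / avgWriteLatencyMs // ~200/s for a single entity

// With a larger (hypothetical) payload, each write consumes more WCUs, and a single
// hot entity gets closer to the per-partition write limit (~1000 WCU/s).
val payloadSizeKb = 4.0
val wcuPerWrite = math.ceil(payloadSizeKb) // ~1 WCU per 1 KB written
val partitionWcuPerSecond = maxWritesPerSecondPerEntity * wcuPerWrite // 800 WCU/s
```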

"eventually projections will start failing once it gets too far behind the backtracking window"

Will it eventually catch up if we stop writing at high throughput?
Is this the akka.persistence.dynamodb.query.backtracking.window?
Or is it akka.projection.dynamodb.offset-store.time-window?

Maybe we can revisit the windows for dynamodb, since there are some differences compared to r2dbc, such as lazy loading of offsets and each BySliceQuery being for a single slice.

Yes, that should have been that it's failing on backtracking once it's outside the offset store time window (5 minutes). Envelopes are rejected from backtracking because of unexpected sequence numbers. It doesn't recover on its own, but can be recovered by restarting with a big enough backtracking window. We saw similar behavior with r2dbc as well.
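For concreteness, a hedged sketch of that recovery override, using the backtracking window setting referenced above (the 30 minutes is purely illustrative, not a recommendation; the offset store time window mentioned above is the separate akka.projection.dynamodb.offset-store.time-window setting):

```scala
import com.typesafe.config.ConfigFactory

// Restart with a backtracking window large enough to cover how far behind the
// projection got, so the previously rejected envelopes can be re-delivered.
// The value here is illustrative only.
val recoveryOverride = ConfigFactory.parseString(
  "akka.persistence.dynamodb.query.backtracking.window = 30 minutes")

// Layer it over the normal configuration when starting the ActorSystem, for example:
// recoveryOverride.withFallback(ConfigFactory.load())
```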

Don't know if we can improve that, but created a separate issue #67

Issues and fixes for projections falling behind have been separated out. Projections are the weakest point in terms of performance, requiring plenty of resources. Otherwise I think performance testing is covered. Closing this issue.