Benchmarking: Collect throughput metrics for AWS MSK
Description
We want to test and benchmark producing and consuming against Kafka. The goal is to establish a throughput baseline, and we'll use the cloud provider AWS MSK for this.
We MUST collect throughput and usage metrics.
Design
To keep things simple, we could start by creating a single command under `cmd/queuebench`. We may retrofit the functionality into `testing.B`, but it'll be quicker and less cumbersome to start with a single command that runs a single benchmark for some time and prints the results.
We'll run a single producer and a single consumer for the benchmark and collect the metrics using a manual OTLP reader.
After the benchmark has finished, we'll collect the metrics that we want to print and perform any calculations (such as events/s and bytes/s).
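For the manual reader, a sketch along these lines should work with the OpenTelemetry Go SDK (the wiring into the producer/consumer instrumentation is left out and the names are only illustrative):

```go
package main

import (
	"context"
	"fmt"

	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func main() {
	ctx := context.Background()

	// A manual reader only collects metrics when asked, so we can take a
	// single snapshot once the benchmark has finished.
	reader := sdkmetric.NewManualReader()
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(reader))
	defer provider.Shutdown(ctx)

	// ... hand `provider` to the producer/consumer instrumentation and run
	// the benchmark here ...

	// Snapshot all recorded metrics and derive events/s, bytes/s, etc.
	var rm metricdata.ResourceMetrics
	if err := reader.Collect(ctx, &rm); err != nil {
		panic(err)
	}
	for _, sm := range rm.ScopeMetrics {
		for _, m := range sm.Metrics {
			fmt.Println(m.Name) // inspect m.Data for sums/histograms
		}
	}
}
```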
Setup
The idea is to run the benchmark against a single topic (for now) and a single partition (we can expose a `-partitions` flag which allows setting the number of partitions for the topic, but default to `1` for now). The command will:
- Create a unique topic name with the specified number of partitions (`-partitions`).
- Generate N events with the specified byte size (`-event-size`).
- Create a manual reader for the OTLP exporter.
- Create a consumer and start consuming in a goroutine.
- While the timer isn't exceeded, produce events.
- Stop the producer.
- Stop the consumer after all events have been consumed.
- Collect metrics and perform any necessary calculations to print the results (a sketch of this flow follows the list).
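Below is a minimal sketch of that produce-then-drain flow. The `Producer` and `Consumer` interfaces are hypothetical stand-ins for the apm-queue types, and the helper is illustrative rather than the actual queuebench implementation.

```go
package queuebench

import (
	"context"
	"time"
)

// Hypothetical minimal interfaces standing in for the apm-queue producer/consumer.
type Producer interface {
	Produce(ctx context.Context, payload []byte) error
	Close() error
}

type Consumer interface {
	// Run blocks until all produced events have been consumed or ctx expires.
	Run(ctx context.Context) error
}

// runBenchmark produces fixed-size events for duration d while a single
// consumer drains the topic in the background, and returns the number of
// events produced once consumption has caught up.
func runBenchmark(ctx context.Context, p Producer, c Consumer, d time.Duration, eventSize int) (int, error) {
	payload := make([]byte, eventSize) // random bytes in practice

	done := make(chan error, 1)
	go func() { done <- c.Run(ctx) }()

	produced := 0
	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
		if err := p.Produce(ctx, payload); err != nil {
			return produced, err
		}
		produced++
	}
	if err := p.Close(); err != nil {
		return produced, err
	}

	// Wait for the consumer to drain (bounded by the overall benchmark timeout in ctx).
	if err := <-done; err != nil {
		return produced, err
	}
	return produced, nil
}
```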
Inputs
- `-event-size`: event size in bytes, default 1024 (1KB).
- `-partitions`: default 1 (keep it to producing to a single topic for now).
- `-duration`: the amount of time to run the benchmark for.
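As a sketch, the flags could be declared with the standard `flag` package (the 5m default for `-duration` is an assumption, not something specified above):

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

var (
	eventSize  = flag.Int("event-size", 1024, "size of each generated event, in bytes")
	partitions = flag.Int("partitions", 1, "number of partitions for the benchmark topic")
	duration   = flag.Duration("duration", 5*time.Minute, "how long to produce events for")
)

func main() {
	flag.Parse()
	fmt.Printf("event-size=%d partitions=%d duration=%s\n", *eventSize, *partitions, *duration)
}
```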
Output (derived from metrics)
For now, we can just output the metrics in a table format. We may change this over time as we run these benchmarks periodically and in CI.
- Bytes/s
- Events/s
- Produce error rate in %.
- Consumption latency. We could print p50, p90, p99.
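As a rough sketch of the derived output (function and variable names are illustrative, not the actual queuebench code; the latency percentiles would come from the delay histogram and are not shown):

```go
package main

import (
	"fmt"
	"time"
)

// report prints the derived throughput figures from the raw counters
// collected after a run.
func report(producedEvents, producedBytes, produceErrors int64, produceDuration time.Duration) {
	secs := produceDuration.Seconds()
	fmt.Printf("events/s:           %.2f\n", float64(producedEvents)/secs)
	fmt.Printf("bytes/s:            %.2f\n", float64(producedBytes)/secs)
	fmt.Printf("produce error rate: %.2f%%\n",
		100*float64(produceErrors)/float64(producedEvents+produceErrors))
}

func main() {
	// Example numbers only.
	report(1_000_000, 1_000_000*1024, 0, 5*time.Minute)
}
```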
Adding some questions here:
- do we have tooling to generate data?
- I expect we are not yet collecting all the needed metrics for this benchmark, is that correct?
- Do we store those metrics long term?
- How should this benchmark system be monitored?
@endorama I've updated the issue to include more details. It was very brief before and contained incorrect metrics to collect.
do we have tooling to generate data?
My thinking is that we can keep that as a variable and just generate random bytes to produce and have the size be user-specified.
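Something as simple as this should do for the payload generation (a sketch; whether queuebench uses `crypto/rand` or `math/rand` is not specified here):

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newPayload returns eventSize random bytes to use as the record body.
func newPayload(eventSize int) []byte {
	b := make([]byte, eventSize)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return b
}

func main() {
	fmt.Println(len(newPayload(1024)))
}
```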
I expect we are not yet collecting all the needed metrics for this benchmark, is that correct?
After updating the metrics to output, we're recording all the metrics we need to print.
Do we store those metrics long term?
Not for now. We may store the output as .txt files in the future when the benchmarks are set up to run in CI.
How should this benchmark system be monitored?
I'm not sure I understand the question. The current task only covers the implementation of a command to run benchmarks against an AWS MSK Serverless cluster. We're not going to be working on the CI bit just yet, so I think this doesn't apply.
I think we should collect some numbers before closing this. The tooling is there, now we need to use it.
I've run the benchmark and collected some results. I also found a possible issue that I'll report later.
NOTE: please refer to these values instead of the ones in this comment
Benchmark results
Environment:
- an MSK cluster created by this terraform code (within the `msk` module in `infra/aws`) using AWS MSK default configurations
- an EC2 t2.micro instance running queuebench, with a production duration of 5m and a timeout of 25m (as instance size matters, is there any guidance on the instance size to use?)
Out of 3 runs:
- produced and consumed an average of 27405450.67 ±0.004% total events
- produced an average of
  - 91,352 ±0.004% events per second
  - 94,451,547.28 ±0.004% bytes per second
- consumed an average of
  - 19,717 ±3.88% events per second
  - 20,386,550.88 ±3.88% bytes per second

Production duration was around 5m: average 300.000015s ±0%.
Consumption duration was around 22m: average 1401.789669s ±3.64%.
Percentiles for consumer delay:
- p50 is between 500ms and 750ms in all 3 cases
- p90 is between 750ms and 1000ms in 1 case, between 1000ms and 2500ms in 2 cases
- p95 is between 750ms and 1000ms in 1 case, between 1000ms and 2500ms in 2 cases
Results come with a caveat: `consumer.messages.delay` is a `Float64Histogram`, so we have limited resolution in the final data. That's why the values read "is between"; those are the histogram boundaries.
The default boundaries for the histogram are:
`0 | 5 | 10 | 25 | 50 | 75 | 100 | 250 | 500 | 750 | 1000 | 2500 | 5000 | 7500 | 10000`
Looking at the results and the distribution of latency in the histogram, I suspect most of our tests will always give a p90 and p95 of 1000ms or 2500ms, as roughly 40% of measurements fall in that area (unless benchmark conditions change or there are relevant performance improvements).
We can be OK with this, but I don't think it provides enough resolution.
The easiest solution would be to change the histogram boundaries in queuebench. I discussed this with @dmathieu and there is a caveat: histogram buckets in otel are global, and we are setting some values in apmotel (as defined by the Elastic APM specs), so we may not be able to use the same boundaries in production and in benchmarks. I'm not sure how relevant this is.
Another solution could be to not rely on otel-collected metrics and use a different approach (either extracting metrics from traces or something custom).
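For the first option, this is a hedged sketch of overriding the boundaries only for this instrument via an SDK view, which would leave the apmotel defaults untouched elsewhere (the boundary values below are purely illustrative; the instrument name is the one mentioned above):

```go
package main

import (
	"context"

	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	reader := sdkmetric.NewManualReader()

	// Override the bucket boundaries only for the delay histogram used by
	// queuebench, without touching the global/apmotel defaults.
	delayView := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "consumer.messages.delay"},
		sdkmetric.Stream{
			Aggregation: sdkmetric.AggregationExplicitBucketHistogram{
				// Illustrative boundaries with finer resolution around the
				// ranges observed in these benchmarks.
				Boundaries: []float64{100, 250, 500, 750, 1000, 1250, 1500, 2000, 2500, 5000},
			},
		},
	)

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(reader),
		sdkmetric.WithView(delayView),
	)
	defer provider.Shutdown(context.Background())
}
```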
@endorama Thanks for posting the results.
an EC2 t2.micro instance running queuebench, with production duration of 5m and timeout 25m (as instance size matters, is there any guidance on the instance size to use?)
I would steer clear from `tx.xxx` instance types. They will rely heavily on CPU credits as soon as any CPU is required. Also, a `t2.micro` instance type will not have a lot of hardware resources assigned to it.
Instead, I'd use general-purpose hardware that's a bit more powerful and where network, CPU, memory, or disk I/O won't be a constraint. An `m6g.large` or `m7g.large` seems like a good fit to begin with.
Is the same instance used to produce and consume?
Is the same instance used to produce and consume?
Yes
I would steer clear from tx.xxx instance types.
Indeed, I already ran the tests on `c6i.xlarge`, which is the same instance type used by apmbench:
https://github.com/elastic/apm-server/blob/e2c2ed056f3bfa2cd8b2c6a218fa700c34552cf4/testing/infra/terraform/modules/benchmark_executor/variables.tf#L15-L19
An m6g.large or m7g.large seems like a good fit to begin with.
Any particular reason to choose this instance type?
Reporting benchmark run on a `c6i.xlarge`, duration: `5m`, timeout: `25m`, event-size: `1024`, partitions: `1`.
- production
  - produced 27533659 events in 300.000631471 seconds
  - produced events per second: 91778.6701481046
  - produced MB per second: 94.89341383520357
- consumption
  - consumed 27533659 events in 925.301167017 seconds
  - consumed events per second: 29756.429562024037
  - consumed MB per second: 30.76629003373663
- error rate
  - not consumed: 0 (0.00%)
- consumption latency (seconds)
  - p50: 362.039
  - p90: 724.077
  - p99: 724.077
@marclop FWIW @endorama and I discussed the burstiness of t2 last week, and I suggested that, given the low variance in the results, they should be reliable. Otherwise, if bursting had impacted the results, we would expect to see higher variance.
Looking at the latest numbers, it seems that was fair for producing but not so much for consuming.
Thanks for posting the results! seems like producing to a topic with 1 partition has pretty good throughput. Given how the number of partitions controls the maximum concurrency for a given topic, I'd be interested to see the throughput with 2 and 4 partitions and see how that compares.
Is there a way to run this centrally? Or are you running them from your workstation?
I'm running them from a dedicated EC2 machine (infra setup in https://github.com/elastic/apm-managed-service/pull/274). I don't know if we have a team-shared EC2 keypair, or we could add team members' SSH keys to a cloud-init script for the instance.
I'd be interested to see the throughput with 2 and 4 partitions and see how that compares.
Reporting benchmark run on a `c6i.xlarge`, 1 producer and 1 consumer with different partitions: 1, 2, 4.

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 2400 | 1024 | 1 | 1338.49317934 | 300.000007736 | 1338.400658493 | 27845782 | 27845782 | 95.97 | 21.51 | 724.077 | 1024 | 1024 |
300 | 2400 | 1024 | 2 | 2244.794431333 | 300.000009827 | 2244.701624293 | 54806738 | 54806738 | 188.89 | 25.24 | 1024 | 2048 | 2048 |
300 | 3600 | 1024 | 4 | 3900.106025814 | 300.000473461 | 3900.105787347 | 104678128 | 84430900 | 360.73 | 22.4 | 2048 | 4096 | 4096 |
300 | 4800 | 1024 | 4 | 5064.633432419 | 300.000393132 | 5064.501086936 | 106683669 | 106683669 | 367.65 | 21.78 | 2896.31 | 5792.62 | 5792.62 |
I had to increase the timeout when increasing `partitions`, as production throughput grows a lot and consumption needs more time to keep up. The test with 4 partitions wasn't able to consume all produced records and timed out (after 1h). I included the result as we were speaking about production; meanwhile, I'm running the test again with a higher timeout.
Are we interested in running the benchmark with different producer/consumer configurations? (e.g. 1 consumer per partition)
Updated table: added the 4-partition run with timeout 4800, which completed successfully.
@endorama Thanks! The producer numbers look good. Regarding consumer concurrency: I think it would be good to first figure out what the limit for one consumer is - we should be able to do better than 25MB/s.
Could you try increasing MaxPollRecords? Default is 100, so let's try 1000 to see the impact - https://github.com/elastic/apm-queue/blob/70febd6d78e11225cf887d7627993d215758571a/kafka/consumer.go#L54C2-L54C16
We could also try increasing https://pkg.go.dev/github.com/twmb/franz-go/pkg/kgo#FetchMinBytes for higher throughput. I don't think it should make a difference in the benchmarks, as the producer is outpacing the consumer quite significantly, so there should always be more than the minimum amount to fetch.
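A hedged sketch of what those two knobs look like as franz-go client options (how exactly queuebench would pass extra `kgo` options to the apm-queue consumer is an assumption here; `MaxPollRecords` itself is the `kafka.ConsumerConfig` field linked above):

```go
package main

import (
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Client-level fetch tuning; these options exist in franz-go, but the way
	// they get wired into the apm-queue consumer is not shown here.
	opts := []kgo.Opt{
		kgo.FetchMinBytes(1),                     // don't wait for a minimum batch size
		kgo.FetchMaxWait(100 * time.Millisecond), // lower fetch.max.wait.ms
	}
	_ = opts
}
```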
Reporting benchmark run on a `c6i.xlarge`, 1 producer and 1 consumer with different partitions: 1, 2, 4, Kafka consumer `MaxPollRecords` set to `1000`.

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.692624063 | 300.000008996 | 301.600596564 | 27679505 | 27679505 | 95.4 | 94.89 | 2 | 4 | 4 |
300 | 2400 | 1024 | 2 | 315.899254262 | 300.003819575 | 315.80385852 | 54699840 | 54699840 | 188.51 | 179.08 | 16 | 22.6274 | 22.6274 |
300 | 4800 | 1024 | 4 | 724.519516157 | 300.000006448 | 724.40076306 | 105248199 | 105248199 | 362.7 | 150.21 | 256 | 512 | 512 |
I also tried increasing FetchMinBytes to 10.
Reporting benchmark run on a `c6i.xlarge`, 1 producer and 1 consumer with different partitions: 1, 2, 4, Kafka consumer `MaxPollRecords` set to `1000`, `FetchMinBytes(10)`.

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.690593174 | 300.00000779 | 301.600345584 | 27547301 | 27547301 | 94.94 | 94.44 | 2.82843 | 4 | 5.65685 |
300 | 2400 | 1024 | 2 | 345.577030678 | 300.000008109 | 345.501031635 | 54632726 | 54632726 | 188.29 | 163.49 | 32 | 45.2548 | 64 |
300 | 4800 | 1024 | 4 | 668.72929976 | 300.000008379 | 668.600260054 | 105607322 | 105607322 | 363.94 | 163.3 | 256 | 362.039 | 362.039 |
I also ran the tests with `MaxPollRecords` set to 10000.
Reporting benchmark run on a `c6i.xlarge`, 1 producer and 1 consumer with different partitions: 1, 2, 4, Kafka consumer `MaxPollRecords` set to `10000`.

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.520444194 | 300.000009211 | 300.401036392 | 27582662 | 27582662 | 95.06 | 94.94 | 0.707107 | 1 | 1.41421 |
300 | 2400 | 1024 | 2 | 300.717424964 | 300.000008759 | 300.600821841 | 54329454 | 54329454 | 187.24 | 186.86 | 1.41421 | 2 | 2 |
300 | 4800 | 1024 | 4 | 362.237852628 | 300.000012872 | 362.100077259 | 104928793 | 104928793 | 361.59 | 299.58 | 64 | 90.5097 | 90.5097 |
Recently I discussed these benchmark results with @simitt and we had 2 doubts:
- whether the consumption delay was reported in `ms` or `s`
- whether increasing `MaxPollRecords` was affecting low-throughput cases

Consumption delay is reported in seconds (I verified this by running a benchmark with a single produced record, so the histogram bucket its consumption delay falls into is comparable to the registered consumption duration).
To verify whether low throughput was being negatively impacted by increasing `MaxPollRecords`, I changed the benchmark production logic to only produce 1 event per second for the specified duration. I ran the benchmark with `MaxPollRecords=100` (default), `MaxPollRecords=1000`, and `MaxPollRecords=10000`.
Here are the results
With `MaxPollRecords=100`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.183734152 | 300.000530789 | 300.101155856 | 300 | 300 | 0 | 0 | 0.00390625 | 0.00552427 | 0.00552427 |
300 | 2400 | 1024 | 2 | 300.245170568 | 300.000985359 | 300.102103178 | 300 | 300 | 0 | 0 | 0.00390625 | 0.00390625 | 0.00390625 |
300 | 4800 | 1024 | 4 | 300.226847042 | 300.000155446 | 300.100669628 | 300 | 300 | 0 | 0 | 0.00552427 | 0.00552427 | 0.00552427 |
With `MaxPollRecords=1000`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.208345594 | 300.000716583 | 300.100997146 | 300 | 300 | 0 | 0 | 0.00552427 | 0.0078125 | 0.0078125 |
300 | 2400 | 1024 | 2 | 300.175919277 | 300.000593529 | 300.100772963 | 300 | 300 | 0 | 0 | 0.00552427 | 0.00552427 | 0.0078125 |
300 | 4800 | 1024 | 4 | 300.214639527 | 300.000472654 | 300.101007325 | 300 | 300 | 0 | 0 | 0.00552427 | 0.0078125 | 0.0078125 |
With `MaxPollRecords=10000`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.191874379 | 300.00073932 | 300.101061813 | 300 | 300 | 0 | 0 | 0.0078125 | 0.0078125 | 0.0078125 |
300 | 2400 | 1024 | 2 | 300.231933353 | 300.00022502 | 300.101230581 | 300 | 300 | 0 | 0 | 0.00552427 | 0.00552427 | 0.00552427 |
300 | 4800 | 1024 | 4 | 300.226024851 | 300.000247006 | 300.100921924 | 300 | 300 | 0 | 0 | 0.00552427 | 0.00552427 | 0.00552427 |
Note that 0 produced/consumed MB/s is a rounding error; the value is smaller than 0.00x.
It seems there is an impact, with the worst-case p95 being ~8ms vs ~5ms for the `1000` case, but that has not been observed for the `10000` case.
It also seems that the performance penalty is tied to the number of partitions, with 4 partitions performing better than 1 when increasing `MaxPollRecords`.
I investigated Kafka consumer settings that could affect throughput. Please correct anything that seems out of order :)
The most relevant Kafka consumer settings that affect throughput are listed below (some of these seem more connected with throughput in multi-partition scenarios, which we have yet to discuss):
Setting | Kafka default (ours) | Effect |
---|---|---|
`fetch.min.bytes` | 1B | The minimum amount of bytes a client wants to receive. If the broker receives a poll request and there are fewer bytes than this setting, it will wait up to `fetch.max.wait.ms`. |
`fetch.max.wait.ms` | 500ms | The amount of time to wait for `fetch.min.bytes` to be satisfied. |
`fetch.max.bytes` | 52428800B | The maximum amount of bytes the server returns for a fetch request. This is not an absolute maximum. The consumer performs fetches in parallel, so this value does not determine the maximum bandwidth used. |
`max.partition.fetch.bytes` | 1048576B | The maximum number of bytes sent by the broker per partition. Affects memory usage and the risk of the consumer timing out (more fetched data may increase poll-loop processing time, potentially causing the consumer to time out and rebalance). |
`max.poll.records` | 500 (100) | The maximum number of records that a single poll request returns. The same processing-time considerations as for `max.partition.fetch.bytes` apply. I'm not sure if there are waiting times implied here; it seems not. Throughput improvements come from fewer network roundtrips. It may also affect memory usage due to processing more data. Note that increasing `max.poll.records` increases the risk of a timeout (in case `.poll()` is not called again before the `max.poll.interval.ms` timeout, which defaults to 5m). |
How do those settings relate to the config options in `apm-queue/kafka.ConsumerConfig`?
Kafka setting | ConsumerConfig option |
---|---|
`fetch.min.bytes` | no option available |
`fetch.max.wait.ms` | `MaxPollWait` |
`fetch.max.bytes` | `MaxPollBytes` |
`max.partition.fetch.bytes` | `MaxPollPartitionBytes` |
`max.poll.records` | `MaxPollRecords` |
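As a sketch of how the mapped options could be set on the consumer config (the field names come from the table above, but the import path, field types, and the other required config fields are assumptions or omitted):

```go
package main

import (
	"time"

	"github.com/elastic/apm-queue/kafka" // import path assumed
)

func main() {
	// MaxPollBytes/MaxPollPartitionBytes use the Kafka defaults from the
	// previous table; MaxPollRecords/MaxPollWait use values being benchmarked.
	cfg := kafka.ConsumerConfig{
		MaxPollRecords:        1000,                   // max.poll.records
		MaxPollWait:           100 * time.Millisecond, // fetch.max.wait.ms
		MaxPollBytes:          52428800,               // fetch.max.bytes
		MaxPollPartitionBytes: 1048576,                // max.partition.fetch.bytes
	}
	_ = cfg
}
```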
I ran 3 benchmarks changing these settings:
1. `MaxPollRecords=1000`, `FetchMinBytes=1`
2. `MaxPollRecords=10000`, `FetchMinBytes=1`
3. `MaxPollRecords=1000`, `FetchMinBytes=10`

(1) and (2) performed better than the tests with `MaxPollRecords=100`.
(3) performed worse than (1).
Actions taken:
- I created a PR to add more details to relevant consumer configs in apm-queue: #278
- I'm running a benchmark with `MaxPollRecords=1000`, `FetchMinBytes=10` but reducing `fetch.max.wait.ms` from `500ms` to `100ms`.
- I plan to transfer some of this into our docs.
Thanks for digging into all this @endorama!
@marclop and I discussed this a little bit yesterday, and Marc raised a good point: as long as we are using At Most Once Delivery, increasing MaxPollRecords will increase our exposure to data loss. That's because we commit the offsets before processing.
So setting to 10000 is perhaps too risky. Setting to 500-1000 would be safer -- generally we should set it to the lowest number that would satisfy throughput requirements.
To complete the picture, I ran these additional runs (I'm starting to call each one a "scenario" because the variables at play are growing).
Duration is always `300`. "Kept up" indicates whether the consumer was able to (almost) keep up with the producer (by time and consumed bytes per second): yes means consuming the events took no more time than producing them, or at most 125% of it; no means consuming the events took at least 2x as long as producing them.
Remember that consumption delay is in seconds.
Max Poll Records | Fetch Min Bytes | Max Poll Wait | partitions | kept up | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
---|---|---|---|---|---|---|---|
1000 | 1 | 500ms | 1 | yes | 2 | 4 | 4 |
1000 | 1 | 500ms | 2 | yes | 16 | 22.6274 | 22.6274 |
1000 | 1 | 500ms | 4 | no | 256 | 512 | 512 |
1000 | 10 | 500ms | 1 | yes | 2.82843 | 4 | 5.65685 |
1000 | 10 | 500ms | 2 | yes | 32 | 45.2548 | 64 |
1000 | 10 | 500ms | 4 | no | 256 | 362.039 | 362.039 |
1000 | 10 | 100ms | 1 | yes | 2.82843 | 4 | 4 |
1000 | 10 | 100ms | 2 | yes | 2.82843 | 4 | 4 |
1000 | 10 | 100ms | 4 | no | 256 | 512 | 512 |
1000 | 100 | 100ms | 1 | yes | 2.82843 | 4 | 5.65685 |
1000 | 100 | 100ms | 2 | yes | 45.2548 | 90.5097 | 90.5097 |
1000 | 100 | 100ms | 4 | no | 256 | 362.039 | 362.039 |
10000 | 1 | 500ms | 1 | yes | 0.707107 | 1 | 1.41421 |
10000 | 1 | 500ms | 2 | yes | 1.41421 | 2 | 2 |
10000 | 1 | 500ms | 4 | yes | 64 | 90.5097 | 90.5097 |
10000 | 10 | 100ms | 1 | yes | 0.353553 | 0.707107 | 1.41421 |
10000 | 10 | 100ms | 2 | yes | 1 | 1.41421 | 1.41421 |
10000 | 10 | 100ms | 4 | yes | 64 | 90.5097 | 90.5097 |
All results
scenario `MaxPollRecords=1000 FetchMinBytes=1 MaxPollWait=500ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.692624063 | 300.000008996 | 301.600596564 | 27679505 | 27679505 | 95.4 | 94.89 | 2 | 4 | 4 |
300 | 2400 | 1024 | 2 | 315.899254262 | 300.003819575 | 315.80385852 | 54699840 | 54699840 | 188.51 | 179.08 | 16 | 22.6274 | 22.6274 |
300 | 4800 | 1024 | 4 | 724.519516157 | 300.000006448 | 724.40076306 | 105248199 | 105248199 | 362.7 | 150.21 | 256 | 512 | 512 |
scenario `MaxPollRecords=1000 FetchMinBytes=10 MaxPollWait=500ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.690593174 | 300.00000779 | 301.600345584 | 27547301 | 27547301 | 94.94 | 94.44 | 2.82843 | 4 | 5.65685 |
300 | 2400 | 1024 | 2 | 345.577030678 | 300.000008109 | 345.501031635 | 54632726 | 54632726 | 188.29 | 163.49 | 32 | 45.2548 | 64 |
300 | 4800 | 1024 | 4 | 668.72929976 | 300.000008379 | 668.600260054 | 105607322 | 105607322 | 363.94 | 163.3 | 256 | 362.039 | 362.039 |
scenario `MaxPollRecords=1000 FetchMinBytes=10 MaxPollWait=100ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.878959177 | 300.001936163 | 301.80288708 | 27649339 | 27649339 | 95.29 | 94.72 | 2.82843 | 4 | 4 |
300 | 2400 | 1024 | 2 | 301.487009332 | 300.000008718 | 301.400786651 | 53773842 | 53773842 | 185.32 | 184.46 | 2.82843 | 4 | 4 |
300 | 4800 | 1024 | 4 | 716.016393495 | 300.000007006 | 715.90085789 | 106822033 | 106822033 | 368.13 | 154.26 | 256 | 512 | 512 |
scenario `MaxPollRecords=1000 FetchMinBytes=100 MaxPollWait=100ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.588041102 | 300.000009822 | 301.501001782 | 27537477 | 27537477 | 94.91 | 94.43 | 2.82843 | 4 | 5.65685 |
300 | 2400 | 1024 | 2 | 368.110207242 | 300.00000629 | 368.000307641 | 54652175 | 54652175 | 188.35 | 153.55 | 45.2548 | 90.5097 | 90.5097 |
300 | 4800 | 1024 | 4 | 633.922515154 | 300.000009003 | 633.800318506 | 105606356 | 105606356 | 363.93 | 172.26 | 256 | 362.039 | 362.039 |
scenario `MaxPollRecords=10000 FetchMinBytes=1 MaxPollWait=500ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.520444194 | 300.000009211 | 300.401036392 | 27582662 | 27582662 | 95.06 | 94.94 | 0.707107 | 1 | 1.41421 |
300 | 2400 | 1024 | 2 | 300.717424964 | 300.000008759 | 300.600821841 | 54329454 | 54329454 | 187.24 | 186.86 | 1.41421 | 2 | 2 |
300 | 4800 | 1024 | 4 | 362.237852628 | 300.000012872 | 362.100077259 | 104928793 | 104928793 | 361.59 | 299.58 | 64 | 90.5097 | 90.5097 |
scenario `MaxPollRecords=10000 FetchMinBytes=10 MaxPollWait=100ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.381111593 | 300.00085311 | 300.301392553 | 27547995 | 27547995 | 94.94 | 94.85 | 0.353553 | 0.707107 | 1.41421 |
300 | 2400 | 1024 | 2 | 300.512250301 | 300.000009904 | 300.40044246 | 54298777 | 54298777 | 187.13 | 186.88 | 1 | 1.41421 | 1.41421 |
300 | 4800 | 1024 | 4 | 358.790258513 | 300.000017024 | 358.700447427 | 104625464 | 104625464 | 360.55 | 301.55 | 64 | 90.5097 | 90.5097 |
From these results I think we can draw some conclusions:
- `MaxPollRecords=1000` enables a stable (production/consumption) throughput of ~185MB/s with 2 partitions
- `MaxPollWait=500ms` can be reduced
- `FetchMinBytes=10` helps in the scenarios with 1 or 2 partitions, less in the scenarios with 4

The sweet spot so far, also considering the concerns in the previous comment, seems to be `MaxPollRecords=1000 FetchMinBytes=10 MaxPollWait=100ms`.
Can we set a number for our throughput requirements, so we can compare these results against it?
(I'm not sure how much the fixed event size of `1024` is affecting these results, so please keep that in mind.)
I think this can be closed now, @axw what do you think? I'd open a follow-up issue to document the findings in our docs.
In the end I forgot to follow up with the docs PR, which is now up here: https://github.com/elastic/apm-managed-service/pull/619
As tables in docsmobile do not handle well all the columns needed to display this benchmark, I'm doing a recap of the raw data here. I'll link to this comment from our documentation, leveraging GitHub tables.
All scenarios were run on a `c6i.xlarge`, with a single program using 1 producer and 1 consumer to produce and consume data for the specified amount of time (`300` seconds).
More details on the Kafka settings can be found in this comment.
scenario `MaxPollRecords=100 FetchMinBytes=1 MaxPollWait=500ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 2400 | 1024 | 1 | 1338.49317934 | 300.000007736 | 1338.400658493 | 27845782 | 27845782 | 95.97 | 21.51 | 724.077 | 1024 | 1024 |
300 | 2400 | 1024 | 2 | 2244.794431333 | 300.000009827 | 2244.701624293 | 54806738 | 54806738 | 188.89 | 25.24 | 1024 | 2048 | 2048 |
300 | 3600 | 1024 | 4 | 3900.106025814 | 300.000473461 | 3900.105787347 | 104678128 | 84430900 | 360.73 | 22.4 | 2048 | 4096 | 4096 |
300 | 4800 | 1024 | 4 | 5064.633432419 | 300.000393132 | 5064.501086936 | 106683669 | 106683669 | 367.65 | 21.78 | 2896.31 | 5792.62 | 5792.62 |
scenario `MaxPollRecords=1000 FetchMinBytes=1 MaxPollWait=500ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.692624063 | 300.000008996 | 301.600596564 | 27679505 | 27679505 | 95.4 | 94.89 | 2 | 4 | 4 |
300 | 2400 | 1024 | 2 | 315.899254262 | 300.003819575 | 315.80385852 | 54699840 | 54699840 | 188.51 | 179.08 | 16 | 22.6274 | 22.6274 |
300 | 4800 | 1024 | 4 | 724.519516157 | 300.000006448 | 724.40076306 | 105248199 | 105248199 | 362.7 | 150.21 | 256 | 512 | 512 |
scenario `MaxPollRecords=1000 FetchMinBytes=10 MaxPollWait=500ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.690593174 | 300.00000779 | 301.600345584 | 27547301 | 27547301 | 94.94 | 94.44 | 2.82843 | 4 | 5.65685 |
300 | 2400 | 1024 | 2 | 345.577030678 | 300.000008109 | 345.501031635 | 54632726 | 54632726 | 188.29 | 163.49 | 32 | 45.2548 | 64 |
300 | 4800 | 1024 | 4 | 668.72929976 | 300.000008379 | 668.600260054 | 105607322 | 105607322 | 363.94 | 163.3 | 256 | 362.039 | 362.039 |
scenario `MaxPollRecords=1000 FetchMinBytes=10 MaxPollWait=100ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.878959177 | 300.001936163 | 301.80288708 | 27649339 | 27649339 | 95.29 | 94.72 | 2.82843 | 4 | 4 |
300 | 2400 | 1024 | 2 | 301.487009332 | 300.000008718 | 301.400786651 | 53773842 | 53773842 | 185.32 | 184.46 | 2.82843 | 4 | 4 |
300 | 4800 | 1024 | 4 | 716.016393495 | 300.000007006 | 715.90085789 | 106822033 | 106822033 | 368.13 | 154.26 | 256 | 512 | 512 |
scenario `MaxPollRecords=1000 FetchMinBytes=100 MaxPollWait=100ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.588041102 | 300.000009822 | 301.501001782 | 27537477 | 27537477 | 94.91 | 94.43 | 2.82843 | 4 | 5.65685 |
300 | 2400 | 1024 | 2 | 368.110207242 | 300.00000629 | 368.000307641 | 54652175 | 54652175 | 188.35 | 153.55 | 45.2548 | 90.5097 | 90.5097 |
300 | 4800 | 1024 | 4 | 633.922515154 | 300.000009003 | 633.800318506 | 105606356 | 105606356 | 363.93 | 172.26 | 256 | 362.039 | 362.039 |
scenario `MaxPollRecords=10000 FetchMinBytes=1 MaxPollWait=500ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.520444194 | 300.000009211 | 300.401036392 | 27582662 | 27582662 | 95.06 | 94.94 | 0.707107 | 1 | 1.41421 |
300 | 2400 | 1024 | 2 | 300.717424964 | 300.000008759 | 300.600821841 | 54329454 | 54329454 | 187.24 | 186.86 | 1.41421 | 2 | 2 |
300 | 4800 | 1024 | 4 | 362.237852628 | 300.000012872 | 362.100077259 | 104928793 | 104928793 | 361.59 | 299.58 | 64 | 90.5097 | 90.5097 |
scenario `MaxPollRecords=10000 FetchMinBytes=10 MaxPollWait=100ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.381111593 | 300.00085311 | 300.301392553 | 27547995 | 27547995 | 94.94 | 94.85 | 0.353553 | 0.707107 | 1.41421 |
300 | 2400 | 1024 | 2 | 300.512250301 | 300.000009904 | 300.40044246 | 54298777 | 54298777 | 187.13 | 186.88 | 1 | 1.41421 | 1.41421 |
300 | 4800 | 1024 | 4 | 358.790258513 | 300.000017024 | 358.700447427 | 104625464 | 104625464 | 360.55 | 301.55 | 64 | 90.5097 | 90.5097 |