elastic/apm-queue

Benchmarking: Collect throughput metrics for AWS MSK

Closed this issue · 22 comments

Description

We want to test and benchmark producing and consuming against Kafka. The goal is to establish a throughput baseline, and we'll use the cloud provider AWS MSK for this.

We MUST collect throughput and usage metrics.

Design

To keep things simple, we could start by creating a single command under cmd/queuebench. We may retrofit the functionality into testing.B, but it'll be quicker and less cumbersome to start with a single command that runs a single benchmark for some time and prints the results.

We'll run a single producer and a single consumer for the benchmark, and collect the metrics using a manual OTLP reader.

https://github.com/elastic/apm-queue/blob/63f6f01bf68a76aad5fd2f7bd44af0f7519f8e16/kafka/metrics_test.go#L323C2-L323C2

After the benchmark has finished, we'll collect the metrics we want to print and perform any calculations (such as events/s and bytes/s).
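For illustration, a minimal sketch of wiring the manual reader and collecting once at the end (instrument plumbing is omitted; this is not the actual queuebench code):

```go
package main

import (
	"context"
	"fmt"

	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func main() {
	ctx := context.Background()

	// The producer and consumer are constructed with this MeterProvider so
	// their instruments end up in the manual reader.
	reader := sdkmetric.NewManualReader()
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(reader))
	defer provider.Shutdown(ctx)

	// ... run the benchmark ...

	// Collect once after the benchmark and post-process into events/s,
	// bytes/s, error rate, and latency percentiles.
	var rm metricdata.ResourceMetrics
	if err := reader.Collect(ctx, &rm); err != nil {
		panic(err)
	}
	for _, scope := range rm.ScopeMetrics {
		for _, m := range scope.Metrics {
			fmt.Println(m.Name)
		}
	}
}
```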

Setup

The idea is to run the benchmarks against a single topic (for now) and a single partition (we can expose a -partitions flag that allows setting the number of partitions for the topic, but default to 1 for now). The command will (see the sketch after the list):

  1. Create a unique topic name with the specified number of partitions (-partitions).
  2. Generate N events with the specified byte size (-event-size).
  3. Create a manual reader for OTLP Exporter.
  4. Create a consumer and start consuming in a goroutine.
  5. While the timer hasn't expired, produce events.
  6. Stop the producer.
  7. Stop the consumer after all events have been consumed.
  8. Collect metrics and perform any necessary calculations to print the results.
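
A rough sketch of that flow, using the flag names from the Inputs section below; createTopic, produce, and consume are placeholders rather than the real apm-queue API:

```go
// Sketch of the cmd/queuebench flow; helper functions are placeholders.
package main

import (
	"context"
	"crypto/rand"
	"flag"
	"fmt"
	"time"
)

var (
	eventSize  = flag.Int("event-size", 1024, "event size in bytes")
	partitions = flag.Int("partitions", 1, "number of partitions for the benchmark topic")
	duration   = flag.Duration("duration", 5*time.Minute, "how long to produce for")
)

func main() {
	flag.Parse()
	ctx := context.Background()

	// 1. Unique topic with the requested number of partitions.
	topic := fmt.Sprintf("queuebench-%d", time.Now().UnixNano())
	createTopic(ctx, topic, *partitions)

	// 2. Event payload of -event-size random bytes.
	payload := make([]byte, *eventSize)
	if _, err := rand.Read(payload); err != nil {
		panic(err)
	}

	// 3. Metrics setup (manual reader) omitted here; see the Design snippet above.

	// 4. Consume in a goroutine.
	go consume(ctx, topic)

	// 5. Produce until the timer expires.
	deadline := time.Now().Add(*duration)
	for time.Now().Before(deadline) {
		produce(ctx, topic, payload)
	}

	// 6-8. Stop the producer, wait for the consumer to drain, then collect
	// metrics and print throughput/latency results.
}

func createTopic(ctx context.Context, topic string, partitions int) {}
func produce(ctx context.Context, topic string, payload []byte)     {}
func consume(ctx context.Context, topic string)                     {}
```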

Inputs

  • -event-size in bytes, defaults to 1024 (1KB).
  • -partitions defaults to 1 (keep producing to a single topic for now).
  • -duration the amount of time to run the benchmark for.

Output (derived from metrics)

For now, we can just output the metrics in a table format. We may change this over time as we run these benchmarks periodically and in CI.

  • Bytes/s
  • Events/s
  • Produce error rate in %.
  • Consumption latency. We could print p50, p90, p99.
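
Assuming the consumption latency is recorded as an explicit-bucket histogram (which the consumer metrics later turn out to be), the percentiles have to be approximated from bucket counts and boundaries. A rough sketch of that calculation (with OTel metrics the inputs would come from the histogram data point's bounds and bucket counts; this is not the actual implementation):

```go
package main

import "fmt"

// approxPercentile returns the upper bound of the explicit-bucket histogram
// bucket containing the q-th quantile (0 < q <= 1). bounds has len(counts)-1
// entries; counts[i] is the number of observations in bucket i.
func approxPercentile(bounds []float64, counts []uint64, q float64) float64 {
	var total uint64
	for _, c := range counts {
		total += c
	}
	if total == 0 {
		return 0
	}
	target := uint64(q * float64(total))
	var cumulative uint64
	for i, c := range counts {
		cumulative += c
		if cumulative >= target {
			if i < len(bounds) {
				return bounds[i] // upper bound of the bucket holding the quantile
			}
			// Overflow bucket: all we can say is "above the last finite bound".
			return bounds[len(bounds)-1]
		}
	}
	return bounds[len(bounds)-1]
}

func main() {
	// Buckets: (<=1], (1-2.5], (2.5-5], (>5], e.g. seconds of consumer delay.
	bounds := []float64{1, 2.5, 5}
	counts := []uint64{10, 50, 30, 10}
	fmt.Println(approxPercentile(bounds, counts, 0.5)) // 2.5
	fmt.Println(approxPercentile(bounds, counts, 0.9)) // 5
}
```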
simitt commented

@marclop @axw could you narrow down the expectations for this task a bit more and add some definition of what the benchmark setup should look like?

Adding some questions here:

  • do we have tooling to generate data?
  • I expect we are not yet collecting all the needed metrics for this benchmark, is that correct?
  • Do we store those metrics long term?
  • How should this benchmark system be monitored?

@endorama I've updated the issue to include more details. It was very brief before and contained incorrect metrics to collect.

do we have tooling to generate data?

My thinking is that we can keep that as a variable and just generate random bytes to produce and have the size be user-specified.

I expect we are not yet collecting all the needed metrics for this benchmark, is that correct?

After updating the metrics to output, we're recording all the metrics we need to print.

Do we store those metrics long term?

Not for now. We may store the output as .txt files in the future when the benchmarks are set up to run in CI.

How should this benchmark system be monitored?

I'm not sure I understand the question. The current task only covers the implementation of a command to run benchmarks against an AWS MSK Serverless cluster. We're not going to be working on the CI bit just yet, so I think this doesn't apply.

axw commented

I think we should collect some numbers before closing this. The tooling is there, now we need to use it.

I've run the benchmark and collected some results. I also found a possible issue that I'll report later.

NOTE: please refer to these values instead of the ones in this comment

Benchmark results

Environment:

  • an MSK cluster created by this terraform code (within the msk module in infra/aws) using AWS MSK default configurations
  • an EC2 t2.micro instance running queuebench, with a production duration of 5m and a timeout of 25m (since instance size matters, is there any guidance on which instance size to use?)

Out of 3 runs:

  • produced and consumed an average total of 27,405,450.67 ±0.004% events
  • produced an average of
    • 91,352 ±0.004% events per second
    • 94,451,547.28 ±0.004% bytes per second
  • consumed an average of
    • 19,717 ±3.88% events per second
    • 20,386,550.88 ±3.88% bytes per second

Production duration was around 5m: average 300.000015s ±0%.
Consumption duration was around 22m: average 1401.789669s ±3.64%.

Percentiles for consumer delay:

  • p50 is between 500ms and 750ms in all 3 cases
  • p90 is between 750ms and 1000ms in 1 case and between 1000ms and 2500ms in 2 cases
  • p95 is between 750ms and 1000ms in 1 case and between 1000ms and 2500ms in 2 cases

These results come with a caveat: consumer.messages.delay is a Float64Histogram, so we have limited resolution in the final data. That's why the values above read "is between": those are the histogram bucket boundaries.
The default boundaries for the histogram are

0 5 10 25 50 75 100 250 500 750 1000 2500 5000 7500 10000

Looking at the results and the distribution of latency in the histogram, I suspect most of our tests will always give a p90 and p95 of 1000ms or 2500ms, as roughly 40% of the measurements fall in that area (unless benchmark conditions change or relevant performance improvements land).

We can be ok with this, but I don't think this provides enough resolution.

The easiest solution would be to change the histogram boundaries in queuebench. I discussed this with @dmathieu and there is a caveat: histogram buckets in OTel are global, and we are setting specific values in apmotel (as defined by the Elastic APM specs), so we may not be able to use the same boundaries in production and in benchmarks. I'm not sure how relevant this is.
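
If we go the boundary-change route, scoping it to a single instrument via an SDK view should keep the change local to queuebench. A hedged sketch (instrument name taken from above; the exact view/aggregation API depends on the OTel SDK version, and the boundaries are purely illustrative):

```go
package main

import (
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// newBenchMeterProvider overrides the histogram buckets for the
// consumer.messages.delay instrument only, leaving apmotel defaults untouched.
func newBenchMeterProvider(reader sdkmetric.Reader) *sdkmetric.MeterProvider {
	return sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(reader),
		sdkmetric.WithView(sdkmetric.NewView(
			sdkmetric.Instrument{Name: "consumer.messages.delay"},
			sdkmetric.Stream{
				Aggregation: sdkmetric.AggregationExplicitBucketHistogram{
					// Illustrative, finer-grained boundaries (same unit as the instrument).
					Boundaries: []float64{0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 3, 5, 7.5, 10, 25, 50, 100},
				},
			},
		)),
	)
}
```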

Another solution would be to not rely on OTel-collected metrics and use a different approach (either extracting metrics from traces or something custom).

@endorama Thanks for posting the results.

an EC2 t2.micro instance running queuebench, with production duration of 5m and timeout 25m (as instance size matters, is there any guidance on the instance size to use?)

I would steer clear of tx.xxx instance types. They rely heavily on CPU credits as soon as any CPU is required. Also, a t2.micro instance will not have a lot of hardware resources assigned to it.

Instead, I'd use general-purpose hardware that's a bit more powerful, where network, CPU, memory, or disk I/O won't be a constraint. An m6g.large or m7g.large seems like a good fit to begin with.

Is the same instance used to produce and consume?

Is the same instance used to produce and consume?

Yes

I would steer clear of tx.xxx instance types.

Indeed, I already ran the tests on a c6i.xlarge, which is the same instance type used by apmbench: https://github.com/elastic/apm-server/blob/e2c2ed056f3bfa2cd8b2c6a218fa700c34552cf4/testing/infra/terraform/modules/benchmark_executor/variables.tf#L15-L19

An m6g.large or m7g.large seems like a good fit to begin with.

Any particular reason to choose this instance type?

Reporting benchmark run on a c6i.xlarge, duration: 5m, timeout: 25m, event-size: 1024, partitions: 1.

  • production
    • produced 27533659 events in 300.000631471 seconds
    • produced events per second: 91778.6701481046
    • produced MB per second: 94.89341383520357
  • consumption
    • consumed 27533659 events in 925.301167017 seconds
    • consumed events per second: 29756.429562024037
    • consumed MB per second: 30.76629003373663
  • error rate
    • not consumed: 0 (0.00%)
  • consumption latency (seconds)
    • p50: 362.039
    • p90: 724.077
    • p99: 724.077
axw commented

@marclop FWIW @endorama and I discussed the burstiness of t2 last week, and I suggested that, given the low variance in the results, they should be reliable. If bursting had impacted the results, we would expect to see higher variance.

Looking at the latest numbers, it seems that was fair for producing but not so much for consuming.

Thanks for posting the results! Seems like producing to a topic with 1 partition has pretty good throughput. Given how the number of partitions controls the maximum concurrency for a given topic, I'd be interested to see the throughput with 2 and 4 partitions and see how that compares.

Is there a way to run this centrally? Or are you running them from your workstation?

I'm running them from a dedicated EC2 machine (infra setup in https://github.com/elastic/apm-managed-service/pull/274). I don't know if we have a team-shared EC2 key pair; alternatively, we could add team members' SSH keys to a cloud-init script for the instance.

I'd be interested to see the throughput with 2 and 4 partitions and see how that compares.

Reporting benchmark run on a c6i.xlarge, 1 producer and 1 consumer with different partitions: 1, 2, 4.

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 2400 | 1024 | 1 | 1338.49317934 | 300.000007736 | 1338.400658493 | 27845782 | 27845782 | 95.97 | 21.51 | 724.077 | 1024 | 1024 |
| 300 | 2400 | 1024 | 2 | 2244.794431333 | 300.000009827 | 2244.701624293 | 54806738 | 54806738 | 188.89 | 25.24 | 1024 | 2048 | 2048 |
| 300 | 3600 | 1024 | 4 | 3900.106025814 | 300.000473461 | 3900.105787347 | 104678128 | 84430900 | 360.73 | 22.4 | 2048 | 4096 | 4096 |
| 300 | 4800 | 1024 | 4 | 5064.633432419 | 300.000393132 | 5064.501086936 | 106683669 | 106683669 | 367.65 | 21.78 | 2896.31 | 5792.62 | 5792.62 |

I had to increase the timeout when increasing partitions, as production throughput grows a lot and consumption needs more time to keep up. The test with 4 partitions wasn't able to consume all produced records and timed out (after 1h). I included the result since we were discussing production; in the meantime I'm running the test again with a higher timeout.

Are we interested in running the benchmark with different producer/consumer configurations? (e.g. 1 consumer per partition)

Updated table: added the 4-partition run with timeout 4800, which completed successfully.

axw commented

@endorama Thanks! The producer numbers look good. Regarding consumer concurrency: I think it would be good to first figure out what the limit for one consumer is - we should be able to do better than 25MB/s.

Could you try increasing MaxPollRecords? Default is 100, so let's try 1000 to see the impact - https://github.com/elastic/apm-queue/blob/70febd6d78e11225cf887d7627993d215758571a/kafka/consumer.go#L54C2-L54C16

We could also try increasing https://pkg.go.dev/github.com/twmb/franz-go/pkg/kgo#FetchMinBytes for higher throughput. I don't think it should make a difference in the benchmarks, as the producer is outpacing the consumer quite significantly, so there should always be more than the minimum amount to fetch.
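
For context, a sketch of where these knobs sit when using franz-go directly (broker, topic, and group values are placeholders; whether queuebench sets them through apm-queue or straight through kgo is an implementation detail):

```go
package main

import (
	"context"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	client, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
		kgo.ConsumeTopics("queuebench-topic"),
		kgo.ConsumerGroup("queuebench"),
		kgo.FetchMinBytes(10),                  // fetch.min.bytes
		kgo.FetchMaxWait(100*time.Millisecond), // fetch.max.wait.ms
	)
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// MaxPollRecords corresponds to the cap passed to PollRecords.
	fetches := client.PollRecords(context.Background(), 1000)
	_ = fetches
}
```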

Reporting benchmark run on a c6i.xlarge, 1 producer and 1 consumer with different partitions: 1, 2, 4, Kafka consumer MaxPollRecords set to 1000.

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 301.692624063 | 300.000008996 | 301.600596564 | 27679505 | 27679505 | 95.4 | 94.89 | 2 | 4 | 4 |
| 300 | 2400 | 1024 | 2 | 315.899254262 | 300.003819575 | 315.80385852 | 54699840 | 54699840 | 188.51 | 179.08 | 16 | 22.6274 | 22.6274 |
| 300 | 4800 | 1024 | 4 | 724.519516157 | 300.000006448 | 724.40076306 | 105248199 | 105248199 | 362.7 | 150.21 | 256 | 512 | 512 |

I also tried increasing FetchMinBytes to 10.

Reporting benchmark run on a c6i.xlarge, 1 producer and 1 consumer with different partitions: 1, 2, 4, Kafka consumer MaxPollRecords set to 1000, FetchMinBytes(10).

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 301.690593174 | 300.00000779 | 301.600345584 | 27547301 | 27547301 | 94.94 | 94.44 | 2.82843 | 4 | 5.65685 |
| 300 | 2400 | 1024 | 2 | 345.577030678 | 300.000008109 | 345.501031635 | 54632726 | 54632726 | 188.29 | 163.49 | 32 | 45.2548 | 64 |
| 300 | 4800 | 1024 | 4 | 668.72929976 | 300.000008379 | 668.600260054 | 105607322 | 105607322 | 363.94 | 163.3 | 256 | 362.039 | 362.039 |

I also ran the tests with MaxPollRecords set to 10000.

Reporting benchmark run on a c6i.xlarge, 1 producer and 1 consumer with different partitions: 1, 2, 4, Kafka consumer MaxPollRecords set to 10000.

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 300.520444194 | 300.000009211 | 300.401036392 | 27582662 | 27582662 | 95.06 | 94.94 | 0.707107 | 1 | 1.41421 |
| 300 | 2400 | 1024 | 2 | 300.717424964 | 300.000008759 | 300.600821841 | 54329454 | 54329454 | 187.24 | 186.86 | 1.41421 | 2 | 2 |
| 300 | 4800 | 1024 | 4 | 362.237852628 | 300.000012872 | 362.100077259 | 104928793 | 104928793 | 361.59 | 299.58 | 64 | 90.5097 | 90.5097 |

Recently I discussed these benchmark results with @simitt and we had two questions:

  1. whether the consumption delay is reported in ms or s
  2. whether increasing MaxPollRecords affects low-throughput cases

Consumption delay is reported in seconds (I verified this by running a benchmark with a single produced record, so the histogram bucket the consumption delay falls into is comparable to the recorded consumption duration).

To verify whether low-throughput cases are negatively impacted by increasing MaxPollRecords, I changed the benchmark production logic to produce only 1 event per second for the specified duration. I ran the benchmark with MaxPollRecords=100 (the default), MaxPollRecords=1000, and MaxPollRecords=10000.
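
The change amounts to pacing production with a ticker instead of a tight loop, roughly like this (a sketch; the produce callback stands in for the real producer call):

```go
package main

import (
	"context"
	"time"
)

// produceOneEventPerSecond paces production at 1 event/s until d has elapsed
// or the context is cancelled.
func produceOneEventPerSecond(ctx context.Context, d time.Duration, produce func(context.Context)) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	timeout := time.After(d)
	for {
		select {
		case <-timeout:
			return
		case <-ctx.Done():
			return
		case <-ticker.C:
			produce(ctx)
		}
	}
}
```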

Here are the results

With MaxPollRecords=100:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 300.183734152 | 300.000530789 | 300.101155856 | 300 | 300 | 0 | 0 | 0.00390625 | 0.00552427 | 0.00552427 |
| 300 | 2400 | 1024 | 2 | 300.245170568 | 300.000985359 | 300.102103178 | 300 | 300 | 0 | 0 | 0.00390625 | 0.00390625 | 0.00390625 |
| 300 | 4800 | 1024 | 4 | 300.226847042 | 300.000155446 | 300.100669628 | 300 | 300 | 0 | 0 | 0.00552427 | 0.00552427 | 0.00552427 |

With MaxPollRecords=1000:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 300.208345594 | 300.000716583 | 300.100997146 | 300 | 300 | 0 | 0 | 0.00552427 | 0.0078125 | 0.0078125 |
| 300 | 2400 | 1024 | 2 | 300.175919277 | 300.000593529 | 300.100772963 | 300 | 300 | 0 | 0 | 0.00552427 | 0.00552427 | 0.0078125 |
| 300 | 4800 | 1024 | 4 | 300.214639527 | 300.000472654 | 300.101007325 | 300 | 300 | 0 | 0 | 0.00552427 | 0.0078125 | 0.0078125 |

With MaxPollRecords=10000:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 300.191874379 | 300.00073932 | 300.101061813 | 300 | 300 | 0 | 0 | 0.0078125 | 0.0078125 | 0.0078125 |
| 300 | 2400 | 1024 | 2 | 300.231933353 | 300.00022502 | 300.101230581 | 300 | 300 | 0 | 0 | 0.00552427 | 0.00552427 | 0.00552427 |
| 300 | 4800 | 1024 | 4 | 300.226024851 | 300.000247006 | 300.100921924 | 300 | 300 | 0 | 0 | 0.00552427 | 0.00552427 | 0.00552427 |

Note that 0 produced/consumed MB/s is a rounding artifact: the actual value is smaller than 0.00x.

It seems there is an impact, with the worst-case p95 being ~8ms vs ~5ms for the 1000 case, but that has not been observed for the 10000 case.
It seems that the performance penalty is also tied to the number of partitions, with 4 partitions performing better than 1 when increasing MaxPollRecords.

I investigated Kafka consumer settings that could affect throughput. Please correct anything that seems out of order :)

The most relevant Kafka consumer settings that affect throughput are listed below (some of these seem more related to throughput in multi-partition scenarios, which we have yet to discuss):

| Setting | Kafka default (ours) | Effect |
| --- | --- | --- |
| fetch.min.bytes | 1B | The minimum number of bytes a client wants to receive. If the broker receives a fetch request and fewer bytes than this are available, it waits up to fetch.max.wait.ms. |
| fetch.max.wait.ms | 500ms | The amount of time the broker waits for fetch.min.bytes to be satisfied. |
| fetch.max.bytes | 52428800B | The maximum number of bytes the server returns for a fetch request. This is not an absolute maximum. The consumer performs fetches in parallel, so this value does not determine the maximum bandwidth used. |
| max.partition.fetch.bytes | 1048576B | The maximum number of bytes sent by the broker per partition. Affects memory usage and the risk of the consumer timing out (more fetched data can increase poll-loop processing time, potentially causing the consumer to time out and trigger a rebalance). |
| max.poll.records | 500 (100) | The maximum number of records that a single poll request returns. The same processing-time considerations as for max.partition.fetch.bytes apply. I'm not sure whether there are waiting times implied here; it seems not. Throughput improvements come from fewer network round trips. May also increase memory usage due to processing more data. Note that increasing max.poll.records increases the risk of a timeout (in case .poll() is not called again before the max.poll.interval.ms timeout, which defaults to 5m). |

How do those settings relate to the config options in apm-queue/kafka.ConsumerConfig? (A sketch follows the table.)

| Kafka setting | ConsumerConfig option |
| --- | --- |
| fetch.min.bytes | no option available |
| fetch.max.wait.ms | MaxPollWait |
| fetch.max.bytes | MaxPollBytes |
| max.partition.fetch.bytes | MaxPollPartitionBytes |
| max.poll.records | MaxPollRecords |
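
To make the mapping concrete, a rough sketch of a consumer config tuned through these options (field types, the import path, and any required fields are assumptions based on the table above, so this may not match the actual kafka.ConsumerConfig exactly):

```go
package main

import (
	"time"

	// Import path assumed from the repository layout (kafka/ package).
	"github.com/elastic/apm-queue/kafka"
)

// benchConsumerConfig shows only the throughput-related knobs from the table;
// brokers, topics, group, and processor configuration are elided.
func benchConsumerConfig() kafka.ConsumerConfig {
	return kafka.ConsumerConfig{
		MaxPollWait:           100 * time.Millisecond, // fetch.max.wait.ms
		MaxPollBytes:          52428800,               // fetch.max.bytes (Kafka default)
		MaxPollPartitionBytes: 1048576,                // max.partition.fetch.bytes (Kafka default)
		MaxPollRecords:        1000,                   // max.poll.records
	}
}
```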

I ran 3 benchmarks changing these settings:

  1. MaxPollRecords=1000, FetchMinBytes=1
  2. MaxPollRecords=10000, FetchMinBytes=1
  3. MaxPollRecords=1000, FetchMinBytes=10

(1) and (2) performed better than tests with MaxPollRecords=100.
(3) performed worse than (1).

Action taken:

  • I created a PR to add more details to relevant consumer configs in apm-queue: #278
  • I'm running a benchmark with MaxPollRecords=1000, FetchMinBytes=10 but reducing fetch.max.wait.ms from 500ms to 100ms.
  • I plan to transfer some of this into our docs
axw commented

Thanks for digging into all this @endorama!

@marclop and I discussed this a little bit yesterday, and Marc raised a good point: as long as we are using At Most Once Delivery, increasing MaxPollRecords will increase our exposure to data loss. That's because we commit the offsets before processing.

So setting to 10000 is perhaps too risky. Setting to 500-1000 would be safer -- generally we should set it to the lowest number that would satisfy throughput requirements.

To complete the picture, I ran these additional runs (I'm starting to call each one a "scenario" because the number of variables at play is growing).
Duration is always 300. "Kept up" indicates whether the consumer was able to (almost) keep up with the producer (by time and consumed bytes per second): yes means consuming the events took the same time as producing them, or no more than 125% of it; no means it took at least 2x.
Remember that consumption delay is in seconds.

| Max Poll Records | Fetch Min Bytes | Max Poll Wait | partitions | kept up | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1000 | 1 | 500ms | 1 | ✅ | 2 | 4 | 4 |
| 1000 | 1 | 500ms | 2 | ✅ | 16 | 22.6274 | 22.6274 |
| 1000 | 1 | 500ms | 4 | ❌ | 256 | 512 | 512 |
| 1000 | 10 | 500ms | 1 | ✅ | 2.82843 | 4 | 5.65685 |
| 1000 | 10 | 500ms | 2 | ✅ | 32 | 45.2548 | 64 |
| 1000 | 10 | 500ms | 4 | ❌ | 256 | 362.039 | 362.039 |
| 1000 | 10 | 100ms | 1 | ✅ | 2.82843 | 4 | 4 |
| 1000 | 10 | 100ms | 2 | ✅ | 2.82843 | 4 | 4 |
| 1000 | 10 | 100ms | 4 | ❌ | 256 | 512 | 512 |
| 1000 | 100 | 100ms | 1 | ✅ | 2.82843 | 4 | 5.65685 |
| 1000 | 100 | 100ms | 2 | ✅ | 45.2548 | 90.5097 | 90.5097 |
| 1000 | 100 | 100ms | 3 | ❌ | 256 | 362.039 | 362.039 |
| 10000 | 1 | 500ms | 1 | ✅ | 0.707107 | 1 | 1.41421 |
| 10000 | 1 | 500ms | 2 | ✅ | 1.41421 | 2 | 2 |
| 10000 | 1 | 500ms | 4 | ✅ | 64 | 90.5097 | 90.5097 |
| 10000 | 10 | 100ms | 1 | ✅ | 0.353553 | 0.707107 | 1.41421 |
| 10000 | 10 | 100ms | 2 | ✅ | 1 | 1.41421 | 1.41421 |
| 10000 | 10 | 100ms | 4 | ✅ | 64 | 90.5097 | 90.5097 |
All results

scenario MaxPollRecords=1000 FetchMinBytes=1 MaxPollWait: 500ms:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 301.692624063 | 300.000008996 | 301.600596564 | 27679505 | 27679505 | 95.4 | 94.89 | 2 | 4 | 4 |
| 300 | 2400 | 1024 | 2 | 315.899254262 | 300.003819575 | 315.80385852 | 54699840 | 54699840 | 188.51 | 179.08 | 16 | 22.6274 | 22.6274 |
| 300 | 4800 | 1024 | 4 | 724.519516157 | 300.000006448 | 724.40076306 | 105248199 | 105248199 | 362.7 | 150.21 | 256 | 512 | 512 |

scenario MaxPollRecords=1000 FetchMinBytes=10 MaxPollWait: 500ms:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 301.690593174 | 300.00000779 | 301.600345584 | 27547301 | 27547301 | 94.94 | 94.44 | 2.82843 | 4 | 5.65685 |
| 300 | 2400 | 1024 | 2 | 345.577030678 | 300.000008109 | 345.501031635 | 54632726 | 54632726 | 188.29 | 163.49 | 32 | 45.2548 | 64 |
| 300 | 4800 | 1024 | 4 | 668.72929976 | 300.000008379 | 668.600260054 | 105607322 | 105607322 | 363.94 | 163.3 | 256 | 362.039 | 362.039 |

scenario MaxPollRecords=1000 FetchMinBytes=10 MaxPollWait: 100ms:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 301.878959177 | 300.001936163 | 301.80288708 | 27649339 | 27649339 | 95.29 | 94.72 | 2.82843 | 4 | 4 |
| 300 | 2400 | 1024 | 2 | 301.487009332 | 300.000008718 | 301.400786651 | 53773842 | 53773842 | 185.32 | 184.46 | 2.82843 | 4 | 4 |
| 300 | 4800 | 1024 | 4 | 716.016393495 | 300.000007006 | 715.90085789 | 106822033 | 106822033 | 368.13 | 154.26 | 256 | 512 | 512 |

scenario MaxPollRecords=1000 FetchMinBytes=100 MaxPollWait: 100ms:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 301.588041102 | 300.000009822 | 301.501001782 | 27537477 | 27537477 | 94.91 | 94.43 | 2.82843 | 4 | 5.65685 |
| 300 | 2400 | 1024 | 2 | 368.110207242 | 300.00000629 | 368.000307641 | 54652175 | 54652175 | 188.35 | 153.55 | 45.2548 | 90.5097 | 90.5097 |
| 300 | 4800 | 1024 | 4 | 633.922515154 | 300.000009003 | 633.800318506 | 105606356 | 105606356 | 363.93 | 172.26 | 256 | 362.039 | 362.039 |

scenario MaxPollRecords=10000 FetchMinBytes=1 MaxPollWait=500ms:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 300.520444194 | 300.000009211 | 300.401036392 | 27582662 | 27582662 | 95.06 | 94.94 | 0.707107 | 1 | 1.41421 |
| 300 | 2400 | 1024 | 2 | 300.717424964 | 300.000008759 | 300.600821841 | 54329454 | 54329454 | 187.24 | 186.86 | 1.41421 | 2 | 2 |
| 300 | 4800 | 1024 | 4 | 362.237852628 | 300.000012872 | 362.100077259 | 104928793 | 104928793 | 361.59 | 299.58 | 64 | 90.5097 | 90.5097 |

scenario MaxPollRecords=10000 FetchMinBytes=10 MaxPollWait: 100ms:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 300.381111593 | 300.00085311 | 300.301392553 | 27547995 | 27547995 | 94.94 | 94.85 | 0.353553 | 0.707107 | 1.41421 |
| 300 | 2400 | 1024 | 2 | 300.512250301 | 300.000009904 | 300.40044246 | 54298777 | 54298777 | 187.13 | 186.88 | 1 | 1.41421 | 1.41421 |
| 300 | 4800 | 1024 | 4 | 358.790258513 | 300.000017024 | 358.700447427 | 104625464 | 104625464 | 360.55 | 301.55 | 64 | 90.5097 | 90.5097 |

From these results I think we can draw some conclusions:

  • MaxPollRecords=1000 enables a stable (production/consumption) throughput of ~185MB/s with 2 partitions
  • MaxPollWait=500ms can be reduced
  • FetchMinBytes=10 helps in the scenarios with 1 or 2 partitions, less in the scenarios with 4

The sweet spot so far, also considering the concerns in the previous comment, seems to be MaxPollRecords=1000, FetchMinBytes=10, MaxPollWait=100ms.

Can we set a number for our throughput requirements so that we can compare these results against it?
(I'm not sure how much the fixed event size of 1024 is affecting these results, so please keep that in mind).

I think this can be closed now, @axw what do you think? I'd open a follow-up issue to document the findings in our docs.

axw commented

Yep, let's close it. Thanks for digging into it @endorama!

In the end I forgot to follow up with the docs PR, which is now up here: https://github.com/elastic/apm-managed-service/pull/619

As tables in docsmobile do not handle all the columns needed to display this benchmark well, I'm recapping the raw data here. I'll link to this comment from our documentation, leveraging GitHub tables.

All scenarios were run on a c6i.xlarge with a single program using 1 producer and 1 consumer to produce and consume data for the specified amount of time (300 seconds).

More details on Kafka settings can be found in this comment.

scenario MaxPollRecords=100 FetchMinBytes=1 MaxPollWait: 500ms:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 2400 | 1024 | 1 | 1338.49317934 | 300.000007736 | 1338.400658493 | 27845782 | 27845782 | 95.97 | 21.51 | 724.077 | 1024 | 1024 |
| 300 | 2400 | 1024 | 2 | 2244.794431333 | 300.000009827 | 2244.701624293 | 54806738 | 54806738 | 188.89 | 25.24 | 1024 | 2048 | 2048 |
| 300 | 3600 | 1024 | 4 | 3900.106025814 | 300.000473461 | 3900.105787347 | 104678128 | 84430900 | 360.73 | 22.4 | 2048 | 4096 | 4096 |
| 300 | 4800 | 1024 | 4 | 5064.633432419 | 300.000393132 | 5064.501086936 | 106683669 | 106683669 | 367.65 | 21.78 | 2896.31 | 5792.62 | 5792.62 |

scenario MaxPollRecords=1000 FetchMinBytes=1 MaxPollWait: 500ms:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 301.692624063 | 300.000008996 | 301.600596564 | 27679505 | 27679505 | 95.4 | 94.89 | 2 | 4 | 4 |
| 300 | 2400 | 1024 | 2 | 315.899254262 | 300.003819575 | 315.80385852 | 54699840 | 54699840 | 188.51 | 179.08 | 16 | 22.6274 | 22.6274 |
| 300 | 4800 | 1024 | 4 | 724.519516157 | 300.000006448 | 724.40076306 | 105248199 | 105248199 | 362.7 | 150.21 | 256 | 512 | 512 |

scenario MaxPollRecords=1000 FetchMinBytes=10 MaxPollWait: 500ms:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 301.690593174 | 300.00000779 | 301.600345584 | 27547301 | 27547301 | 94.94 | 94.44 | 2.82843 | 4 | 5.65685 |
| 300 | 2400 | 1024 | 2 | 345.577030678 | 300.000008109 | 345.501031635 | 54632726 | 54632726 | 188.29 | 163.49 | 32 | 45.2548 | 64 |
| 300 | 4800 | 1024 | 4 | 668.72929976 | 300.000008379 | 668.600260054 | 105607322 | 105607322 | 363.94 | 163.3 | 256 | 362.039 | 362.039 |

scenario MaxPollRecords=1000 FetchMinBytes=10 MaxPollWait: 100ms:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 301.878959177 | 300.001936163 | 301.80288708 | 27649339 | 27649339 | 95.29 | 94.72 | 2.82843 | 4 | 4 |
| 300 | 2400 | 1024 | 2 | 301.487009332 | 300.000008718 | 301.400786651 | 53773842 | 53773842 | 185.32 | 184.46 | 2.82843 | 4 | 4 |
| 300 | 4800 | 1024 | 4 | 716.016393495 | 300.000007006 | 715.90085789 | 106822033 | 106822033 | 368.13 | 154.26 | 256 | 512 | 512 |

scenario MaxPollRecords=1000 FetchMinBytes=100 MaxPollWait: 100ms:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 301.588041102 | 300.000009822 | 301.501001782 | 27537477 | 27537477 | 94.91 | 94.43 | 2.82843 | 4 | 5.65685 |
| 300 | 2400 | 1024 | 2 | 368.110207242 | 300.00000629 | 368.000307641 | 54652175 | 54652175 | 188.35 | 153.55 | 45.2548 | 90.5097 | 90.5097 |
| 300 | 4800 | 1024 | 4 | 633.922515154 | 300.000009003 | 633.800318506 | 105606356 | 105606356 | 363.93 | 172.26 | 256 | 362.039 | 362.039 |

scenario MaxPollRecords=10000 FetchMinBytes=1 MaxPollWait=500ms:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 300.520444194 | 300.000009211 | 300.401036392 | 27582662 | 27582662 | 95.06 | 94.94 | 0.707107 | 1 | 1.41421 |
| 300 | 2400 | 1024 | 2 | 300.717424964 | 300.000008759 | 300.600821841 | 54329454 | 54329454 | 187.24 | 186.86 | 1.41421 | 2 | 2 |
| 300 | 4800 | 1024 | 4 | 362.237852628 | 300.000012872 | 362.100077259 | 104928793 | 104928793 | 361.59 | 299.58 | 64 | 90.5097 | 90.5097 |

scenario MaxPollRecords=10000 FetchMinBytes=10 MaxPollWait: 100ms:

| (config) duration (s) | (config) timeout (s) | (config) event size (B) | (config) partitions | total duration (s) | production duration (s) | consumption duration (s) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 300 | 1500 | 1024 | 1 | 300.381111593 | 300.00085311 | 300.301392553 | 27547995 | 27547995 | 94.94 | 94.85 | 0.353553 | 0.707107 | 1.41421 |
| 300 | 2400 | 1024 | 2 | 300.512250301 | 300.000009904 | 300.40044246 | 54298777 | 54298777 | 187.13 | 186.88 | 1 | 1.41421 | 1.41421 |
| 300 | 4800 | 1024 | 4 | 358.790258513 | 300.000017024 | 358.700447427 | 104625464 | 104625464 | 360.55 | 301.55 | 64 | 90.5097 | 90.5097 |