Benchmarking: Collect throughput metrics for AWS MSK
Description
We want to test and benchmark producing and consuming against Kafka. The goal is to establish a throughput baseline, and we'll use the cloud provider AWS MSK for this.
We MUST collect throughput and usage metrics.
Design
To keep things simple, we could start by creating a single command under `cmd/queuebench`. We may retrofit the functionality into `testing.B`, but it'll be quicker and less cumbersome to start with a single command that runs a single benchmark for some time and prints the results.
We'll run a single producer and a single consumer for the benchmark and collect the metrics using a manual OTLP reader.
After the benchmark has finished, we'll collect the metrics that we want to print and perform any calculations (such as events/s and bytes/s).
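For the manual reader, a sketch along these lines should work with the OpenTelemetry Go SDK (the wiring into the producer/consumer instrumentation is left out and the names are only illustrative):

```go
package main

import (
	"context"
	"fmt"

	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/metric/metricdata"
)

func main() {
	ctx := context.Background()

	// A manual reader only collects metrics when asked, so we can take a
	// single snapshot once the benchmark has finished.
	reader := sdkmetric.NewManualReader()
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithReader(reader))
	defer provider.Shutdown(ctx)

	// ... hand `provider` to the producer/consumer instrumentation and run
	// the benchmark here ...

	// Snapshot all recorded metrics and derive events/s, bytes/s, etc.
	var rm metricdata.ResourceMetrics
	if err := reader.Collect(ctx, &rm); err != nil {
		panic(err)
	}
	for _, sm := range rm.ScopeMetrics {
		for _, m := range sm.Metrics {
			fmt.Println(m.Name) // inspect m.Data for sums/histograms
		}
	}
}
```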
Setup
The idea is to run the benchmark against a single topic (for now) and a single partition (we can expose a `-partitions` flag which allows setting the number of partitions for the topic, but default to `1` for now). The command will:
- Create a unique topic name with the specified number of partitions (`-partitions`).
- Generate N events with the specified byte size (`-event-size`).
- Create a manual reader for the OTLP exporter.
- Create a consumer and start consuming in a goroutine.
- While the timer isn't exceeded, produce events.
- Stop the producer.
- Stop the consumer after all events have been consumed.
- Collect metrics and perform any necessary calculations to print the results (a sketch of this flow follows the list).
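Below is a minimal sketch of that produce-then-drain flow. The `Producer` and `Consumer` interfaces are hypothetical stand-ins for the apm-queue types, and the helper is illustrative rather than the actual queuebench implementation.

```go
package queuebench

import (
	"context"
	"time"
)

// Hypothetical minimal interfaces standing in for the apm-queue producer/consumer.
type Producer interface {
	Produce(ctx context.Context, payload []byte) error
	Close() error
}

type Consumer interface {
	// Run blocks until all produced events have been consumed or ctx expires.
	Run(ctx context.Context) error
}

// runBenchmark produces fixed-size events for duration d while a single
// consumer drains the topic in the background, and returns the number of
// events produced once consumption has caught up.
func runBenchmark(ctx context.Context, p Producer, c Consumer, d time.Duration, eventSize int) (int, error) {
	payload := make([]byte, eventSize) // random bytes in practice

	done := make(chan error, 1)
	go func() { done <- c.Run(ctx) }()

	produced := 0
	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
		if err := p.Produce(ctx, payload); err != nil {
			return produced, err
		}
		produced++
	}
	if err := p.Close(); err != nil {
		return produced, err
	}

	// Wait for the consumer to drain (bounded by the overall benchmark timeout in ctx).
	if err := <-done; err != nil {
		return produced, err
	}
	return produced, nil
}
```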
Inputs
- `-event-size`: event size in bytes, default 1024 (1KB).
- `-partitions`: default 1 (keep it to producing to a single topic for now).
- `-duration`: the amount of time to run the benchmark for.
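As a sketch, the flags could be declared with the standard `flag` package (the 5m default for `-duration` is an assumption, not something specified above):

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

var (
	eventSize  = flag.Int("event-size", 1024, "size of each generated event, in bytes")
	partitions = flag.Int("partitions", 1, "number of partitions for the benchmark topic")
	duration   = flag.Duration("duration", 5*time.Minute, "how long to produce events for")
)

func main() {
	flag.Parse()
	fmt.Printf("event-size=%d partitions=%d duration=%s\n", *eventSize, *partitions, *duration)
}
```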
Output (derived from metrics)
For now, we can just output the metrics in a table format. We may change this over time as we run these benchmarks periodically and in CI.
- Bytes/s
- Events/s
- Produce error rate in %.
- Consumption latency. We could print p50, p90, p99.
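As a rough sketch of the derived output (function and variable names are illustrative, not the actual queuebench code; the latency percentiles would come from the delay histogram and are not shown):

```go
package main

import (
	"fmt"
	"time"
)

// report prints the derived throughput figures from the raw counters
// collected after a run.
func report(producedEvents, producedBytes, produceErrors int64, produceDuration time.Duration) {
	secs := produceDuration.Seconds()
	fmt.Printf("events/s:           %.2f\n", float64(producedEvents)/secs)
	fmt.Printf("bytes/s:            %.2f\n", float64(producedBytes)/secs)
	fmt.Printf("produce error rate: %.2f%%\n",
		100*float64(produceErrors)/float64(producedEvents+produceErrors))
}

func main() {
	// Example numbers only.
	report(1_000_000, 1_000_000*1024, 0, 5*time.Minute)
}
```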
Adding some questions here:
- do we have tooling to generate data?
- I expect we are not yet collecting all the needed metrics for this benchmark, is that correct?
- Do we store those metrics long term?
- How should this benchmark system be monitored?
@endorama I've updated the issue to include more details. It was very brief before and contained incorrect metrics to collect.
do we have tooling to generate data?
My thinking is that we can keep that as a variable and just generate random bytes to produce and have the size be user-specified.
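Something as simple as this should do for the payload generation (a sketch; whether queuebench uses `crypto/rand` or `math/rand` is not specified here):

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newPayload returns eventSize random bytes to use as the record body.
func newPayload(eventSize int) []byte {
	b := make([]byte, eventSize)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return b
}

func main() {
	fmt.Println(len(newPayload(1024)))
}
```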
I expect we are not yet collecting all the needed metrics for this benchmark, is that correct?
After updating the metrics to output, we're recording all the metrics we need to print.
Do we store those metrics long term?
Not for now. We may store the output as .txt files in the future when the benchmarks are set up to run in CI.
How should this benchmark system be monitored?
I'm not sure I understand the question. The current task only covers the implementation of a command to run benchmarks against an AWS MSK Serverless cluster. We're not going to be working on the CI bit just yet, so I think this doesn't apply.
I think we should collect some numbers before closing this. The tooling is there, now we need to use it.
I've run the benchmark and collected some results. I also found a possible issue that I'll report later.
NOTE: please refer to these values instead of the ones in this comment
Benchmark results
Environment:
- an MSK cluster created by this terraform code (within the `msk` module in `infra/aws`) using AWS MSK default configurations
- an EC2 t2.micro instance running queuebench, with a production duration of 5m and a timeout of 25m (as instance size matters, is there any guidance on the instance size to use?)
Out of 3 runs:
- produced and consumed an average of 27405450.67 ±0.004% total events
- produced an average of
  - 91,352 ±0.004% events per second
  - 94,451,547.28 ±0.004% bytes per second
- consumed an average of
  - 19,717 ±3.88% events per second
  - 20,386,550.88 ±3.88% bytes per second

Production duration was around 5m: average 300.000015s ±0%.
Consumption duration was around 22m: average 1401.789669s ±3.64%.
Percentiles for consumer delay:
- p50 is between 500ms and 750ms in all 3 cases
- p90 is between 750ms and 1000ms in 1 case, between 1000ms and 2500ms in 2 cases
- p95 is between 750ms and 1000ms in 1 case, between 1000ms and 2500ms in 2 cases
Results come with a caveat: `consumer.messages.delay` is a `Float64Histogram`, so we have limited resolution in the final data. That's why the values read "is between"; those are the histogram boundaries.
The default boundaries for the histogram are:
`0 | 5 | 10 | 25 | 50 | 75 | 100 | 250 | 500 | 750 | 1000 | 2500 | 5000 | 7500 | 10000`
Looking at the results and the distribution of latency in the histogram, I suspect most of our tests will always give a p90 and p95 of 1000ms or 2500ms, as roughly 40% of measurements fall in that area (unless benchmark conditions change or there are relevant performance improvements).
We can be OK with this, but I don't think it provides enough resolution.
The easiest solution would be to change the histogram boundaries in queuebench. I discussed this with @dmathieu and there is a caveat: histogram buckets in otel are global, and we are setting some values in apmotel (as defined by the Elastic APM specs), so we may not be able to use the same boundaries in production and in benchmarks. I'm not sure how relevant this is.
Another solution could be to not rely on otel-collected metrics and use a different approach (either extracting metrics from traces or something custom).
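For the first option, this is a hedged sketch of overriding the boundaries only for this instrument via an SDK view, which would leave the apmotel defaults untouched elsewhere (the boundary values below are purely illustrative; the instrument name is the one mentioned above):

```go
package main

import (
	"context"

	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	reader := sdkmetric.NewManualReader()

	// Override the bucket boundaries only for the delay histogram used by
	// queuebench, without touching the global/apmotel defaults.
	delayView := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "consumer.messages.delay"},
		sdkmetric.Stream{
			Aggregation: sdkmetric.AggregationExplicitBucketHistogram{
				// Illustrative boundaries with finer resolution around the
				// ranges observed in these benchmarks.
				Boundaries: []float64{100, 250, 500, 750, 1000, 1250, 1500, 2000, 2500, 5000},
			},
		},
	)

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(reader),
		sdkmetric.WithView(delayView),
	)
	defer provider.Shutdown(context.Background())
}
```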
@endorama Thanks for posting the results.
an EC2 t2.micro instance running queuebench, with production duration of 5m and timeout 25m (as instance size matters, is there any guidance on the instance size to use?)
I would steer clear from `tx.xxx` instance types. They will rely heavily on CPU credits as soon as any CPU is required. Also, a `t2.micro` instance type will not have a lot of hardware resources assigned to it.
Instead, I'd use general-purpose hardware that's a bit more powerful and where network, CPU, memory, or disk I/O won't be a constraint. An `m6g.large` or `m7g.large` seems like a good fit to begin with.
Is the same instance used to produce and consume?
Is the same instance used to produce and consume?
Yes
I would steer clear from tx.xxx instance types.
Indeed, I already ran the tests on `c6i.xlarge`, which is the same instance type used by apmbench:
https://github.com/elastic/apm-server/blob/e2c2ed056f3bfa2cd8b2c6a218fa700c34552cf4/testing/infra/terraform/modules/benchmark_executor/variables.tf#L15-L19
An m6g.large or m7g.large seems like a good fit to begin with.
Any particular reason to choose this instance type?
Reporting benchmark run on a `c6i.xlarge`, duration: `5m`, timeout: `25m`, event-size: `1024`, partitions: `1`.
- production
  - produced 27533659 events in 300.000631471 seconds
  - produced events per second: 91778.6701481046
  - produced MB per second: 94.89341383520357
- consumption
  - consumed 27533659 events in 925.301167017 seconds
  - consumed events per second: 29756.429562024037
  - consumed MB per second: 30.76629003373663
- error rate
  - not consumed: 0 (0.00%)
- consumption latency (seconds)
  - p50: 362.039
  - p90: 724.077
  - p99: 724.077
@marclop FWIW @endorama and I discussed the burstiness of t2 last week, and I suggested that, given the low variance in the results, they should be reliable. Otherwise, if bursting had impacted the results, we would expect to see higher variance.
Looking at the latest numbers, it seems that was fair for producing but not so much for consuming.
Thanks for posting the results! seems like producing to a topic with 1 partition has pretty good throughput. Given how the number of partitions controls the maximum concurrency for a given topic, I'd be interested to see the throughput with 2 and 4 partitions and see how that compares.
Is there a way to run this centrally? Or are you running them from your workstation?
I'm running them from a dedicated EC2 machine (infra setup in https://github.com/elastic/apm-managed-service/pull/274). I don't know if we have a team-shared EC2 keypair, or we could add team members' SSH keys to a cloud-init script for the instance.
I'd be interested to see the throughput with 2 and 4 partitions and see how that compares.
Reporting benchmark run on a `c6i.xlarge`, 1 producer and 1 consumer with different partitions: 1, 2, 4.

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 2400 | 1024 | 1 | 1338.49317934 | 300.000007736 | 1338.400658493 | 27845782 | 27845782 | 95.97 | 21.51 | 724.077 | 1024 | 1024 |
300 | 2400 | 1024 | 2 | 2244.794431333 | 300.000009827 | 2244.701624293 | 54806738 | 54806738 | 188.89 | 25.24 | 1024 | 2048 | 2048 |
300 | 3600 | 1024 | 4 | 3900.106025814 | 300.000473461 | 3900.105787347 | 104678128 | 84430900 | 360.73 | 22.4 | 2048 | 4096 | 4096 |
300 | 4800 | 1024 | 4 | 5064.633432419 | 300.000393132 | 5064.501086936 | 106683669 | 106683669 | 367.65 | 21.78 | 2896.31 | 5792.62 | 5792.62 |
I had to increase the timeout when increasing `partitions`, as production throughput grows a lot and consumption needs more time to keep up. The test with 4 partitions wasn't able to consume all produced records and timed out (after 1h). I included the result as we were speaking about production; meanwhile, I'm running the test again with a higher timeout.
Are we interested in running the benchmark with different producer/consumer configurations? (e.g. 1 consumer per partition)
Updated table: added the 4-partition run with timeout 4800, which completed successfully.
@endorama Thanks! The producer numbers look good. Regarding consumer concurrency: I think it would be good to first figure out what the limit for one consumer is - we should be able to do better than 25MB/s.
Could you try increasing MaxPollRecords? Default is 100, so let's try 1000 to see the impact - https://github.com/elastic/apm-queue/blob/70febd6d78e11225cf887d7627993d215758571a/kafka/consumer.go#L54C2-L54C16
We could also try increasing https://pkg.go.dev/github.com/twmb/franz-go/pkg/kgo#FetchMinBytes for higher throughput. I don't think it should make a difference in the benchmarks, as the producer is outpacing the consumer quite significantly, so there should always be more than the minimum amount to fetch.
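A hedged sketch of what those two knobs look like as franz-go client options (how exactly queuebench would pass extra `kgo` options to the apm-queue consumer is an assumption here; `MaxPollRecords` itself is the `kafka.ConsumerConfig` field linked above):

```go
package main

import (
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Client-level fetch tuning; these options exist in franz-go, but the way
	// they get wired into the apm-queue consumer is not shown here.
	opts := []kgo.Opt{
		kgo.FetchMinBytes(1),                     // don't wait for a minimum batch size
		kgo.FetchMaxWait(100 * time.Millisecond), // lower fetch.max.wait.ms
	}
	_ = opts
}
```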
Reporting benchmark run on a `c6i.xlarge`, 1 producer and 1 consumer with different partitions: 1, 2, 4, Kafka consumer `MaxPollRecords` set to `1000`.

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.692624063 | 300.000008996 | 301.600596564 | 27679505 | 27679505 | 95.4 | 94.89 | 2 | 4 | 4 |
300 | 2400 | 1024 | 2 | 315.899254262 | 300.003819575 | 315.80385852 | 54699840 | 54699840 | 188.51 | 179.08 | 16 | 22.6274 | 22.6274 |
300 | 4800 | 1024 | 4 | 724.519516157 | 300.000006448 | 724.40076306 | 105248199 | 105248199 | 362.7 | 150.21 | 256 | 512 | 512 |
I also tried increasing FetchMinBytes to 10.
Reporting benchmark run on a `c6i.xlarge`, 1 producer and 1 consumer with different partitions: 1, 2, 4, Kafka consumer `MaxPollRecords` set to `1000`, `FetchMinBytes(10)`.

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.690593174 | 300.00000779 | 301.600345584 | 27547301 | 27547301 | 94.94 | 94.44 | 2.82843 | 4 | 5.65685 |
300 | 2400 | 1024 | 2 | 345.577030678 | 300.000008109 | 345.501031635 | 54632726 | 54632726 | 188.29 | 163.49 | 32 | 45.2548 | 64 |
300 | 4800 | 1024 | 4 | 668.72929976 | 300.000008379 | 668.600260054 | 105607322 | 105607322 | 363.94 | 163.3 | 256 | 362.039 | 362.039 |
I also ran the tests with `MaxPollRecords` set to 10000.
Reporting benchmark run on a `c6i.xlarge`, 1 producer and 1 consumer with different partitions: 1, 2, 4, Kafka consumer `MaxPollRecords` set to `10000`.

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.520444194 | 300.000009211 | 300.401036392 | 27582662 | 27582662 | 95.06 | 94.94 | 0.707107 | 1 | 1.41421 |
300 | 2400 | 1024 | 2 | 300.717424964 | 300.000008759 | 300.600821841 | 54329454 | 54329454 | 187.24 | 186.86 | 1.41421 | 2 | 2 |
300 | 4800 | 1024 | 4 | 362.237852628 | 300.000012872 | 362.100077259 | 104928793 | 104928793 | 361.59 | 299.58 | 64 | 90.5097 | 90.5097 |
Recently I discussed these benchmark results with @simitt and we had 2 doubts:
- whether the consumption delay was reported in `ms` or `s`
- whether increasing `MaxPollRecords` was affecting low-throughput cases

Consumption delay is reported in seconds (I verified this by running a benchmark with a single produced record, so the histogram bucket its consumption delay falls into is comparable to the registered consumption duration).
To verify whether low throughput was being negatively impacted by increasing `MaxPollRecords`, I changed the benchmark production logic to only produce 1 event per second for the specified duration. I ran the benchmark with `MaxPollRecords=100` (default), `MaxPollRecords=1000`, and `MaxPollRecords=10000`.
Here are the results
With `MaxPollRecords=100`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.183734152 | 300.000530789 | 300.101155856 | 300 | 300 | 0 | 0 | 0.00390625 | 0.00552427 | 0.00552427 |
300 | 2400 | 1024 | 2 | 300.245170568 | 300.000985359 | 300.102103178 | 300 | 300 | 0 | 0 | 0.00390625 | 0.00390625 | 0.00390625 |
300 | 4800 | 1024 | 4 | 300.226847042 | 300.000155446 | 300.100669628 | 300 | 300 | 0 | 0 | 0.00552427 | 0.00552427 | 0.00552427 |
With `MaxPollRecords=1000`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.208345594 | 300.000716583 | 300.100997146 | 300 | 300 | 0 | 0 | 0.00552427 | 0.0078125 | 0.0078125 |
300 | 2400 | 1024 | 2 | 300.175919277 | 300.000593529 | 300.100772963 | 300 | 300 | 0 | 0 | 0.00552427 | 0.00552427 | 0.0078125 |
300 | 4800 | 1024 | 4 | 300.214639527 | 300.000472654 | 300.101007325 | 300 | 300 | 0 | 0 | 0.00552427 | 0.0078125 | 0.0078125 |
With `MaxPollRecords=10000`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.191874379 | 300.00073932 | 300.101061813 | 300 | 300 | 0 | 0 | 0.0078125 | 0.0078125 | 0.0078125 |
300 | 2400 | 1024 | 2 | 300.231933353 | 300.00022502 | 300.101230581 | 300 | 300 | 0 | 0 | 0.00552427 | 0.00552427 | 0.00552427 |
300 | 4800 | 1024 | 4 | 300.226024851 | 300.000247006 | 300.100921924 | 300 | 300 | 0 | 0 | 0.00552427 | 0.00552427 | 0.00552427 |
Note that 0 produced/consumed MB/s is a rounding error; the value is smaller than 0.00x.
It seems there is an impact, with the worst-case p95 being ~8ms vs ~5ms for the `1000` case, but that has not been observed for the `10000` case.
It also seems that the performance penalty is tied to the number of partitions, with 4 partitions performing better than 1 when increasing `MaxPollRecords`.
I investigated Kafka consumer settings that could affect throughput. Please correct anything that seems out of order :)
The most relevant Kafka consumer settings that affect throughput are listed below (some of these seem more connected with throughput in multi-partition scenarios, which we have yet to discuss):
Setting | Kafka default (ours) | Effect |
---|---|---|
`fetch.min.bytes` | 1B | The minimum amount of bytes a client wants to receive. If the broker receives a poll request and there are fewer bytes than this setting, it will wait up to `fetch.max.wait.ms`. |
`fetch.max.wait.ms` | 500ms | The amount of time to wait for `fetch.min.bytes` to be satisfied. |
`fetch.max.bytes` | 52428800B | The maximum amount of bytes the server returns for a fetch request. This is not an absolute maximum. The consumer performs fetches in parallel, so this value does not determine the maximum bandwidth used. |
`max.partition.fetch.bytes` | 1048576B | The maximum number of bytes sent by the broker per partition. Affects memory usage and the risk of the consumer timing out (more fetched data may increase poll-loop processing time, potentially causing the consumer to time out and rebalance). |
`max.poll.records` | 500 (100) | The maximum number of records that a single poll request returns. The same processing-time considerations as for `max.partition.fetch.bytes` apply. I'm not sure if there are waiting times implied here; it seems not. Throughput improvements come from fewer network roundtrips. It may also affect memory usage due to processing more data. Note that increasing `max.poll.records` increases the risk of a timeout (in case `.poll()` is not called again before the `max.poll.interval.ms` timeout, which defaults to 5m). |
How do those settings relate to the config options in `apm-queue/kafka.ConsumerConfig`?
Kafka setting | ConsumerConfig option |
---|---|
`fetch.min.bytes` | no option available |
`fetch.max.wait.ms` | `MaxPollWait` |
`fetch.max.bytes` | `MaxPollBytes` |
`max.partition.fetch.bytes` | `MaxPollPartitionBytes` |
`max.poll.records` | `MaxPollRecords` |
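As a sketch of how the mapped options could be set on the consumer config (the field names come from the table above, but the import path, field types, and the other required config fields are assumptions or omitted):

```go
package main

import (
	"time"

	"github.com/elastic/apm-queue/kafka" // import path assumed
)

func main() {
	// MaxPollBytes/MaxPollPartitionBytes use the Kafka defaults from the
	// previous table; MaxPollRecords/MaxPollWait use values being benchmarked.
	cfg := kafka.ConsumerConfig{
		MaxPollRecords:        1000,                   // max.poll.records
		MaxPollWait:           100 * time.Millisecond, // fetch.max.wait.ms
		MaxPollBytes:          52428800,               // fetch.max.bytes
		MaxPollPartitionBytes: 1048576,                // max.partition.fetch.bytes
	}
	_ = cfg
}
```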
I ran 3 benchmarks changing these settings:
1. `MaxPollRecords=1000`, `FetchMinBytes=1`
2. `MaxPollRecords=10000`, `FetchMinBytes=1`
3. `MaxPollRecords=1000`, `FetchMinBytes=10`

(1) and (2) performed better than the tests with `MaxPollRecords=100`.
(3) performed worse than (1).
Actions taken:
- I created a PR to add more details to relevant consumer configs in apm-queue: #278
- I'm running a benchmark with `MaxPollRecords=1000`, `FetchMinBytes=10` but reducing `fetch.max.wait.ms` from `500ms` to `100ms`.
- I plan to transfer some of this into our docs.
Thanks for digging into all this @endorama!
@marclop and I discussed this a little bit yesterday, and Marc raised a good point: as long as we are using At Most Once Delivery, increasing MaxPollRecords will increase our exposure to data loss. That's because we commit the offsets before processing.
So setting to 10000 is perhaps too risky. Setting to 500-1000 would be safer -- generally we should set it to the lowest number that would satisfy throughput requirements.
To complete the picture, I ran these additional runs (I'm starting to call each one a "scenario" because the variables at play are growing).
Duration is always `300`. "Kept up" indicates whether the consumer was able to (almost) keep up with the producer (by time and consumed bytes per second): yes means consuming the events took no more time than producing them, or at most 125% of it; no means consuming the events took at least 2x as long as producing them.
Remember that consumption delay is in seconds.
Max Poll Records | Fetch Min Bytes | Max Poll Wait | partitions | kept up | consumption delay p50 (s) | consumption delay p90 (s) | consumption delay p95 (s) |
---|---|---|---|---|---|---|---|
1000 | 1 | 500ms | 1 | yes | 2 | 4 | 4 |
1000 | 1 | 500ms | 2 | yes | 16 | 22.6274 | 22.6274 |
1000 | 1 | 500ms | 4 | no | 256 | 512 | 512 |
1000 | 10 | 500ms | 1 | yes | 2.82843 | 4 | 5.65685 |
1000 | 10 | 500ms | 2 | yes | 32 | 45.2548 | 64 |
1000 | 10 | 500ms | 4 | no | 256 | 362.039 | 362.039 |
1000 | 10 | 100ms | 1 | yes | 2.82843 | 4 | 4 |
1000 | 10 | 100ms | 2 | yes | 2.82843 | 4 | 4 |
1000 | 10 | 100ms | 4 | no | 256 | 512 | 512 |
1000 | 100 | 100ms | 1 | yes | 2.82843 | 4 | 5.65685 |
1000 | 100 | 100ms | 2 | yes | 45.2548 | 90.5097 | 90.5097 |
1000 | 100 | 100ms | 4 | no | 256 | 362.039 | 362.039 |
10000 | 1 | 500ms | 1 | yes | 0.707107 | 1 | 1.41421 |
10000 | 1 | 500ms | 2 | yes | 1.41421 | 2 | 2 |
10000 | 1 | 500ms | 4 | yes | 64 | 90.5097 | 90.5097 |
10000 | 10 | 100ms | 1 | yes | 0.353553 | 0.707107 | 1.41421 |
10000 | 10 | 100ms | 2 | yes | 1 | 1.41421 | 1.41421 |
10000 | 10 | 100ms | 4 | yes | 64 | 90.5097 | 90.5097 |
All results
scenario `MaxPollRecords=1000 FetchMinBytes=1 MaxPollWait=500ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.692624063 | 300.000008996 | 301.600596564 | 27679505 | 27679505 | 95.4 | 94.89 | 2 | 4 | 4 |
300 | 2400 | 1024 | 2 | 315.899254262 | 300.003819575 | 315.80385852 | 54699840 | 54699840 | 188.51 | 179.08 | 16 | 22.6274 | 22.6274 |
300 | 4800 | 1024 | 4 | 724.519516157 | 300.000006448 | 724.40076306 | 105248199 | 105248199 | 362.7 | 150.21 | 256 | 512 | 512 |
scenario `MaxPollRecords=1000 FetchMinBytes=10 MaxPollWait=500ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.690593174 | 300.00000779 | 301.600345584 | 27547301 | 27547301 | 94.94 | 94.44 | 2.82843 | 4 | 5.65685 |
300 | 2400 | 1024 | 2 | 345.577030678 | 300.000008109 | 345.501031635 | 54632726 | 54632726 | 188.29 | 163.49 | 32 | 45.2548 | 64 |
300 | 4800 | 1024 | 4 | 668.72929976 | 300.000008379 | 668.600260054 | 105607322 | 105607322 | 363.94 | 163.3 | 256 | 362.039 | 362.039 |
scenario `MaxPollRecords=1000 FetchMinBytes=10 MaxPollWait=100ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.878959177 | 300.001936163 | 301.80288708 | 27649339 | 27649339 | 95.29 | 94.72 | 2.82843 | 4 | 4 |
300 | 2400 | 1024 | 2 | 301.487009332 | 300.000008718 | 301.400786651 | 53773842 | 53773842 | 185.32 | 184.46 | 2.82843 | 4 | 4 |
300 | 4800 | 1024 | 4 | 716.016393495 | 300.000007006 | 715.90085789 | 106822033 | 106822033 | 368.13 | 154.26 | 256 | 512 | 512 |
scenario `MaxPollRecords=1000 FetchMinBytes=100 MaxPollWait=100ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.588041102 | 300.000009822 | 301.501001782 | 27537477 | 27537477 | 94.91 | 94.43 | 2.82843 | 4 | 5.65685 |
300 | 2400 | 1024 | 2 | 368.110207242 | 300.00000629 | 368.000307641 | 54652175 | 54652175 | 188.35 | 153.55 | 45.2548 | 90.5097 | 90.5097 |
300 | 4800 | 1024 | 4 | 633.922515154 | 300.000009003 | 633.800318506 | 105606356 | 105606356 | 363.93 | 172.26 | 256 | 362.039 | 362.039 |
scenario `MaxPollRecords=10000 FetchMinBytes=1 MaxPollWait=500ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.520444194 | 300.000009211 | 300.401036392 | 27582662 | 27582662 | 95.06 | 94.94 | 0.707107 | 1 | 1.41421 |
300 | 2400 | 1024 | 2 | 300.717424964 | 300.000008759 | 300.600821841 | 54329454 | 54329454 | 187.24 | 186.86 | 1.41421 | 2 | 2 |
300 | 4800 | 1024 | 4 | 362.237852628 | 300.000012872 | 362.100077259 | 104928793 | 104928793 | 361.59 | 299.58 | 64 | 90.5097 | 90.5097 |
scenario `MaxPollRecords=10000 FetchMinBytes=10 MaxPollWait=100ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.381111593 | 300.00085311 | 300.301392553 | 27547995 | 27547995 | 94.94 | 94.85 | 0.353553 | 0.707107 | 1.41421 |
300 | 2400 | 1024 | 2 | 300.512250301 | 300.000009904 | 300.40044246 | 54298777 | 54298777 | 187.13 | 186.88 | 1 | 1.41421 | 1.41421 |
300 | 4800 | 1024 | 4 | 358.790258513 | 300.000017024 | 358.700447427 | 104625464 | 104625464 | 360.55 | 301.55 | 64 | 90.5097 | 90.5097 |
From these results I think we can draw some conclusions:
- `MaxPollRecords=1000` enables a stable (production/consumption) throughput of ~185MB/s with 2 partitions
- `MaxPollWait=500ms` can be reduced
- `FetchMinBytes=10` helps in the scenarios with 1 or 2 partitions, less in the scenarios with 4

The sweet spot so far, also considering the concerns in the previous comment, seems to be `MaxPollRecords=1000 FetchMinBytes=10 MaxPollWait=100ms`.
Can we set a number for our throughput requirements, so we can compare these results against it?
(I'm not sure how much the fixed event size of `1024` is affecting these results, so please keep that in mind.)
I think this can be closed now, @axw what do you think? I'd open a follow-up issue to document the findings in our docs.
In the end I forgot to follow up with the docs PR, which is now up here: https://github.com/elastic/apm-managed-service/pull/619
As tables in docsmobile do not handle well all the columns needed to display this benchmark, I'm doing a recap of the raw data here. I'll link to this comment from our documentation, leveraging GitHub tables.
All scenarios were run on a `c6i.xlarge`, with a single program using 1 producer and 1 consumer to produce and consume data for the specified amount of time (`300` seconds).
More details on the Kafka settings can be found in this comment.
scenario `MaxPollRecords=100 FetchMinBytes=1 MaxPollWait=500ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 2400 | 1024 | 1 | 1338.49317934 | 300.000007736 | 1338.400658493 | 27845782 | 27845782 | 95.97 | 21.51 | 724.077 | 1024 | 1024 |
300 | 2400 | 1024 | 2 | 2244.794431333 | 300.000009827 | 2244.701624293 | 54806738 | 54806738 | 188.89 | 25.24 | 1024 | 2048 | 2048 |
300 | 3600 | 1024 | 4 | 3900.106025814 | 300.000473461 | 3900.105787347 | 104678128 | 84430900 | 360.73 | 22.4 | 2048 | 4096 | 4096 |
300 | 4800 | 1024 | 4 | 5064.633432419 | 300.000393132 | 5064.501086936 | 106683669 | 106683669 | 367.65 | 21.78 | 2896.31 | 5792.62 | 5792.62 |
scenario `MaxPollRecords=1000 FetchMinBytes=1 MaxPollWait=500ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.692624063 | 300.000008996 | 301.600596564 | 27679505 | 27679505 | 95.4 | 94.89 | 2 | 4 | 4 |
300 | 2400 | 1024 | 2 | 315.899254262 | 300.003819575 | 315.80385852 | 54699840 | 54699840 | 188.51 | 179.08 | 16 | 22.6274 | 22.6274 |
300 | 4800 | 1024 | 4 | 724.519516157 | 300.000006448 | 724.40076306 | 105248199 | 105248199 | 362.7 | 150.21 | 256 | 512 | 512 |
scenario `MaxPollRecords=1000 FetchMinBytes=10 MaxPollWait=500ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.690593174 | 300.00000779 | 301.600345584 | 27547301 | 27547301 | 94.94 | 94.44 | 2.82843 | 4 | 5.65685 |
300 | 2400 | 1024 | 2 | 345.577030678 | 300.000008109 | 345.501031635 | 54632726 | 54632726 | 188.29 | 163.49 | 32 | 45.2548 | 64 |
300 | 4800 | 1024 | 4 | 668.72929976 | 300.000008379 | 668.600260054 | 105607322 | 105607322 | 363.94 | 163.3 | 256 | 362.039 | 362.039 |
scenario `MaxPollRecords=1000 FetchMinBytes=10 MaxPollWait=100ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.878959177 | 300.001936163 | 301.80288708 | 27649339 | 27649339 | 95.29 | 94.72 | 2.82843 | 4 | 4 |
300 | 2400 | 1024 | 2 | 301.487009332 | 300.000008718 | 301.400786651 | 53773842 | 53773842 | 185.32 | 184.46 | 2.82843 | 4 | 4 |
300 | 4800 | 1024 | 4 | 716.016393495 | 300.000007006 | 715.90085789 | 106822033 | 106822033 | 368.13 | 154.26 | 256 | 512 | 512 |
scenario `MaxPollRecords=1000 FetchMinBytes=100 MaxPollWait=100ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 301.588041102 | 300.000009822 | 301.501001782 | 27537477 | 27537477 | 94.91 | 94.43 | 2.82843 | 4 | 5.65685 |
300 | 2400 | 1024 | 2 | 368.110207242 | 300.00000629 | 368.000307641 | 54652175 | 54652175 | 188.35 | 153.55 | 45.2548 | 90.5097 | 90.5097 |
300 | 4800 | 1024 | 4 | 633.922515154 | 300.000009003 | 633.800318506 | 105606356 | 105606356 | 363.93 | 172.26 | 256 | 362.039 | 362.039 |
scenario `MaxPollRecords=10000 FetchMinBytes=1 MaxPollWait=500ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.520444194 | 300.000009211 | 300.401036392 | 27582662 | 27582662 | 95.06 | 94.94 | 0.707107 | 1 | 1.41421 |
300 | 2400 | 1024 | 2 | 300.717424964 | 300.000008759 | 300.600821841 | 54329454 | 54329454 | 187.24 | 186.86 | 1.41421 | 2 | 2 |
300 | 4800 | 1024 | 4 | 362.237852628 | 300.000012872 | 362.100077259 | 104928793 | 104928793 | 361.59 | 299.58 | 64 | 90.5097 | 90.5097 |
scenario `MaxPollRecords=10000 FetchMinBytes=10 MaxPollWait=100ms`:

(config) duration | (config) timeout | (config) event size | (config) partitions | total duration | duration (production) | duration (consumption) | produced records | consumed records | produced (MB/s) | consumed (MB/s) | consumption delay (p50, s) | consumption delay (p90, s) | consumption delay (p95, s) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1500 | 1024 | 1 | 300.381111593 | 300.00085311 | 300.301392553 | 27547995 | 27547995 | 94.94 | 94.85 | 0.353553 | 0.707107 | 1.41421 |
300 | 2400 | 1024 | 2 | 300.512250301 | 300.000009904 | 300.40044246 | 54298777 | 54298777 | 187.13 | 186.88 | 1 | 1.41421 | 1.41421 |
300 | 4800 | 1024 | 4 | 358.790258513 | 300.000017024 | 358.700447427 | 104625464 | 104625464 | 360.55 | 301.55 | 64 | 90.5097 | 90.5097 |