strimzi/kafka-quotas-plugin

Ignored quotas and client groups metrics

fvaleri opened this issue · 6 comments

After enabling the Strimzi quotas plugin, any configuration for the default Kafka quota plugin is ignored without warning, and you also lose per-client-group metrics. The metrics part is a bit annoying, as the plugin only exposes metrics aggregated per quota type (FETCH, PRODUCE, REQUEST) rather than per client group (e.g. my-user, my-user-new). Per-client-group metrics are useful when the cluster admin doesn't have access to client-side metrics. I think this behavior is driven by the ClientQuotaCallback.quotaMetricTags implementation.

For example, if I have a Kafka CR like this:

spec:
  kafka:
    config:
      client.quota.callback.class: io.strimzi.kafka.quotas.StaticQuotaCallback
      client.quota.callback.static.storage.check-interval: 30
      client.quota.callback.static.storage.hard: 32212254720
      client.quota.callback.static.storage.soft: 26843545600
    listeners:
    - authentication:
        type: scram-sha-512
      name: external
      port: 9094
      tls: true
      type: route

And a KafkaUser CR like this:

spec:
  authentication:
    type: scram-sha-512
  quotas:
    producerByteRate: 500000

There is no throttling if I run a producer at full speed (well over 500 KB/s):

[Screenshot: strimzi-quotas]

Instead, this is what happens with the same producer after I remove the client.quota.callback.* configs and wait for the rolling update to complete:

[Screenshot: default-quotas]

Feedback from @robobario:

So far per-user quota enforcement has been outside what the plugin is trying to do:

The default quota plugin in Apache Kafka will hand out a unique quota per client. This plugin will configure a total quota independent of the number of clients. For example, if you have configured a produce quota of 40 MB/second, and you have 10 producers running as fast as possible, they will be limited to 4 MB/second each.
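The arithmetic in that example can be sketched as follows. This is only an illustration: the class and method names are hypothetical and not part of the plugin, and the even split assumes that every producer is saturating the shared limit.

```java
// Hypothetical illustration of shared (static) versus per-client quota arithmetic.
final class QuotaShareSketch {

    // With one shared quota, clients that all run at full speed split a single
    // limit between them (assuming an even split).
    static double sharedPerClientRate(double totalQuotaBytesPerSec, int clients) {
        return totalQuotaBytesPerSec / clients;
    }

    // With a quota handed out per client, each client may use the full quota,
    // so the aggregate ceiling grows with the number of clients.
    static double perClientAggregateRate(double quotaBytesPerSec, int clients) {
        return quotaBytesPerSec * clients;
    }

    public static void main(String[] args) {
        double quota = 40_000_000; // 40 MB/second
        System.out.println("shared, per producer: " + sharedPerClientRate(quota, 10));
        System.out.println("per-client, aggregate: " + perClientAggregateRate(quota, 10));
    }
}
```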

You are correct that both the metrics and the quota enforcement are driven by the metric tags. We tag each request with a Map<String, String>, which is used to create or retrieve a unique sensor that applies the quota; this sensor is backed by the rate metric exposed through JMX. To apply the static quota we tag all produce requests with {"quota.type": "PRODUCE"}, which is why you see Produce > PRODUCE in the MBeans. The granularity you see in JConsole matches the granularity at which we apply the quota.
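To make the link between tags and metric granularity concrete, here is a minimal sketch that keys a counter by the tag map, as a stand-in for the broker's sensor lookup. The class is hypothetical and uses only the JDK; the "user" tag key in the per-user variant is an assumption made for contrast, not something taken from the plugin.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the sensor lookup: one byte counter per distinct tag map.
final class TaggedSensorsSketch {
    private final Map<Map<String, String>, Long> bytesByTags = new HashMap<>();

    // Record bytes against the sensor identified by the tags, creating it on first use.
    void record(Map<String, String> tags, long bytes) {
        bytesByTags.merge(tags, bytes, Long::sum);
    }

    int sensorCount() {
        return bytesByTags.size();
    }

    long bytesFor(Map<String, String> tags) {
        return bytesByTags.getOrDefault(tags, 0L);
    }

    // Static tagging: every produce request carries the same tags.
    static Map<String, String> staticProduceTags() {
        return Map.of("quota.type", "PRODUCE");
    }

    // Per-user tagging, for contrast only (the "user" key is an assumption).
    static Map<String, String> perUserTags(String user) {
        return Map.of("user", user);
    }

    public static void main(String[] args) {
        TaggedSensorsSketch shared = new TaggedSensorsSketch();
        shared.record(staticProduceTags(), 100); // request from my-user
        shared.record(staticProduceTags(), 50);  // request from my-user-new
        System.out.println("sensors with static tags: " + shared.sensorCount());

        TaggedSensorsSketch perUser = new TaggedSensorsSketch();
        perUser.record(perUserTags("my-user"), 100);
        perUser.record(perUserTags("my-user-new"), 50);
        System.out.println("sensors with per-user tags: " + perUser.sensorCount());
    }
}
```

Because identical tag maps resolve to the same entry, static tagging collapses all producers into one sensor and therefore one metric, while per-user tags yield one sensor per user.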

The callback is told when per-user/client quotas are configured. Currently the StaticQuotaCallback does nothing with that information, but we could emit warning logs there about incompatible quota configuration.
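A rough sketch of what such a warning could look like is shown here. It does not use the Kafka API: onQuotaUpdate is a simplified, hypothetical stand-in for the point where the callback is notified, and the message wording is invented for illustration.

```java
import java.util.Locale;
import java.util.logging.Logger;

// Hypothetical sketch: warn when a per-user/client quota is configured but not enforced.
final class QuotaConfigWarningSketch {
    private static final Logger LOG =
            Logger.getLogger(QuotaConfigWarningSketch.class.getName());

    // Build the warning text for a quota that the static callback will not enforce.
    static String incompatibleQuotaWarning(String quotaType, String entity, double value) {
        return String.format(Locale.ROOT,
                "Ignoring %s quota of %.0f for %s: the static quota callback applies one "
                        + "shared quota per type and does not enforce per-user or per-client quotas",
                quotaType, value, entity);
    }

    // Simplified stand-in for the notification the callback receives on a quota change.
    static void onQuotaUpdate(String quotaType, String entity, double value) {
        LOG.warning(incompatibleQuotaWarning(quotaType, entity, value));
    }

    public static void main(String[] args) {
        // Values taken from the KafkaUser example in this issue.
        onQuotaUpdate("PRODUCE", "user my-user", 500000);
    }
}
```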

In theory we could support the per-user/per-client quotas somehow:

We could make user/clientId quotas take precedence over the static quota (so tag their requests the same way Kafka does by default if that user/clientId has a quota defined), and use the static quota only for clients without a specific quota. So if the total quota is 40 MB/sec and Fred has a 5 MB/sec quota, the total potential throughput would be 45 MB/sec. We would have one metric in JMX for Fred, and one metric for all other users.
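The precedence idea could look roughly like the sketch below. It is not plugin code: the class, the method names, and the "user" tag key are assumptions chosen to show how tag selection would decide which sensor a request lands on.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the "per-user quota takes precedence" option.
final class PrecedenceTaggingSketch {

    // Users with their own quota get their own tags (and so their own sensor and metric);
    // everyone else shares the tags of the static quota.
    static Map<String, String> produceTagsFor(String user, Set<String> usersWithQuota) {
        if (usersWithQuota.contains(user)) {
            return Map.of("user", user);
        }
        return Map.of("quota.type", "PRODUCE");
    }

    // Upper bound on total throughput: the shared static quota plus every individual quota.
    static double totalPotentialRate(double staticQuota, Map<String, Double> userQuotas) {
        return staticQuota
                + userQuotas.values().stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        Set<String> usersWithQuota = Set.of("Fred");
        System.out.println(produceTagsFor("Fred", usersWithQuota));
        System.out.println(produceTagsFor("Alice", usersWithQuota));
        System.out.println(totalPotentialRate(40_000_000, Map.of("Fred", 5_000_000.0)));
    }
}
```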

Alternatively we could subtract all the per-client quotas from the static quota. So if I have a 40MB/sec total quota and user Fred has a 5MB/sec quota created, we could reduce the static quota to 35MB/sec, so everyone who's not Fred shares the 35MB/sec. We would have one metric in JMX for Fred, and one metric for all other users. (edit: I don't think this could work because users can set a default quota for users)

Either way we will not have the same metrics as the Kafka defaults. The default Kafka implementation tracks metrics for other users/clientIds depending on what quotas you have configured. For example, if you have a single quota set for user A, by default Kafka will collect metrics for all users regardless of whether they have a quota. If you have a single quota set for a clientId, by default Kafka will collect metrics for all clientIds, and so on. Since we will be sharing a quota across clients somewhere, that quota has to have one set of tags and one metric.

edit: or you could enable users to choose between static quotas or using the per-client quotas. This way you could make the metrics look the same as default in per-client mode.

@robobario thanks for the insight.

The callback is told when per-user/client quotas are configured. Currently the StaticQuotaCallback does nothing with that information, but we could emit warning logs there about incompatible quota configuration.

That would be great, and we should also add documentation explaining the difference in metrics grouping when switching from the default plugin to the Strimzi plugin, as some administrators may rely on these metrics.

or you could enable users to choose between static quotas or using the per-client quotas. This way you could make the metrics look the same as default in per-client mode.

Well, we already have the default plugin for that, unless you mean having both disk protection and per-client metrics.

Well, we already have the default plugin for that, unless you mean having both disk protection and per-client metrics.

Yes, then users could choose to keep applying their existing quota configuration but get out of disk protection.

It would be a little painful in that we would have to reimplement the same logic that Kafka provides out of the box in our custom callback.

@fvaleri Is this still relevant? My understanding is that the user has to choose one or the other and we currently do not plan to change it.

@scholzj I guess we just have to document the different behavior with regard to JMX metrics. Some admins rely on this information, so they need to know that they will lose the per-client-group metrics when using this plugin.

@fvaleri Ok, would you be able to open a PR to add that to the README file? I guess that is the only documentation at this point. (Or if it was meant to be for the old version, I guess you could also add it here: https://strimzi.io/docs/operators/latest/full/deploying.html#proc-setting-broker-limits-str) Thanks.