splunk/kafka-connect-splunk

Connector Not Respecting splunk.hec.backoff.threshhold.seconds Value as Expected

A122813 opened this issue · 6 comments

We are seeing what amounts to retry storms in our environments. We have splunk.hec.backoff.threshhold.seconds set to 120, but we see almost instant retries during short periods of HEC endpoint issues. We would expect a failed batch to wait 120 seconds before being retried. Maybe this is a misreading of the documentation on our end, but either way it is a problem we would want to remedy.

Thanks.

Splunk Version: 8.2.2201.1
Connector Version: 2.0.8
Kafka Connect Version: 7.2

Connector Config

{
    "name": "5658-Splunk-Flagr",
    "config": {
        "connector.class": "com.splunk.kafka.connect.SplunkSinkConnector",
        "consumer.override.group.id": "5658-A-PD-SplunkSink-Flagr",
        "splunk.hec.raw": "false",
        "tasks.max": "3",
        "consumer.override.sasl.jaas.config": "REDACTED",
        "splunk.hec.max.http.connection.per.channel": "4",
        "splunk.hec.uri": "REDACTED",
        "splunk.hec.json.event.formatted": "false",
        "consumer.override.max.poll.records": "5000",
        "splunk.hec.ack.enabled": "false",
        "splunk.hec.max.batch.size": "5000",
        "splunk.header.support": "true",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "splunk.header.host": "host",
        "splunk.hec.token": "REDACTED",
        "topics": "5658-Flagr",
        "splunk.hec.use.record.timestamp": "false",
        "splunk.flush.window": "120",
        "splunk.hec.total.channels": "4",
        "splunk.hec.backoff.threshhold.seconds": "120",
        "header.converter": "org.apache.kafka.connect.storage.StringConverter",
        "key.converter.schemas.enable": "false",
        "value.converter.schema.registry.url": "REDACTED",
        "consumer.override.auto.offset.reset": "latest",
        "splunk.hec.ssl.validate.certs": "false",
        "confluent.topic.security.protocol": "SASL_SSL",
        "splunk.sourcetypes": "flagr:container_logging_Prod",
        "value.converter.schemas.enable": "false",
        "name": "5658-Splunk-Flagr",
        "splunk.hec.http.keepalive": "false",
        "consumer.override.partition.assignment.strategy": "org.apache.kafka.clients.consumer.RoundRobinAssignor",
        "consumer.override.client.id": "5658-A-PD-SplunkSink-Flagr",
        "key.converter.schema.registry.url": "REDACTED"
    }
}

Example Logs:

[2022-10-13 02:57:26,270] WARN attempting to resend batch 3084c8c5-fd5f-4530-a889-588bb3e3af2a with 5000 events, this is attempt 1 out of -1 for this batch  (com.splunk.kafka.connect.SplunkSinkTask)
[2022-10-13 02:57:26,272] INFO add 1 failed batches (com.splunk.kafka.connect.SplunkSinkTask)
[2022-10-13 02:57:26,272] INFO total failed batches 1 (com.splunk.kafka.connect.KafkaRecordTracker)
[2022-10-13 02:57:26,272] ERROR failed to send batch 3084c8c5-fd5f-4530-a889-588bb3e3af2a (com.splunk.kafka.connect.SplunkSinkTask)
com.splunk.hecclient.HecException: No channels are available / registered with LoadBalancer
        at com.splunk.hecclient.LoadBalancer.send(LoadBalancer.java:107)
        ...
[2022-10-13 02:57:26,510] WARN attempting to resend batch 3084c8c5-fd5f-4530-a889-588bb3e3af2a with 5000 events, this is attempt 2 out of -1 for this batch  (com.splunk.kafka.connect.SplunkSinkTask)
[2022-10-13 02:57:26,512] INFO add 1 failed batches (com.splunk.kafka.connect.SplunkSinkTask)
[2022-10-13 02:57:26,512] INFO total failed batches 1 (com.splunk.kafka.connect.KafkaRecordTracker)
[2022-10-13 02:57:26,512] ERROR failed to send batch 3084c8c5-fd5f-4530-a889-588bb3e3af2a (com.splunk.kafka.connect.SplunkSinkTask)
com.splunk.hecclient.HecException: No channels are available / registered with LoadBalancer
        at com.splunk.hecclient.LoadBalancer.send(LoadBalancer.java:107)
        ...
[2022-10-13 02:57:26,864] WARN attempting to resend batch 3084c8c5-fd5f-4530-a889-588bb3e3af2a with 5000 events, this is attempt 3 out of -1 for this batch  (com.splunk.kafka.connect.SplunkSinkTask)
[2022-10-13 02:57:26,866] INFO add 1 failed batches (com.splunk.kafka.connect.SplunkSinkTask)
[2022-10-13 02:57:26,866] INFO total failed batches 1 (com.splunk.kafka.connect.KafkaRecordTracker)
[2022-10-13 02:57:26,866] ERROR failed to send batch 3084c8c5-fd5f-4530-a889-588bb3e3af2a (com.splunk.kafka.connect.SplunkSinkTask)
com.splunk.hecclient.HecException: No channels are available / registered with LoadBalancer
        at com.splunk.hecclient.LoadBalancer.send(LoadBalancer.java:107)
        at com.splunk.hecclient.Hec.send(Hec.java:235)
        ...

Is splunk.hec.backoff.threshhold.seconds an ack parameter? I see you have ACK disabled ("splunk.hec.ack.enabled": "false")

I think that error occurs when there are no destinations available in the LoadBalancer, possibly because they have been removed due to a health check failure?

If you only have 1 destination in "splunk.hec.uri" and it's a load balancer, try disabling the health check with "splunk.hec.lb.poll.interval = -1" (the default is 120).
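
Roughly, applied to the config you posted above, the change would look something like this (only the relevant keys shown; everything else stays as you have it, and I haven't tested this exact snippet):

{
    "name": "5658-Splunk-Flagr",
    "config": {
        "connector.class": "com.splunk.kafka.connect.SplunkSinkConnector",
        "splunk.hec.uri": "REDACTED",
        "splunk.hec.token": "REDACTED",
        "splunk.hec.lb.poll.interval": "-1"
    }
}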

Just a thought though…

Good luck

> Is splunk.hec.backoff.threshhold.seconds an ack parameter? I see you have ACK disabled ("splunk.hec.ack.enabled": "false")

If it is, then it should be documented as such.

The way it is written ("The amount of time Splunk Connect for Kafka waits to attempt resending after errors from a HEC endpoint"), the parameter should make the connector back off for the configured amount of time before resending batches when we receive a HecException.
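
Concretely, with our value of 120, the batch in the logs above that first failed at 02:57:26 should not have been retried until roughly 02:59:26, yet attempts 1 through 3 all land within about 600 ms of each other.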

Agreed, I only ask as I noticed in 'SplunkSinkConnectorConfig.java' it's documented under "// Acknowledgement Parameters"

[screenshot of SplunkSinkConnectorConfig.java showing the parameter listed under "// Acknowledgement Parameters"]

yet in the main readme it's under 'General Optional Parameters'.

Looking through the code, it does appear to be a general parameter.

The error you're getting is produced by the following code in LoadBalancer.java

    public synchronized void send(final EventBatch batch) {
        if (channels.isEmpty()) {
            throw new HecException("No channels are available / registered with LoadBalancer");
        } 
  ...

This check precedes any backpressure checks, which would explain why the resends in your logs come back-to-back instead of waiting out the configured backoff.

The exception is thrown because the channels list is empty. I believe channels are removed if a health check fails. Do you have debug enabled, and do you see logs such as "Removing channel xxx from loadbalancer"? If so, I would try disabling the health check (assuming you have just one URI configured).

We do see:

[2022-10-20 14:16:49,546] INFO HEC is unhealthy, queues are full for indexer REDACTED (com.splunk.hecclient.LoadBalancer)
[2022-10-20 14:16:49,546] INFO healthcheck failed for REDACTED indexer, removing this indexer and its channels from the loadbalancer (com.splunk.hecclient.LoadBalancer)

The HEC URL we are using is a load-balanced LTM address.

This is interesting, I am going to give it a shot in our development environments.

I would really like to understand the implications of removing the health check. There seems to be value in not attempting to send batches when the HEC channel is reporting down.

> I would really like to understand the implications of removing the health check. There seems to be value in not attempting to send batches when the HEC channel is reporting down.

The health check targets the LB. The response the LB provides comes from one of the indexers (which could be the one that is struggling). As far as Kafka Connect is concerned, that response represents the health of the entire URI target (the LB), so the destination is marked as busy and removed from use until a successful health check is received (default 120 seconds).
Clearly not the intention.
2.0.8 added the ability to disable the health check by setting the interval to a negative value.

In-line back pressure handling still occurs.

Issue reported by @hubert-s; see #324. Resolved in 2.0.8.