logstash-plugins/logstash-output-kafka

Retry indefinitely

jordansissel opened this issue · 4 comments

The defaults in this plugin set retries => 0. Further, the KafkaProducer has no mechanism for infinite retry, and recommends this:

Note that the above example may drop records if the produce request fails. If we want to ensure that this does not occur we need to set retries=<large_number> in our config.

"Large number" is probably not a great default because it still allows for data loss.

The defaults in this plugin are not what we want Logstash to do out of the box. This default (zero retry) means Logstash loses data through the Kafka Producer during any network fault or Kafka fault.

Proposal:

Change the default behavior (breaking change) to retry until successful.

  • Change the default retries to be nil with the implication of infinite retry (breaking change)
  • KafkaProducer.send() returns a Future that we can inspect for success.
  • If @retries is nil (proposed default), any failed send()'s must be retried
  • If @retries is a number, retry failed send() that number of times.
  • Do not use send() asynchronously. Always check the Future's result.

Compatibility

This proposal changes only the default behavior which, as proposed, is to retry indefinitely.

Any other value of retry will still work the same as it did before (behavior-wise). I am of the opinion that it is incorrect to set retry at all, but debating the merits of retry-then-give-up is outside the scope of this issue.

Logstash's practice has always been to default to never dropping data and resolving delivery failures by retrying until successful. The current behavior causes the Kafka client to drop data any time there's a downstream problem, and this is a bug.

I'm open to discussion on whether this is a bug or a breaking change.

I agree this is a bug and isn't a good default. If we change the default retries, it still wouldn't be a breaking change IMO since the option is still there and any user specified value in their config would still work on upgrade.

I need to think about how we can handle this asynchronously. Maybe DLQ is an option here upon unsuccessful delivery after x retries? Open to discuss this.

since the option is still there and any user specified value in their config would still work on upgrade.

I think we can make retries work correctly when the user sets it. I think my proposal more clearly is:

  • when retries => ... is set by the user, we retry at most that many times before abandoning.
  • when retries => ... is default, we will retry indefinitely
  • retries => 0 still means never retry but now (proposed) the default will be indefinite retry.

I'll have a PR for this shortly.