confluentinc/parallel-consumer

Sometimes a transaction error occurs - Cannot call send in state COMMITTING_TRANSACTION

astubbs opened this issue · 3 comments

While running and publishing messages back to Kafka (pollAndProduce), sometimes a transaction error occurs in the log:

java.lang.IllegalStateException: Cannot call send in state COMMITTING_TRANSACTION

This is due to some error in the way transaction state is managed / monitored by the system.

Also, as reported by a user:

It doesn’t get stuck on Cannot call send in state COMMITTING_TRANSACTION, it just re-processes.
However, sometimes I get Invalid transition attempted from state READY to state COMMITTING_TRANSACTION on startup and then an infinite loop occurs. I’ve only seen this happening when I set max.poll.records very low.
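Both error messages come from checks on the producer's transaction state. To illustrate how they can arise, here is a minimal toy model. It is not Kafka's actual transaction manager: the class name, the reduced set of three states, and the transition table are assumptions made for illustration. Only the state names and message texts are taken from the errors quoted above.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Toy model only: a simplified stand-in for a producer's transaction state checks.
class ToyTxStateMachine {
    enum State { READY, IN_TRANSACTION, COMMITTING_TRANSACTION }

    // Assumed transition table: target state -> source states it may be entered from.
    private static final Map<State, Set<State>> VALID_SOURCES = Map.of(
            State.IN_TRANSACTION, EnumSet.of(State.READY),
            State.COMMITTING_TRANSACTION, EnumSet.of(State.IN_TRANSACTION),
            State.READY, EnumSet.of(State.COMMITTING_TRANSACTION));

    private State state = State.READY;

    private void transitionTo(State target) {
        if (!VALID_SOURCES.get(target).contains(state)) {
            throw new IllegalStateException(
                    "Invalid transition attempted from state " + state + " to state " + target);
        }
        state = target;
    }

    void beginTransaction() { transitionTo(State.IN_TRANSACTION); }

    void beginCommit() { transitionTo(State.COMMITTING_TRANSACTION); }

    void completeCommit() { transitionTo(State.READY); }

    // Sending is only legal while a transaction is open.
    void send(String record) {
        if (state != State.IN_TRANSACTION) {
            throw new IllegalStateException("Cannot call send in state " + state);
        }
    }

    State state() { return state; }

    public static void main(String[] args) {
        // Case 1: a worker sends after the commit has started but before it completes.
        ToyTxStateMachine tx = new ToyTxStateMachine();
        tx.beginTransaction();
        tx.beginCommit();
        try {
            tx.send("late record");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }

        // Case 2: a commit is attempted although no transaction was ever begun.
        ToyTxStateMachine fresh = new ToyTxStateMachine();
        try {
            fresh.beginCommit();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model, the first error appears when a worker thread calls send between beginCommit and completeCommit, and the second when a commit is attempted before any transaction has been begun. Whether these are the exact interleavings behind the reported errors is not established in this thread.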

I've contributed a small test-case that reproduces some of the issues we discussed on slack: JorgenRingen@9d5fd91

By tweaking the parallel-consumer and kafka-consumer settings I can reproduce Invalid transition attempted from state READY to state COMMITTING_TRANSACTION. Should happen on maxPolledRecords .

Also reproduced messagesProcessed > messagesProduced and Cannot call send in state COMMITTING_TRANSACTION (as the test is now).

I get somewhat different behavior by tweaking parallel-consumer options and the max.poll.records. It's not completely deterministic.
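For anyone trying to reproduce this, a rough sketch of how the consumer setting could be varied per run. The class name, helper method, broker address, and group id are hypothetical; only the max.poll.records key is the real Kafka consumer setting, whose default is 500.

```java
import java.util.Properties;

// Hypothetical helper for building consumer settings with a chosen max.poll.records.
class LowMaxPollConfigSketch {

    static Properties consumerProps(int maxPollRecords) {
        if (maxPollRecords < 1) {
            throw new IllegalArgumentException("maxPollRecords must be >= 1, got " + maxPollRecords);
        }
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumption: local test broker
        props.setProperty("group.id", "tx-repro-sketch");          // hypothetical group id
        props.setProperty("max.poll.records", Integer.toString(maxPollRecords));
        return props;
    }

    public static void main(String[] args) {
        // A very low value is the setting under which the startup error was observed.
        System.out.println(consumerProps(1));
    }
}
```

The resulting Properties would be passed to the Kafka consumer that the parallel-consumer wraps. Since the behavior is not deterministic, several runs per value are likely needed.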

Afternoon @JorgenRingen, I believe I have fixed the issue with transactions / commit state, and have now made transactions optional as well: https://github.com/confluentinc/parallel-consumer/pull/31/files

Take a look at the interface and let me know what you think - specifically this choice: https://github.com/confluentinc/parallel-consumer/pull/31/files#diff-12c3d1f966a5367ab47be49f7c0f11a9dcd4cf339b73acb092b8c44ae9c9a6e2R45

Evening @astubbs, great to see optional transactions. Adding the parameter and verifying it by throwing an IllegalArgumentException seems like a pragmatic and fair approach to me. However, if support for an optional producer is introduced, the parameter might be a little confusing. I don't have any immediate ideas on how to improve it. Maybe something "fluent style" (which might be total overkill), like ParallelConsumerOptions.with[outProducer|NonTransactionProducer|TransactionalProducer]().numberOfThreads(...).maxConcurrency(...)
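To make the "fluent style" idea concrete, here is a rough sketch of what such an entry point could look like. Everything in it is hypothetical: the class name FluentOptionsSketch, the ProducerMode enum, and the method names are invented for illustration and are not the library's actual ParallelConsumerOptions API.

```java
// Hypothetical sketch of a fluent options API; not the real ParallelConsumerOptions.
final class FluentOptionsSketch {
    enum ProducerMode { NONE, NON_TRANSACTIONAL, TRANSACTIONAL }

    private final ProducerMode producerMode;
    private int numberOfThreads = 1; // placeholder default for the sketch
    private int maxConcurrency = 1;  // placeholder default for the sketch

    private FluentOptionsSketch(ProducerMode producerMode) {
        this.producerMode = producerMode;
    }

    // The producer choice is made once, up front, through the factory method name.
    static FluentOptionsSketch withoutProducer() {
        return new FluentOptionsSketch(ProducerMode.NONE);
    }

    static FluentOptionsSketch withNonTransactionalProducer() {
        return new FluentOptionsSketch(ProducerMode.NON_TRANSACTIONAL);
    }

    static FluentOptionsSketch withTransactionalProducer() {
        return new FluentOptionsSketch(ProducerMode.TRANSACTIONAL);
    }

    FluentOptionsSketch numberOfThreads(int n) {
        if (n < 1) throw new IllegalArgumentException("numberOfThreads must be >= 1, got " + n);
        this.numberOfThreads = n;
        return this;
    }

    FluentOptionsSketch maxConcurrency(int n) {
        if (n < 1) throw new IllegalArgumentException("maxConcurrency must be >= 1, got " + n);
        this.maxConcurrency = n;
        return this;
    }

    ProducerMode producerMode() { return producerMode; }
    int numberOfThreads() { return numberOfThreads; }
    int maxConcurrency() { return maxConcurrency; }

    public static void main(String[] args) {
        FluentOptionsSketch options = FluentOptionsSketch.withTransactionalProducer()
                .numberOfThreads(16)
                .maxConcurrency(100);
        System.out.println(options.producerMode() + " threads=" + options.numberOfThreads());
    }
}
```

The trade-off is that the producer mode can no longer be combined inconsistently with other parameters, at the cost of three entry points instead of one constructor or builder.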

I actually found a bug in the test:
https://github.com/confluentinc/parallel-consumer/pull/31/files#diff-d6d31b42ae96a5e31f2793c52624720af3a084c434707a9f031969a7af1b4e14R96 <- this line would always override maxPollRecords instrumented by the tests.

I deleted the line, added a couple more tests, and improved the verification and error messages.
In my branch the following tests now fail fairly consistently:

  • io.confluent.parallelconsumer.examples.core.Bug25AppTest#testTransactionalLowMaxPoll (1) (infinite loop on every run)
  • io.confluent.parallelconsumer.examples.core.Bug25AppTest#testTransactionalDefaultMaxPoll (500) (infinite loop roughly 50% of the time)

Typically the infinite loop occurs when processed - produced = 16 (16 = default number of threads).

All non-tx tests pass, and the tx tests pass when running with maxPollRecords=10000.

Check out the updated test:
JorgenRingen@be57d92