AutoMQ/automq

[E2E] group_mode_transactions_test failed with InvalidProducerEpochException


test_id: tests/kafkatest/tests/core/group_mode_transactions_test.py::GroupModeTransactionsTest.test_transactions@{"failure_mode":"hard_bounce","bounce_target":"brokers"}

client error:

[2024-02-06 16:19:16,982] DEBUG [Producer clientId=producer-copier-1, transactionalId=copier-1] Sending transactional request TxnOffsetCommitRequestData(transactionalId='copier-1', groupId='grouped-transactions-test-consumer-group', producerId=1001, producerEpoch=0, generationId=4, memberId='consumer-grouped-transactions-test-consumer-group-1-0604a14d-04b4-45cb-8901-3d20d50ee384', groupInstanceId=null, topics=[TxnOffsetCommitRequestTopic(name='input-topic', partitions=[TxnOffsetCommitRequestPartition(partitionIndex=4, committedOffset=11111, committedLeaderEpoch=-1, committedMetadata=''), TxnOffsetCommitRequestPartition(partitionIndex=2, committedOffset=12750, committedLeaderEpoch=-1, committedMetadata=''), TxnOffsetCommitRequestPartition(partitionIndex=3, committedOffset=17389, committedLeaderEpoch=-1, committedMetadata=''), TxnOffsetCommitRequestPartition(partitionIndex=0, committedOffset=10639, committedLeaderEpoch=-1, committedMetadata=''), TxnOffsetCommitRequestPartition(partitionIndex=1, committedOffset=11111, committedLeaderEpoch=-1, committedMetadata='')])]) to node ducker08:9092 (id: 2 rack: null) with correlation ID 123537 (org.apache.kafka.clients.producer.internals.Sender)
[2024-02-06 16:19:17,002] TRACE [Producer clientId=producer-copier-1, transactionalId=copier-1] Received transactional response TxnOffsetCommitResponseData(throttleTimeMs=0, topics=[TxnOffsetCommitResponseTopic(name='input-topic', partitions=[TxnOffsetCommitResponsePartition(partitionIndex=4, errorCode=47), TxnOffsetCommitResponsePartition(partitionIndex=2, errorCode=47), TxnOffsetCommitResponsePartition(partitionIndex=3, errorCode=47), TxnOffsetCommitResponsePartition(partitionIndex=0, errorCode=47), TxnOffsetCommitResponsePartition(partitionIndex=1, errorCode=47)])]) for request TxnOffsetCommitRequestData(transactionalId='copier-1', groupId='grouped-transactions-test-consumer-group', producerId=1001, producerEpoch=0, generationId=4, memberId='consumer-grouped-transactions-test-consumer-group-1-0604a14d-04b4-45cb-8901-3d20d50ee384', groupInstanceId=null, topics=[TxnOffsetCommitRequestTopic(name='input-topic', partitions=[TxnOffsetCommitRequestPartition(partitionIndex=4, committedOffset=11111, committedLeaderEpoch=-1, committedMetadata=''), TxnOffsetCommitRequestPartition(partitionIndex=2, committedOffset=12750, committedLeaderEpoch=-1, committedMetadata=''), TxnOffsetCommitRequestPartition(partitionIndex=3, committedOffset=17389, committedLeaderEpoch=-1, committedMetadata=''), TxnOffsetCommitRequestPartition(partitionIndex=0, committedOffset=10639, committedLeaderEpoch=-1, committedMetadata=''), TxnOffsetCommitRequestPartition(partitionIndex=1, committedOffset=11111, committedLeaderEpoch=-1, committedMetadata='')])]) (org.apache.kafka.clients.producer.internals.TransactionManager)
[2024-02-06 16:19:17,002] DEBUG [Producer clientId=producer-copier-1, transactionalId=copier-1] Received TxnOffsetCommit response for consumer group grouped-transactions-test-consumer-group: {input-topic-4=INVALID_PRODUCER_EPOCH, input-topic-2=INVALID_PRODUCER_EPOCH, input-topic-3=INVALID_PRODUCER_EPOCH, input-topic-0=INVALID_PRODUCER_EPOCH, input-topic-1=INVALID_PRODUCER_EPOCH} (org.apache.kafka.clients.producer.internals.TransactionManager)
[2024-02-06 16:19:17,002] INFO [Producer clientId=producer-copier-1, transactionalId=copier-1] Transiting to fatal error state due to org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch. (org.apache.kafka.clients.producer.internals.TransactionManager)
[2024-02-06 16:19:17,002] DEBUG [Producer clientId=producer-copier-1, transactionalId=copier-1] Transition from state IN_TRANSACTION to error state FATAL_ERROR (org.apache.kafka.clients.producer.internals.TransactionManager)
org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.
[2024-02-06 16:19:17,002] ERROR [Producer clientId=producer-copier-1, transactionalId=copier-1] Aborting producer batches due to fatal error (org.apache.kafka.clients.producer.internals.Sender)
org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.
[2024-02-06 16:19:17,003] TRACE Aborting batch for partition output-topic-2 (org.apache.kafka.clients.producer.internals.ProducerBatch)
org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.

server error:

[2024-02-06 16:19:16,995] ERROR [ReplicaManager broker=2] Error processing append operation on partition __consumer_offsets-2 (kafka.server.ReplicaManager)
org.apache.kafka.common.errors.InvalidProducerEpochException: Epoch of producer 1001 at offset 1125 in __consumer_offsets-2 is 0, which is smaller than the last seen epoch 1

kafka-request.log:

[2024-02-06 16:19:16,916] DEBUG Completed request:{"isForwarded":false,"requestHeader":{"requestApiKey":27,"requestApiVersion":1,"correlationId":169913,"clientId":"broker-1-txn-marker-sender","requestApiKeyName":"WRITE_TXN_MARKERS"},"request":{"markers":[{"producerId":1001,"producerEpoch":1,"transactionResult":false,"topics":[{"name":"__consumer_offsets","partitionIndexes":[2]}],"coordinatorEpoch":2}]},"response":{"markers":[{"producerId":1001,"topics":[{"name":"__consumer_offsets","partitions":[{"partitionIndex":2,"errorCode":0}]}]}]},"connection":"10.5.0.10:9092-10.5.0.9:53844-1","totalTimeMs":41.607,"requestQueueTimeMs":0.013,"localTimeMs":41.484,"remoteTimeMs":0.0,"throttleTimeMs":0,"responseQueueTimeMs":0.039,"sendTimeMs":0.07,"securityProtocol":"PLAINTEXT","principal":"User:ANONYMOUS","listener":"PLAINTEXT","clientInformation":{"softwareName":"unknown","softwareVersion":"unknown"}} (kafka.request.logger)

What happened

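The transaction coordinator fences a producer whose transaction has timed out. It dispatches a WRITE_TXN_MARKERS request (an abort marker, transactionResult=false) to the affected partitions, carrying the producer's epoch bumped from 0 to 1; in this run the marker was written to __consumer_offsets-2 and the request completed at 16:19:16,916. When the producer then sends its TxnOffsetCommit request still carrying epoch 0, the append to __consumer_offsets-2 is rejected with InvalidProducerEpochException, and the client transitions from IN_TRANSACTION to FATAL_ERROR.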
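
The sequence can be replayed with a toy model of per-partition epoch tracking. This is only a sketch of the fencing rule visible in the server log, not the broker's actual code: the class and function names are hypothetical, and the offsets other than 1125 are made up for illustration.

```python
class InvalidProducerEpoch(Exception):
    """Raised when an append carries an epoch older than the last one seen."""


class PartitionProducerState:
    """Toy model (hypothetical) of the epoch a partition remembers per producer."""

    def __init__(self):
        self.last_epoch = {}  # producer_id -> last seen epoch

    def append(self, producer_id, epoch, offset):
        last = self.last_epoch.get(producer_id, -1)
        if epoch < last:
            raise InvalidProducerEpoch(
                f"Epoch of producer {producer_id} at offset {offset} is {epoch}, "
                f"which is smaller than the last seen epoch {last}"
            )
        self.last_epoch[producer_id] = epoch


def replay():
    """Replay the order of events seen on __consumer_offsets-2 in this issue."""
    state = PartitionProducerState()
    # Producer 1001 writes with epoch 0 (offset is illustrative).
    state.append(1001, 0, 1000)
    # Abort marker from WRITE_TXN_MARKERS carries epoch 1 (offset is illustrative).
    state.append(1001, 1, 1124)
    try:
        # TxnOffsetCommit from the fenced producer still carries epoch 0.
        state.append(1001, 0, 1125)
    except InvalidProducerEpoch as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(replay())
```

In this model the third append fails for the same reason the broker reports: the partition has already seen epoch 1 for producer 1001, so any later write with epoch 0 is treated as coming from a fenced producer.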