Azure/azure-event-hubs-spark

ReceiverDisconnectedException even if using different consumer groups

HaowenZhangBD opened this issue · 1 comment

Hi team, we have seen the ReceiverDisconnectedException in our Databricks environment and done some research.
We found that other people had a similar problem and solved it by following these two docs:

https://github.com/Azure/azure-event-hubs-spark/blob/master/FAQ.md
https://github.com/Azure/azure-event-hubs-spark/blob/master/examples/multiple-readers-example.md

We have read through them and followed the suggestion of using a different consumer group for each stream.
However, we still get ReceiverDisconnectedException on both streams at roughly the same timestamp.
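For reference, this is a minimal sketch of how we set up the two streams, following the multiple-readers guidance. The connection string and variable names are placeholders; the event hub name (`publisher-events-eh`) and consumer groups (`job1`, `machine2`) are the ones appearing in the error logs below:

```scala
import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf}

// Placeholder connection string; `spark` is the ambient SparkSession on Databricks.
val connStr = ConnectionStringBuilder("Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=...")
  .setEventHubName("publisher-events-eh")
  .build

// Stream 1: dedicated consumer group "job1"
val ehConf1 = EventHubsConf(connStr).setConsumerGroup("job1")
val stream1 = spark.readStream
  .format("eventhubs")
  .options(ehConf1.toMap)
  .load()

// Stream 2: dedicated consumer group "machine2"
val ehConf2 = EventHubsConf(connStr).setConsumerGroup("machine2")
val stream2 = spark.readStream
  .format("eventhubs")
  .options(ehConf2.toMap)
  .load()
```

Each stream uses its own EventHubsConf with a distinct consumer group, which per the FAQ should prevent the two readers from stealing each other's epoch receivers on the same partition.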

Bug Report:

  • Actual behavior

stream 1 using PATH: publisher-events-eh/ConsumerGroups/job1/Partitions/0

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5065.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5065.0 (TID 91438) (10.139.64.4 executor driver): java.util.concurrent.CompletionException: com.microsoft.azure.eventhubs.ReceiverDisconnectedException: New receiver 'spark-driver-87' with higher epoch of '0' is created hence current receiver 'spark-driver-87' with epoch '0' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used. TrackingId:581a6d040004c849000eef7c64ddd416_G27_B39, SystemTracker:OUR EVENTHUB:publisher-events-eh~1023|job1, Timestamp:2023-08-17T08:02:35, errorContext[NS: OUR EVENTHUB, PATH: publisher-events-eh/ConsumerGroups/job1/Partitions/0, REFERENCE_ID: LN_a37906_1692259345344_1af_G27, PREFETCH_COUNT: 500, LINK_CREDIT: 1000, PREFETCH_Q_LEN: 0]

stream 2 using PATH: publisher-events-eh/ConsumerGroups/machine2/Partitions/0

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5069.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5069.0 (TID 91503) (10.139.64.4 executor driver): java.util.concurrent.CompletionException: com.microsoft.azure.eventhubs.ReceiverDisconnectedException: New receiver 'spark-driver-315' with higher epoch of '0' is created hence current receiver 'spark-driver-315' with epoch '0' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used. TrackingId:581a6d040006c849000eef5c64ddd416_G2_B39, SystemTracker:OUR EVENTHUB:publisher-events-eh~1023|machine2, Timestamp:2023-08-17T08:02:35, errorContext[NS: OUR EVENTHUB, PATH: publisher-events-eh/ConsumerGroups/machine2/Partitions/0, REFERENCE_ID: LN_190e6e_1692259345190_e97a_G2, PREFETCH_COUNT: 500, LINK_CREDIT: 1000, PREFETCH_Q_LEN: 0]

  • Expected behavior : no ReceiverDisconnectedException
  • spark-eventhubs artifactId and version : com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.22
  • Spark version : see attached screenshot

Maybe worth mentioning: another environment, applying the same code change, did not hit ReceiverDisconnectedException after running for around 1 day.