Azure/azure-event-hubs-spark

Receive messages from Event Hub kafka timeout

Scarlettliuyc opened this issue · 3 comments


Feature Requests:

  • What issue are you trying to solve?
    When using Azure Databricks jobs with Spark to receive messages from Event Hub over the Kafka endpoint, we get the following error:
    WARN TaskSetManager: Lost task 5.0 in stage xxx (TID xxx) (10.124.0.6 executor 0): kafkashaded.org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition -partition id could be determined

WARN TaskSetManager: Lost task 6.3 in stage xxx (TID xxx) (10.124.0.13 executor 0): kafkashaded.org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition -partition id could be determined

  • How do you want to solve it?

  • Is there any configuration we could set for the SDK?

  • From this document: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

  • We have maxOffsetsPerTrigger = 650000, 9 partitions, and the reader configured as:
    readOptions = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
        .option("subscribe", TOPIC_DML)
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.jaas.config", EH_SASL_DML)
        .option("startingOffsets", startingOffsets)
        .option("kafka.request.timeout.ms", 60000)
        .option("kafka.session.timeout.ms", 30000)
        .option("failOnDataLoss", False)

  • What is your use case for this feature?
    The issue happens intermittently and we cannot determine the cause. Any suggestions for the settings? (See the sketch below.)
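Not a confirmed fix, just a hedged sketch of one thing to try, assuming the 60000 ms wait comes from the Kafka consumer's default.api.timeout.ms (the consumer setting behind the "position for partition" timeout): raise the client-side timeouts through the kafka.-prefixed options. BOOTSTRAP_SERVERS, TOPIC_DML, EH_SASL_DML and startingOffsets are the placeholders from the configuration above; the timeout values are assumptions, not recommendations from the connector maintainers.

    # Hedged sketch: the same reader with the client timeouts raised.
    # kafka.default.api.timeout.ms is the consumer setting that backs the
    # 60000 ms "position for partition" wait; the values here are assumptions.
    readOptions = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
        .option("subscribe", TOPIC_DML)
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.jaas.config", EH_SASL_DML)
        .option("startingOffsets", startingOffsets)
        .option("maxOffsetsPerTrigger", 650000)
        .option("kafka.request.timeout.ms", 120000)       # raised from 60000
        .option("kafka.default.api.timeout.ms", 180000)    # governs position() waits
        .option("kafka.session.timeout.ms", 30000)
        .option("failOnDataLoss", False)
    )
    df = readOptions.load()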
Bug Report:

  • Actual behavior

  • Kafka times out while resolving the offset positions for the partitions. It always happens after the streaming job has been running for 3-5 days.

  • Expected behavior

  • No timeout warnings.

I have exactly the same issue using the Kafka API of EventHub (but after some minutes to hours). Did you try to do the same with this connector, which uses the EventHub native API? (A minimal sketch of that is below.)
In my case, I was able to produce a network capture file of the traffic. We can observe that the broker does not answer the Kafka ListOffsetRequest that Spark uses to create a microbatch.
So the issue is probably a bug in the EventHub Kafka API. A ticket has been raised with support.
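For reference, a minimal sketch of reading the same hub through this repository's native connector instead of the Kafka endpoint, assuming PySpark and a recent connector version; the connection string is a placeholder, and EventHubsUtils.encrypt is the helper the connector's PySpark docs use so the string is not kept in plain text:

    # Hedged sketch of the native azure-event-hubs-spark reader (format "eventhubs").
    # connectionString is a placeholder; on recent connector versions the string
    # is encrypted with EventHubsUtils.encrypt before being passed in.
    connectionString = "Endpoint=sb://<namespace>.servicebus.windows.net/;...;EntityPath=<hub>"
    ehConf = {
        "eventhubs.connectionString":
            sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString),
        "eventhubs.consumerGroup": "$Default",
    }
    df = (
        spark.readStream
        .format("eventhubs")
        .options(**ehConf)
        .load()
    )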

@sbarnoud Can you keep us in the loop? I am also seeing the same thing.

So in our use case, Azure support identified that the root cause is not EventHub but our network setup. To summarize, we have a Hub & Spoke architecture where BGP announcements for some CIDRs were routing this part of the traffic to an ExpressRoute Gateway instead of the Azure Firewall.
Routing the traffic correctly to EventHub solved our issue.

Note: before that, we set the TCP keepalive on all VMs in order to avoid any "idle timeout" (like the one of the firewall); the values are below, and a quick way to verify them on the nodes is sketched after the list:

  • net.ipv4.tcp_keepalive_time = 180
  • net.ipv4.tcp_keepalive_intvl = 180
  • net.ipv4.tcp_keepalive_probes = 480
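A small sketch, assuming Linux nodes with /proc/sys readable, to confirm those keepalive values are actually in effect on each VM (the paths and keys are the standard kernel ones, nothing specific to this connector):

    # Hedged sketch: print the effective TCP keepalive settings on a node.
    from pathlib import Path

    for key in ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"):
        value = Path(f"/proc/sys/net/ipv4/{key}").read_text().strip()
        print(key, "=", value)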