Azure/azure-event-hubs-spark

Receive messages from Event Hub kafka timeout

Scarlettliuyc opened this issue · 3 comments


Feature Requests:

  • What issue are you trying to solve?
    When using Azure Databricks jobs with Spark to receive messages from Event Hub over the Kafka endpoint, we get the following error:
    WARN TaskSetManager: Lost task 5.0 in stage xxx (TID xxx) (10.124.0.6 executor 0): kafkashaded.org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition -partition id could be determined

WARN TaskSetManager: Lost task 6.3 in stage xxx (TID xxx) (10.124.0.13 executor 0): kafkashaded.org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition -partition id could be determined

  • How do you want to solve it?

  • Is there any configuration we could set for the SDK?

  • From this document: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html

  • We have maxOffsetsPerTrigger = 650000, 9 partitions, and the reader configured as:
    readOptions = spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
        .option("subscribe", TOPIC_DML)
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.jaas.config", EH_SASL_DML)
        .option("startingOffsets", startingOffsets)
        .option("kafka.request.timeout.ms", 60000)
        .option("kafka.session.timeout.ms", 30000)
        .option("failOnDataLoss", False)

  • What is your use case for this feature?
    The issue happens intermittently and we cannot determine the cause. Any suggestions for the settings? (See the sketch below.)
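Not a confirmed fix, just a hedged sketch of one thing to try, assuming the 60000 ms wait comes from the Kafka consumer's default.api.timeout.ms (the consumer setting behind the "position for partition" timeout): raise the client-side timeouts through the kafka.-prefixed options. BOOTSTRAP_SERVERS, TOPIC_DML, EH_SASL_DML and startingOffsets are the placeholders from the configuration above; the timeout values are assumptions, not recommendations from the connector maintainers.

    # Hedged sketch: the same reader with the client timeouts raised.
    # kafka.default.api.timeout.ms is the consumer setting that backs the
    # 60000 ms "position for partition" wait; the values here are assumptions.
    readOptions = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
        .option("subscribe", TOPIC_DML)
        .option("kafka.sasl.mechanism", "PLAIN")
        .option("kafka.security.protocol", "SASL_SSL")
        .option("kafka.sasl.jaas.config", EH_SASL_DML)
        .option("startingOffsets", startingOffsets)
        .option("maxOffsetsPerTrigger", 650000)
        .option("kafka.request.timeout.ms", 120000)       # raised from 60000
        .option("kafka.default.api.timeout.ms", 180000)    # governs position() waits
        .option("kafka.session.timeout.ms", 30000)
        .option("failOnDataLoss", False)
    )
    df = readOptions.load()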
Bug Report:

  • Actual behavior

  • Kafka times out while resolving the offset positions for the partitions. It always happens after the streaming job has been running for 3-5 days.

  • Expected behavior

  • No timeout warnings.

I have exactly the same issue using the Kafka API of EventHub (but after some minutes to hours). Did you try to do the same with this connector, which uses the EventHub native API? (A minimal sketch of that is below.)
In my case, I was able to produce a network capture file of the traffic. We can observe that the broker does not answer the Kafka ListOffsetRequest that Spark uses to create a microbatch.
So the issue is probably a bug in the EventHub Kafka API. A ticket has been raised with support.
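For reference, a minimal sketch of reading the same hub through this repository's native connector instead of the Kafka endpoint, assuming PySpark and a recent connector version; the connection string is a placeholder, and EventHubsUtils.encrypt is the helper the connector's PySpark docs use so the string is not kept in plain text:

    # Hedged sketch of the native azure-event-hubs-spark reader (format "eventhubs").
    # connectionString is a placeholder; on recent connector versions the string
    # is encrypted with EventHubsUtils.encrypt before being passed in.
    connectionString = "Endpoint=sb://<namespace>.servicebus.windows.net/;...;EntityPath=<hub>"
    ehConf = {
        "eventhubs.connectionString":
            sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString),
        "eventhubs.consumerGroup": "$Default",
    }
    df = (
        spark.readStream
        .format("eventhubs")
        .options(**ehConf)
        .load()
    )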

@sbarnoud Can you keep us in the loop? I am also seeing the same thing.

So in our use case, Azure support identified that the root cause is not EventHub but our network setup. To summarize, we have a Hub & Spoke architecture where BGP announcements for some CIDRs were routing this part of the traffic to an ExpressRoute Gateway instead of the Azure Firewall.
Routing the traffic correctly to EventHub solved our issue.

Note: before that, we set the TCP keepalive on all VMs in order to avoid any "idle timeout" (like the one of the firewall); the values are below, and a quick way to verify them on the nodes is sketched after the list:

  • net.ipv4.tcp_keepalive_time = 180
  • net.ipv4.tcp_keepalive_intvl = 180
  • net.ipv4.tcp_keepalive_probes = 480
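A small sketch, assuming Linux nodes with /proc/sys readable, to confirm those keepalive values are actually in effect on each VM (the paths and keys are the standard kernel ones, nothing specific to this connector):

    # Hedged sketch: print the effective TCP keepalive settings on a node.
    from pathlib import Path

    for key in ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"):
        value = Path(f"/proc/sys/net/ipv4/{key}").read_text().strip()
        print(key, "=", value)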