YotpoLtd/metorikku

Streaming Large Kafka Topic

Closed this issue · 2 comments

Hello,

I have a setup where the streaming input is Kafka and the output is Hudi writing to an S3 bucket.
When the topic contains a large dataset, I'm getting the following message:

This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.

This is what I have set for the streaming section in the config.yaml file:

```yaml
streaming:
  triggerMode: ProcessingTime
  triggerDuration: 20 seconds
  outputMode: append
  checkpointLocation: s3://some_bucket_name
  batchMode: true
```
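The poll-timeout message comes from the underlying Kafka consumer, and Spark's Kafka source forwards any option prefixed with `kafka.` straight to that consumer. A minimal sketch of tuning it through the input definition, assuming Metorikku passes entries under the input's `options` block through to Spark's Kafka source (the input name `my_topic` and broker address are placeholders):

```yaml
inputs:
  my_topic:
    kafka:
      servers:
        - broker:9092
      topic: some_topic
      options:
        # Forwarded to the Kafka consumer: allow more time between polls.
        kafka.max.poll.interval.ms: "600000"
        # Structured Streaming option: cap how many records each micro-batch reads.
        maxOffsetsPerTrigger: "10000"
```

Capping `maxOffsetsPerTrigger` keeps each micro-batch small enough to finish before the poll interval expires, which addresses the error from the processing side rather than only raising the timeout.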

I increased the max.poll.interval.ms property in the Kafka configs, but I was still getting the error above.
I also tried setting:
```yaml
extraOptions:
  maxRatePerPartition: 1000
  batchDuration: 10
```

Correct me if I'm wrong, but is this related to the batch size of the consumer? Is there a way to set a batch size in the config?

Thank you

I think I figured it out: I set maxOffsetsPerTrigger under the input options and used ProcessingTime as the triggerMode under streaming.
What is the difference between the ProcessingTime and Continuous triggerModes?
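For reference, the working combination described above might look like the following sketch (the input name, broker address, and offset cap are illustrative, not taken from the original setup):

```yaml
inputs:
  events:
    kafka:
      servers:
        - broker:9092
      topic: large_topic
      options:
        # Upper bound on records consumed per micro-batch.
        maxOffsetsPerTrigger: "50000"

streaming:
  triggerMode: ProcessingTime
  triggerDuration: 20 seconds
  outputMode: append
  checkpointLocation: s3://some_bucket_name
```

On the trigger question: in Spark Structured Streaming, ProcessingTime runs micro-batches at the given interval, while Continuous is Spark's experimental continuous-processing mode, which offers lower latency but supports only a restricted set of operations and sinks.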

Hi,

I have a Kafka topic with 20 partitions. Is there a way to configure Hudi to read from all partitions in parallel?

Thank you
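Not an authoritative answer, but for context: Spark's Kafka source maps each topic partition to a Spark task by default, so a 20-partition topic is already consumed in parallel as long as enough executor cores are available. If more parallelism is wanted, the source's `minPartitions` option (Spark 2.4+) can split topic partitions across additional tasks. A hypothetical sketch, again assuming Metorikku forwards input `options` to Spark's Kafka source:

```yaml
inputs:
  events:
    kafka:
      servers:
        - broker:9092
      topic: twenty_partition_topic
      options:
        # Ask Spark to split the 20 Kafka partitions into ~40 read tasks.
        minPartitions: "40"
```

Parallelism on the write side is then governed by Spark/Hudi write settings rather than by the Kafka input.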