mmolimar/kafka-connect-fs

Same file published multiple times / duplicated

wellygee opened this issue · 13 comments

I managed to get the plugin working; however, I am not sure why the same file keeps getting published multiple times before being moved. I was expecting that the file would be processed once and then moved to the finished folder:

FixedWidthReader config

name=FsSourceConnector
connector.class=com.github.mmolimar.kafka.connect.fs.FsSourceConnector
tasks.max=1
fs.uris=/etc/kafka/spooldir/data
topic=fsc-orders
policy.class=com.github.mmolimar.kafka.connect.fs.policy.CronPolicy
policy.regexp=.*
policy.batch_size=0
policy.cleanup=move
policy.cleanup.move=/etc/kafka/spooldir/finished
policy.recursive=true
policy.cron.expression=0/2 * * * * ?
policy.regexp=^.*.txt$
file_reader.class=com.github.mmolimar.kafka.connect.fs.file.reader.FixedWidthFileReader
file_reader.batch_size=0
file_reader.delimited.settings.field_lengths=3,5,2
file_reader.delimited.settings.header=false
file_reader.delimited.settings.header_names=col1,col2,col3
file_reader.delimited.settings.schema=long,string,string

File entries:

895ZfffftP
933ffffZZP

Hi @mmolimar - thanks for your response. How would I commit the offsets? I have tried to poll more often than the offset.flush.interval.ms value, without success.

What value did you set for offset.flush.interval.ms?

Have you tried changing the cron expression, for instance, like every minute or so?

I have tried a few combinations. Currently I have offset.flush.interval.ms=10000 and policy.cron.expression=* 0/5 * * * ?, still no luck

Don't you see in the Kafka Connect logs that the offsets are committed? If so, do you still get more and more messages?

Yes, I do see more and more messages, even though the offsets are being committed:

image

How often do you get new messages in the topic?

From the cron expression, I am expecting them to be pushed to the topic at most once every 5 minutes. I am obviously using the fixed-width file reader.

And how often do you see the executions and get new messages?

Once the source file has been picked up by the reader, I get all the file entries republished every second for close to 10 seconds. Only after 20 seconds does the file get moved to the done location and the messages stop publishing

So probably it's about the cron expression. Can you change it?

Thanks a lot @mmolimar - changing the cron expression to 1 0/1 * ? * * * achieved the behavior I was expecting
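
For anyone hitting the same symptom, here is a rough way to see why the earlier expressions fired so often. It assumes the expressions are Quartz-style, where the first field is seconds and the second is minutes. Under that assumption, `* 0/5 * * * ?` matches every second of every fifth minute rather than once every five minutes, which would be consistent with the file being republished every second until it was moved. The Python sketch below is not part of the connector: `expand` and `fires_per_hour` are hypothetical helpers that only understand `*`, `N`, `A/B` and `*/B`, and they assume all fields after the minutes are wildcards, as in the three expressions from this thread.

```python
def expand(field, upper):
    """Expand one cron field into the values in [0, upper) that it matches.

    Simplified on purpose: handles only '*', 'N', 'A/B' and '*/B'.
    Ranges ('1-5') and lists ('1,2') are not covered.
    """
    if field == "*":
        return list(range(upper))
    if "/" in field:
        start, step = field.split("/")
        first = 0 if start == "*" else int(start)
        return list(range(first, upper, int(step)))
    return [int(field)]


def fires_per_hour(expression):
    """Count trigger times in one hour using only the first two fields.

    Assumes field 1 is seconds and field 2 is minutes (Quartz-style order)
    and that the hour, day, and month fields do not restrict anything.
    """
    seconds, minutes = expression.split()[:2]
    return len(expand(seconds, 60)) * len(expand(minutes, 60))


if __name__ == "__main__":
    # The three expressions tried in this thread.
    for expr in ("0/2 * * * * ?", "* 0/5 * * * ?", "1 0/1 * ? * * *"):
        print(f"{expr!r}: {fires_per_hour(expr)} executions per hour")
```

With these assumptions, the first expression gives 1800 executions per hour (every 2 seconds), the second gives 720 (60 consecutive seconds in each of 12 minutes), and the last one gives 60 (second 1 of every minute).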

Great :-)

Can you close this issue please?