Same file published multiple times / duplicated
wellygee opened this issue · 13 comments
I managed to get the plugin working, however I am not sure why the same file keeps getting published multiple times before being moved. I was expecting the file to be processed once and then moved to the finished folder:
FixedWidthReader config
name=FsSourceConnector
connector.class=com.github.mmolimar.kafka.connect.fs.FsSourceConnector
tasks.max=1
fs.uris=/etc/kafka/spooldir/data
topic=fsc-orders
policy.class=com.github.mmolimar.kafka.connect.fs.policy.CronPolicy
policy.regexp=.*
policy.batch_size=0
policy.cleanup=move
policy.cleanup.move=/etc/kafka/spooldir/finished
policy.recursive=true
policy.cron.expression=0/2 * * * * ?
policy.regexp=^.*.txt$
file_reader.class=com.github.mmolimar.kafka.connect.fs.file.reader.FixedWidthFileReader
file_reader.batch_size=0
file_reader.delimited.settings.field_lengths=3,5,2
file_reader.delimited.settings.header=false
file_reader.delimited.settings.header_names=col1,col2,col3
file_reader.delimited.settings.schema=long,string,string
File entries:
895ZfffftP
933ffffZZP
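For readers following along, here is a minimal Python sketch of how the `field_lengths=3,5,2` setting would slice the two file entries above. It is only an illustration of fixed-width splitting, not the connector's actual code; the helper names `split_fixed_width` and `parse_record` are hypothetical, and the `int`/`str` casts are an assumed stand-in for `schema=long,string,string`.

```python
from itertools import accumulate

# Values copied from the config in this issue.
FIELD_LENGTHS = [3, 5, 2]
HEADER_NAMES = ["col1", "col2", "col3"]
# Assumption: Python casts standing in for schema=long,string,string.
CASTS = [int, str, str]


def split_fixed_width(line, lengths):
    """Cut a line into consecutive slices of the given widths."""
    ends = list(accumulate(lengths))
    starts = [0] + ends[:-1]
    return [line[start:end] for start, end in zip(starts, ends)]


def parse_record(line):
    """Map each slice to its column name and cast it to the assumed type."""
    parts = split_fixed_width(line, FIELD_LENGTHS)
    return {name: cast(part) for name, cast, part in zip(HEADER_NAMES, CASTS, parts)}


for entry in ["895ZfffftP", "933ffffZZP"]:
    print(parse_record(entry))
```

Each line of the file becomes one record, so a two-line file should yield two messages per read of the file.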
Hi @wellygee
Maybe it's because you didn't commit the offsets?
Hi @mmolimar - thanks for your response. How would I commit the offsets? I have tried polling more often than the offset.flush.interval.ms value, without success.
What value did you set for offset.flush.interval.ms?
Have you tried changing the cron expression, for instance, like every minute or so?
I have tried a few combinations. Currently I have offset.flush.interval.ms=10000 and policy.cron.expression=* 0/5 * * * ?, still no luck.
Don't you see in the Kafka Connect logs that the offsets are committed? If so, do you still get more and more messages?
How often do you get new messages in the topic?
From the cron expression I am expecting them to be pushed to the topic at most once every 5 minutes. I am obviously using the fixed-width file reader.
And how often do you see the executions and get new messages?
Once the source file has been picked up by the reader, I get all the file entries republished every second for close to 10 seconds. Only after 20 seconds does the file get moved to the done location and the messages stop publishing
So probably it's about the cron expression. Can you change it?
Thanks a lot @mmolimar - changing the cron expression to 1 0/1 * ? * * * achieved the behavior I was expecting.
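A rough sketch of why the three expressions in this thread behave so differently, assuming the policy reads them in Quartz-style order where the first field is seconds and the second is minutes. Under that assumption, `* 0/5 * * * ?` does not mean "once every 5 minutes" but "every second during every fifth minute". The functions `expand_field` and `fires_per_hour` are hypothetical helpers written for this illustration; they handle only the `*`, `?`, `a` and `a/b` forms used here and assume the hour and date fields are wildcards.

```python
def expand_field(field, lo, hi):
    """Expand one cron field into the list of values it matches.

    Supports only '*', '?', a single number, and 'start/step'.
    """
    if field in ("*", "?"):
        return list(range(lo, hi + 1))
    if "/" in field:
        start, step = field.split("/")
        first = lo if start == "*" else int(start)
        return list(range(first, hi + 1, int(step)))
    return [int(field)]


def fires_per_hour(expression):
    """Count firings in one hour, assuming seconds-first field order
    and wildcard hour/day/month fields."""
    seconds_field, minutes_field = expression.split()[:2]
    seconds = expand_field(seconds_field, 0, 59)
    minutes = expand_field(minutes_field, 0, 59)
    return len(seconds) * len(minutes)


for expr in ["0/2 * * * * ?", "* 0/5 * * * ?", "1 0/1 * ? * * *"]:
    print(expr, "->", fires_per_hour(expr), "firings per hour")
```

Under these assumptions the first expression fires every 2 seconds, the second fires 60 times in a row during each matching minute, and the last fires once a minute at second 1, which leaves time for the file to be moved before the next run.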
Great :-)
Can you close this issue please?