pinterest/secor

Dropping last message or two before new parquet writer is created

jeremyplichtafc opened this issue · 2 comments

We are using the AvroMessageParser and AvroParquetFileReaderWriterFactory and have noticed that a very small number of messages are being dropped. On further investigation, the sequence numbers of the dropped messages are the ones immediately before (or sometimes two before) the starting offset of one of the files written to S3.

Ex:
If one of the files on S3 is named 1_1_00000000002329440769.gz.parquet (which I take to mean that the first message in that file came from partition 1 at offset 2329440769), then the message that was dropped was at offset 2329440768.
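
For anyone who wants to check their own output for the same pattern, here is a rough sketch of the comparison I am describing. It assumes the file name layout is `<generation>_<partition>_<zero-padded first offset>.<extension>`, which is my reading of the names and not something I have confirmed in the code. The helper names (`parse_file_name`, `missing_before_boundaries`) and the `seen_offsets` input are made up for illustration; you would have to build `seen_offsets` yourself by reading the offsets back out of the uploaded files.

```python
import re

# Assumed layout: <generation>_<partition>_<zero-padded first offset>.<extension>
FILE_NAME = re.compile(r"^(\d+)_(\d+)_(\d+)\.")


def parse_file_name(name):
    """Split a file name into (generation, partition, first_offset)."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name}")
    generation, partition, first_offset = (int(g) for g in match.groups())
    return generation, partition, first_offset


def missing_before_boundaries(file_names, seen_offsets, lookback=2):
    """Report offsets missing in the `lookback` slots before each file start.

    seen_offsets maps partition -> set of offsets actually found in the output.
    Returns a dict of partition -> list of missing offsets.
    """
    missing = {}
    for name in file_names:
        _, partition, first_offset = parse_file_name(name)
        seen = seen_offsets.get(partition, set())
        for offset in range(max(first_offset - lookback, 0), first_offset):
            if offset not in seen:
                missing.setdefault(partition, []).append(offset)
    return missing


if __name__ == "__main__":
    name = "1_1_00000000002329440769.gz.parquet"
    print(parse_file_name(name))
    # Offsets ...767 and ...769 were found in the output, ...768 was not.
    print(missing_before_boundaries([name], {1: {2329440767, 2329440769}}))
```

Note that the earliest file for a partition will also be flagged if nothing before its first offset was ever consumed, so that one should be ignored when reading the results.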

The previous file, which is where I would have expected that message to be, is well under our max file size param, so I think it is being finalized and written because it reached the max file age.

I will try to investigate more and see if I can write a unit test and figure out what is going on. If it turns out this is somehow related to our setup/config I'll add more detail here.

We are running a fairly recent version that we built off master: 359c8b8

Thanks,
Jeremy

Thanks for the tips on how to troubleshoot. I'll let you know what I find. And if there is an apparent fix I'll send a PR your way.