Incorrect handling of Cassandra Snappy compressed SSTables.
danchia opened this issue · 5 comments
Aegisthus skips rows in compressed SSTables because it does not correctly track the SSTable file size.
The SSTableRecordReader tracks its position in the file to determine whether there are still records to read. Judging by SSTableReader, though, that position is measured in the uncompressed stream.
In AegSplit, after initializing the compressedInputStream, we don't adjust end to be the actual uncompressed stream length (available in CompressionMetadata), so the reader stops early and the remaining rows are never read.
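To illustrate the mismatch, here is a minimal sketch rather than the actual Aegisthus code. The class `SplitEndSketch`, the method `countRecords`, the fixed `RECORD_SIZE`, and the byte lengths are all hypothetical, and no real Snappy decompression is involved. It only shows what happens when a loop advances through uncompressed data but is bounded by the compressed file length.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not Aegisthus or Cassandra code.
final class SplitEndSketch {
    // Assumption: every record is a single 8-byte long.
    static final int RECORD_SIZE = 8;

    // Reads records while the position in the UNCOMPRESSED stream is below `end`.
    static int countRecords(byte[] uncompressed, long end) {
        ByteBuffer in = ByteBuffer.wrap(uncompressed);
        long pos = 0;
        int count = 0;
        while (pos < end && in.remaining() >= RECORD_SIZE) {
            in.getLong();          // consume one record
            pos += RECORD_SIZE;    // position advances in uncompressed bytes
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] uncompressed = new byte[10 * RECORD_SIZE]; // 10 records, 80 bytes
        long compressedLength = 32;                       // made-up on-disk size
        long uncompressedLength = uncompressed.length;    // what `end` should be

        // Bounded by the compressed size: stops after 4 of 10 records.
        System.out.println(countRecords(uncompressed, compressedLength));   // prints 4
        // Bounded by the uncompressed length: reads all 10 records.
        System.out.println(countRecords(uncompressed, uncompressedLength)); // prints 10
    }
}
```

In this sketch the fix amounts to passing the uncompressed length as `end`, which corresponds to taking the length from CompressionMetadata as described above.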
I sort of have a working patch, will submit in a PR in a few days.
Awesome. I wrote the compressed reader as a proof of concept when we were considering adopting them a couple of years ago. But we haven't actually adopted them, so I haven't had a chance to verify the data against any real cases.
Did this get merged?
The pull request was never made, so I fixed this in another pull request that is now merged. This commit for reference:
I believe this issue is closed and I am going to go ahead and mark it now. There is another project that delves further into compressed SSTables: https://github.com/fullcontact/hadoop-sstable which is probably even better. Netflix doesn't have a lot of compressed tables so we haven't had a need to optimize this use case.
Ack, I totally forgot to get around to making the PR..
I've looked through the referenced commit, and it should indeed fix the problem I reported.
Going to close this since it should be fixed now.