Netflix/aegisthus

Deleted columns reach output in case of non-deleted rows.

rzvoncek opened this issue · 2 comments

Some versions of Cassandra (1.2 for sure, 1.1 possibly) initialize the "deleted at" field for rows in SSTables to Long.MIN_VALUE (the most negative value a Java long can hold).

When the CassReducer checks for deleted columns, it compares each column's deletion timestamp to the row's deletion timestamp. If the row has not been deleted, its "deleted at" is Long.MIN_VALUE, so the condition never holds: the signed, negative deletedAt is never larger than the column deletion time ts.

This way, deleted columns will find their way to the output.
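To make the comparison concrete, here is a minimal sketch of the kind of check described above. The class and method names are hypothetical and are not taken from the Aegisthus source; the only thing assumed from the report is that an undeleted row carries Long.MIN_VALUE as its "deleted at" value.

```java
// Hypothetical sketch; names are illustrative, not the actual CassReducer code.
final class RowDeletionCheck {
    // Sentinel that, per the report, some Cassandra versions write for "row not deleted".
    static final long NOT_DELETED = Long.MIN_VALUE;

    // True when a row-level deletion at rowDeletedAt is newer than the column's timestamp.
    static boolean shadowedByRow(long rowDeletedAt, long columnTimestamp) {
        return rowDeletedAt > columnTimestamp;
    }

    public static void main(String[] args) {
        long tombstoneTs = 1_370_000_000_000_000L; // arbitrary example timestamp

        // Row never deleted: Long.MIN_VALUE is not greater than any timestamp,
        // so the column (including a tombstone) is kept.
        System.out.println(shadowedByRow(NOT_DELETED, tombstoneTs)); // false

        // Row deleted after the column was written: the column is dropped.
        System.out.println(shadowedByRow(tombstoneTs + 1, tombstoneTs)); // true
    }
}
```

With this shape of check, only a real row deletion can remove a column; a column tombstone in a live row always passes through.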

We have to leave the tombstones in the output. You are right that the deleted columns will not be removed. But since we are updating rows in a completely eventually consistent way, we cannot guarantee that the tombstones aren't needed. In our case we use the PigLoader to filter out deleted columns when loading the data (although when debugging we load them as well), which keeps the downstream clear of tombstones.
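For anyone consuming the output without the PigLoader, the same idea can be sketched in plain Java: keep tombstones in the stored data and drop them only when reading. The Column type and its deleted flag below are assumptions made for illustration; they do not reflect the actual Aegisthus output format or the PigLoader implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of filtering tombstones at load time.
final class TombstoneFilter {
    // Illustrative column model; the real output format may differ.
    static final class Column {
        final String name;
        final long timestamp;
        final boolean deleted; // true when the column is a tombstone

        Column(String name, long timestamp, boolean deleted) {
            this.name = name;
            this.timestamp = timestamp;
            this.deleted = deleted;
        }
    }

    // Returns only the live columns; the input list is left untouched,
    // so a debugging run can still inspect the tombstones.
    static List<Column> liveColumns(List<Column> columns) {
        List<Column> live = new ArrayList<>();
        for (Column c : columns) {
            if (!c.deleted) {
                live.add(c);
            }
        }
        return live;
    }
}
```

Filtering at read time rather than in the reducer means the stored output still contains every tombstone, which matters when later updates may arrive out of order.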

So I don't think this is actually a problem. We could add an enhancement to optionally remove deleted columns, if that would make sense?

Closing this, as I believe Aegisthus now removes the deleted columns. See

If you don't feel this is sufficient, we can add an entry to the enhancements section of the README referencing this request.