vmware-archive/hillview

Redundant Data Loaded from reading SSTable

Closed · 1 comment

The flights dataset should contain around 140M rows, but Hillview + Cassandra loads 315M rows. I will post more updates after further testing. With a smaller dataset (2M), the loaded row count is correct. I suspect the duplication comes from inconsistent dataLocality results from SSTableUtil, which each worker runs locally.
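
One way to check this suspicion is to compare the total number of rows loaded across workers with the number of distinct row keys, and to list the keys that more than one worker loaded. The snippet below is only a minimal sketch of that check: `duplication_report`, the worker names, and the integer keys are all hypothetical and are not part of Hillview or SSTableUtil.

```python
from collections import Counter


def duplication_report(rows_by_worker):
    """Summarize overlap between the row keys loaded by each worker.

    rows_by_worker: dict mapping a (hypothetical) worker name to the list
    of row keys that worker loaded from its local SSTables.
    """
    owners = Counter()   # key -> number of workers that loaded it
    total_loaded = 0     # rows loaded across all workers, duplicates included
    for keys in rows_by_worker.values():
        keys = list(keys)
        total_loaded += len(keys)
        owners.update(set(keys))  # count each key at most once per worker
    shared = sorted(k for k, n in owners.items() if n > 1)
    return {
        "total_loaded": total_loaded,
        "distinct": len(owners),
        "shared_keys": shared,
    }


# Toy data: two workers whose local views overlap on keys 3 and 4.
report = duplication_report({
    "worker-a": [1, 2, 3, 4],
    "worker-b": [3, 4, 5, 6],
})
print(report)
```

If `total_loaded` is much larger than `distinct` and `shared_keys` is non-empty, the workers are reading overlapping data, which would be consistent with the inflated row count above.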

The solution seems to be to dump a snapshot of the data before loading it.