Investigate using parquet bloom filter to reduce size on disk
rom1504 opened this issue · 10 comments
https://github.com/apache/parquet-format/blob/master/BloomFilter.md
Could it help with deduplication?
The current strategy requires storing about 2TB per 100k WAT files. Could it be made better using this?
Whether the Parquet bloom filter fits this use case is unclear.
The Spark one might work: https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/util/sketch/BloomFilter.html
The algo could be (a code sketch follows the list):
- compute part 1
- count the items in part 1
- compute a bloom filter of part 1
- compute part 2 while discarding items already present in the bloom filter
- count the items in parts 1 and 2
- build a bloom filter of part 2 and merge it into the bloom filter of part 1
- do the same as for part 2 for part 3, and so on
- merge all parts
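A toy, stdlib-only sketch of the part-by-part idea above. Everything here (`BloomFilter`, `dedup_parts`, `expected_items_per_part`) is illustrative rather than the project's actual code; a real run would use Spark's org.apache.spark.util.sketch.BloomFilter or a dedicated library, and within-part dedup is assumed to be handled elsewhere (e.g. a per-part distinct).

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items: int, fp_rate: float):
        # optimal number of bits: m = -n * ln(p) / (ln 2)^2
        self.num_bits = max(8, int(-expected_items * math.log(fp_rate) / math.log(2) ** 2))
        # optimal number of hash functions: k = m / n * ln(2)
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray(self.num_bits // 8 + 1)

    def _positions(self, item: str):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    def merge(self, other: "BloomFilter") -> None:
        # bitwise OR; only valid when both filters use the same size and hash count
        self.bits = bytearray(a | b for a, b in zip(self.bits, other.bits))


def dedup_parts(parts, expected_items_per_part=100_000_000, fp_rate=0.03):
    """Process parts in order, discarding items probably seen in earlier parts."""
    seen = None
    total_kept = 0
    for i, part in enumerate(parts, start=1):
        current = BloomFilter(expected_items_per_part, fp_rate)
        kept = []
        for item in part:
            if seen is not None and seen.might_contain(item):
                continue  # probable duplicate from an earlier part: drop it
            current.add(item)
            kept.append(item)
        total_kept += len(kept)
        print(f"after part {i}: {total_kept} items kept so far")
        # merge this part's filter into the running filter used for later parts
        if seen is None:
            seen = current
        else:
            seen.merge(current)
        yield kept
```

Since the merge is a plain bitwise OR, every part's filter has to be built with the same expected size and false-positive rate, which is why `expected_items_per_part` is fixed up front in this sketch.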
This decreases the space required for merging to only the unique items.
Merging may also become an optional step in that scenario, since the collection would already be unique thanks to the bloom filter. Random shuffling can be deferred to another job, potentially running on a different cluster.
A bloom filter for 100B items with a 3% false-positive rate is about 90GB.
This option is possible assuming enough RAM.
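As a sanity check on the 90GB figure, the standard sizing formula m = -n·ln(p)/(ln 2)² with n = 100 billion items and p = 0.03 gives roughly 91GB:

```python
import math

n = 100e9  # expected number of items (100 billion)
p = 0.03   # target false-positive rate

bits = -n * math.log(p) / math.log(2) ** 2
gigabytes = bits / 8 / 1e9
print(f"{gigabytes:.0f} GB")  # roughly 91 GB
```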
10% of CC succeeded with the current code and 10TB of total NVMe. The end result is 3TB of parquet (maybe 5TB uncompressed).
That seems to suggest that the whole of CC would most likely succeed with 100TB of total NVMe, e.g. 100 instances.
Let's try to figure out whether a bloom filter or different merging strategies could help reduce this.
Actually, the best path here is simply to get more local space. AWS has instances with 8TB local drives.
Still, is the bloom filter available in Python?
The answer is no.