rom1504/cc2dataset

Investigate using parquet bloom filter to reduce size on disk

rom1504 opened this issue · 10 comments

https://github.com/apache/parquet-format/blob/master/BloomFilter.md

Could it help with deduplication?
The current strategy requires storing about 2TB per 100k WAT files. Could that be reduced using this?

The parquet one is unclear.

The Spark one might work: https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/util/sketch/BloomFilter.html
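There is no public PySpark wrapper for that sketch, but in principle the JVM-side `DataFrameStatFunctions.bloomFilter` can be reached through py4j. A hedged sketch, using a made-up path and column name and going through PySpark's internal `_jdf` handle (not something cc2dataset does today):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("part1.parquet")  # hypothetical input

# Call the JVM DataFrameStatFunctions.bloomFilter(colName, expectedNumItems, fpp)
# through the py4j gateway, since there is no PySpark API for it.
jbf = df._jdf.stat().bloomFilter("url", 100_000_000, 0.03)

print(jbf.bitSize() / 8 / 1e9, "GB")            # in-memory size of the filter
print(jbf.mightContain("http://example.com"))   # probabilistic membership test
```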

The algorithm could be:

  • Compute part 1
  • Count the items in part 1
  • Compute the bloom filter of part 1
  • Compute part 2 while discarding items already in the bloom filter
  • Count the items in parts 1 and 2
  • Build the bloom filter of part 2 and merge it into the bloom filter of part 1
  • Same as part 2 for part 3, and so on
  • Merge all parts

This decreases the space required for merging to only the unique items; a rough sketch of the loop is given below.
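A minimal sketch of that loop, assuming a hypothetical `make_filter()` factory that returns a bloom-filter-like object with `put`, `might_contain` and `merge` methods (loosely mirroring the Spark sketch interface; this is not existing cc2dataset code):

```python
def dedup_parts(parts, make_filter):
    """Yield, for each part, the items not already seen in earlier parts.

    `make_filter()` is a hypothetical factory returning an object with
    `put(item)`, `might_contain(item)` and `merge(other)`.
    """
    seen = make_filter()                     # union of the filters of all earlier parts
    for part in parts:
        current = make_filter()              # bloom filter of this part only
        kept = []
        for item in part:
            if seen.might_contain(item):     # probably a duplicate from an earlier part
                continue
            current.put(item)
            kept.append(item)
        seen.merge(current)                  # "merge into the bloom filter of earlier parts"
        yield kept                           # uniques of this part, ready to write out
```

Each yielded batch then only contains items not seen in earlier parts (up to the false positive rate), so a final merge only ever handles uniques.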

Merging may also become an optional step in that scenario, since the collection would already be deduplicated thanks to the bloom filter. Random shuffling can be left to another job, potentially running on a different cluster.

A bloom filter for 100B items with a 3% false positive rate is about 90GB.
This option is feasible assuming enough RAM.
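For reference, the usual sizing formula for an optimal bloom filter, bits = -n * ln(p) / (ln 2)^2, lands at roughly that figure:

```python
import math

def bloom_filter_size_gb(n_items: float, fp_rate: float) -> float:
    """Optimal bloom filter size in GB for n_items at the given false positive rate."""
    bits = -n_items * math.log(fp_rate) / (math.log(2) ** 2)
    return bits / 8 / 1e9

print(bloom_filter_size_gb(100e9, 0.03))  # ~91 GB
```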

10% of CC succeeded with the current code and 10TB of total NVMe. The end result is 3TB of parquet (maybe 5TB uncompressed).
That seems to say that, most likely, the whole of CC would succeed with 100TB of total NVMe, so e.g. 100 instances.
Let's try to figure out whether a bloom filter or a different merging strategy could help reduce this.

Actually, the best path here is simply to get more local space. AWS has instances with 8TB of local drives.

Still, is the bloom filter available in Python?

The answer is no.

No, let's do #18 instead.