rom1504/cc2dataset

Investigate using parquet bloom filter to reduce size on disk

rom1504 opened this issue · 10 comments

https://github.com/apache/parquet-format/blob/master/BloomFilter.md

Could it help with deduplication?
The current strategy requires storing about 2TB per 100k WAT files. Could that be reduced using this?

The parquet one is unclear.

The Spark one might work: https://spark.apache.org/docs/3.2.0/api/java/org/apache/spark/util/sketch/BloomFilter.html
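There is no public PySpark wrapper for that sketch, but in principle the JVM-side `DataFrameStatFunctions.bloomFilter` can be reached through py4j. A hedged sketch, using a made-up path and column name and going through PySpark's internal `_jdf` handle (not something cc2dataset does today):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("part1.parquet")  # hypothetical input

# Call the JVM DataFrameStatFunctions.bloomFilter(colName, expectedNumItems, fpp)
# through the py4j gateway, since there is no PySpark API for it.
jbf = df._jdf.stat().bloomFilter("url", 100_000_000, 0.03)

print(jbf.bitSize() / 8 / 1e9, "GB")            # in-memory size of the filter
print(jbf.mightContain("http://example.com"))   # probabilistic membership test
```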

The algorithm could be:

  • Compute part 1
  • Count the items in part 1
  • Compute the bloom filter of part 1
  • Compute part 2 while discarding items already in the bloom filter
  • Count the items in parts 1 and 2
  • Build the bloom filter of part 2 and merge it into the bloom filter of part 1
  • Same as part 2 for part 3, and so on
  • Merge all parts

This decreases the space required for merging to only the unique items; a rough sketch of the loop is given below.
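A minimal sketch of that loop, assuming a hypothetical `make_filter()` factory that returns a bloom-filter-like object with `put`, `might_contain` and `merge` methods (loosely mirroring the Spark sketch interface; this is not existing cc2dataset code):

```python
def dedup_parts(parts, make_filter):
    """Yield, for each part, the items not already seen in earlier parts.

    `make_filter()` is a hypothetical factory returning an object with
    `put(item)`, `might_contain(item)` and `merge(other)`.
    """
    seen = make_filter()                     # union of the filters of all earlier parts
    for part in parts:
        current = make_filter()              # bloom filter of this part only
        kept = []
        for item in part:
            if seen.might_contain(item):     # probably a duplicate from an earlier part
                continue
            current.put(item)
            kept.append(item)
        seen.merge(current)                  # "merge into the bloom filter of earlier parts"
        yield kept                           # uniques of this part, ready to write out
```

Each yielded batch then only contains items not seen in earlier parts (up to the false positive rate), so a final merge only ever handles uniques.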

Merging may also become an optional step in that scenario, since the collection would already be deduplicated thanks to the bloom filter. Random shuffling can be left to another job, potentially running on a different cluster.

A bloom filter for 100B items with a 3% false positive rate is about 90GB.
This option is feasible assuming enough RAM.
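For reference, the usual sizing formula for an optimal bloom filter, bits = -n * ln(p) / (ln 2)^2, lands at roughly that figure:

```python
import math

def bloom_filter_size_gb(n_items: float, fp_rate: float) -> float:
    """Optimal bloom filter size in GB for n_items at the given false positive rate."""
    bits = -n_items * math.log(fp_rate) / (math.log(2) ** 2)
    return bits / 8 / 1e9

print(bloom_filter_size_gb(100e9, 0.03))  # ~91 GB
```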

10% of CC succeeded with the current code and 10TB of total NVMe. The end result is 3TB of parquet (maybe 5TB uncompressed).
That seems to say that, most likely, the whole of CC would succeed with 100TB of total NVMe, so e.g. 100 instances.
Let's try to figure out whether a bloom filter or a different merging strategy could help reduce this.

Actually, the best path here is simply to get more local space. AWS has instances with 8TB of local drives.

Still, is the bloom filter available in Python?

The answer is no.

No, let's do #18 instead.