Implementation of a Bloom Filter using the Spark framework in Python. All the documentation for this project can be found here.
In order to run the algorithm on a cluster, you have to install the same Python environment on all the cluster's machines.
These are the steps to create a virtual environment with all the needed packages:
- Only on Debian/Ubuntu, install the Python venv module:
  sudo apt-get install python3-venv
- Create a virtual environment and activate it:
  python -m venv pyspark_venv
  source pyspark_venv/bin/activate
- Install all the needed dependencies using pip:
  pip3 install pyspark mmh3 venv-pack
- Pack the virtual environment into a .tar.gz archive:
  venv-pack -o pyspark_venv.tar.gz
To run the application, point PySpark at the packed environment and submit the job, passing the input path, the output path and the desired false positive rate:
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_venv.tar.gz#environment main.py [input path] [output path] [false positive rate]
For example:
spark-submit --archives pyspark_venv.tar.gz#environment main.py data/data.txt output 0.01
Note: remember to delete the output folder before the next execution, using hadoop fs -rm -r [output path].
We have tested the bloom filter with the title.ratings IMDb dataset (available here).
The dataset is a .tsv file, with a header line describing the structure of the data:
tconst averageRating numVotes
We transform this file into a .txt file without the header (see the sketch after the sample rows below):
tt0000001 5.7 1882
tt0000002 5.9 250
tt0000003 6.5 1663
tt0000004 5.8 163
tt0000005 6.2 2487
tt0000006 5.2 166
tt0000007 5.4 773
tt0000008 5.4 2024
tt0000009 5.3 194
tt0000010 6.9 6803
tt0000011 5.3 346
tt0000012 7.4 11692
tt0000013 5.7 1801
... ... ...
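As an illustration, one possible way to produce the headerless file is shown below; the file names and the conversion step are assumptions, since the repository only ships the already-converted data.txt.
# A possible conversion: drop the header line from the IMDb .tsv file.
# File names are illustrative; adapt them to where the dataset is stored.
with open("title.ratings.tsv") as src, open("data.txt", "w") as dst:
    next(src)            # skip the "tconst  averageRating  numVotes" header line
    dst.writelines(src)  # copy the remaining rows unchanged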
In the setup phase, in which we compute m for each bloom filter, we also pre-process this dataset, removing the numVotes column and rounding the values in the averageRating column.
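The record pre-processing and the sizing of each filter could look like the sketch below; the helper names are hypothetical, but the formulas for m (bits) and k (hash functions) are the standard Bloom filter ones for n keys and a target false positive rate p.
import math

def preprocess(line):
    # Drop the numVotes column and round averageRating to the nearest integer,
    # e.g. "tt0000001 5.7 1882" -> ("tt0000001", 6).
    # (Python's built-in round is used here; the exact rounding rule is an assumption.)
    movie_id, rating, _num_votes = line.split()
    return movie_id, round(float(rating))

def bloom_filter_params(n, p):
    # Standard Bloom filter sizing for n keys and target false positive rate p.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # number of bits
    k = max(1, round((m / n) * math.log(2)))               # number of hash functions
    return m, k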
You can find the data.txt file (and a reduced version of the same file used for testing purposes, data1.txt) in this folder.
The algorithm generates two output folders: bloom_filters, with a bloom filter for each vote, and fp_rates, with the number of false positives and the false positive rate computed for each bloom filter.
You can find these folders inside the <output path> specified as an input parameter; for example, if you specify output as the output path, the output structure will be:
output
|
|_ bloom_filters
|
|_ fp_rates
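For completeness, here is a minimal sketch of how a single filter could be queried with the mmh3 library and how the figures in fp_rates could be derived; the bit-array representation and function names are assumptions, not necessarily what main.py does.
import mmh3

def add(bits, m, k, key):
    # Set the k bit positions obtained from k seeded MurmurHash3 hashes of the key.
    for seed in range(k):
        bits[mmh3.hash(key, seed) % m] = 1

def contains(bits, m, k, key):
    # The key is "possibly present" only if every one of its k positions is set.
    return all(bits[mmh3.hash(key, seed) % m] for seed in range(k))

def false_positive_stats(bits, m, k, negatives):
    # negatives: keys whose true rounded rating is NOT the vote of this filter;
    # each one reported as present is a false positive.
    fp = sum(1 for key in negatives if contains(bits, m, k, key))
    return fp, fp / len(negatives)
Per-vote figures like those in the example below would then come from running this test once for each of the ten filters.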
This is an example of the output printed by the execution of the application, using 0.01 as the p value:
***** results *****
vote 1 --> false positives: 12481, false positive rate: 0.01002832292147922
vote 2 --> false positives: 12872, false positive rate: 0.010385712383864707
vote 3 --> false positives: 12228, false positive rate: 0.009938215314994612
vote 4 --> false positives: 11999, false positive rate: 0.01002203360667924
vote 5 --> false positives: 11349, false positive rate: 0.009860438606627667
vote 6 --> false positives: 9953, false positive rate: 0.009960091585208669
vote 7 --> false positives: 9192, false positive rate: 0.010166252289397545
vote 8 --> false positives: 8837, false positive rate: 0.010116007262187402
vote 9 --> false positives: 11843, false positive rate: 0.010266826525048092
vote 10 --> false positives: 12337, false positive rate: 0.010021908963666214
You can find our implementation of the bloom filter using Hadoop in this repository.