The plaintext experiments are done with PlainPRL.py.
Unzip the datasets into folders "NCVR" (North Carolina) and "bnb_tpl_datasets" (bnb_tpl), and put them in the same directory as PlainPRL.py.
The -t option specifies whether the test uses the bnb_tpl dataset or the ncvr dataset.
The -n option specifies the number of samples we select for both parties. If it is not specified, experiments are run for several sample counts, listed in n_sample_range in the main() function.
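The two options above could be parsed roughly as follows. This is a minimal sketch, not the code in PlainPRL.py: the helper names build_parser and sample_sizes, the help strings, and the example values in n_sample_range are assumptions made for illustration.

```python
import argparse


def build_parser():
    # Hypothetical reconstruction of the command-line interface described
    # above; the real PlainPRL.py may declare its options differently.
    parser = argparse.ArgumentParser(description="Plaintext experiment (sketch)")
    parser.add_argument("-t", choices=["bnb_tpl", "ncvr"],
                        help="dataset to test on")
    parser.add_argument("-n", type=int, default=None,
                        help="number of samples for both parties")
    return parser


def sample_sizes(args, n_sample_range):
    # With -n given, run a single experiment size; otherwise fall back to
    # the whole range (assumed to mirror n_sample_range in main()).
    return [args.n] if args.n is not None else list(n_sample_range)


# Example: explicit argument list instead of sys.argv, placeholder range.
args = build_parser().parse_args(["-t", "ncvr", "-n", "1000"])
print(sample_sizes(args, [100, 1000, 10000]))
```

On the command line this would correspond to an invocation such as python3 PlainPRL.py -t ncvr -n 1000.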
The parameters we varied are set in the main() function:
The length of the vector of min-hash values, num_perms_per_hash_range, which lists the set of values we want to test.
The number of duplicates of each record, num_duplicates_range, which lists the set of values we want to test.
The number of bins per duplicate, num_bins_per_duplicate, which lists the set of values we want to test.
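One way to picture how three such lists drive a series of experiments is a sweep over every combination. The sketch below assumes a full Cartesian product, which may not be how main() actually iterates; the list contents are placeholders and parameter_grid is a hypothetical helper.

```python
from itertools import product

# Placeholder values for illustration only; the lists in main() may differ.
num_perms_per_hash_range = [1, 2]
num_duplicates_range = [10, 20]
num_bins_per_duplicate = [100, 1000]


def parameter_grid(perms, duplicates, bins):
    # Yield one configuration per combination of the three parameter lists.
    for p, d, b in product(perms, duplicates, bins):
        yield {"num_perms_per_hash": p, "num_duplicates": d, "num_bins": b}


configs = list(parameter_grid(num_perms_per_hash_range,
                              num_duplicates_range,
                              num_bins_per_duplicate))
print(len(configs))  # 2 * 2 * 2 = 8 configurations
```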
The input datasets should be in two folders, bnb_tpl_datasets and NCVR.
The output files are in CSV format, with columns giving some of the experiment parameters, the corresponding number of comparisons required, and the resulting accuracy.
We need the ABY framework (https://github.com/encryptogroup/ABY.git) to run our cryptographic experiments.
The folder my_psi should be placed in the ABY/src/examples/ folder. The CMakeLists.txt in the same folder needs one extra line: add_subdirectory(my_psi).
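Adding the line by hand is enough. For a scripted setup, a small helper along these lines could append it only when it is missing. The function name register_example is hypothetical and not part of ABY or my_psi.

```python
from pathlib import Path


def register_example(cmake_lists: Path, name: str = "my_psi") -> bool:
    """Append add_subdirectory(<name>) to a CMakeLists.txt if absent.

    Returns True when the file was modified, False when the line was
    already present.
    """
    line = f"add_subdirectory({name})"
    text = cmake_lists.read_text()
    if line in (existing.strip() for existing in text.splitlines()):
        return False
    if text and not text.endswith("\n"):
        text += "\n"  # keep the new entry on its own line
    cmake_lists.write_text(text + line + "\n")
    return True


# Usage, run from the directory that contains the ABY checkout:
# register_example(Path("ABY/src/examples/CMakeLists.txt"))
```

Running the helper twice leaves the file unchanged the second time, so it is safe to call from a setup script.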
Then build the project as instructed in ABY.
Copy my_psi/common/gen.py to ABY/build/bin and run python3 gen.py. It generates inp0 and inp1. The datasets we used are also included as my_psi/common/inp0 and my_psi/common/inp1.
Copy my_psi/common/measure.py to ABY/build/bin and run python3 measure.py. This starts the experiment, which measures the number of comparisons and the runtime.
In measure.py there are some parameters.
run specifies how many times we should repeat the experiment. Here it is a list of length
num_eles is the number of inputs for both parties. Here it is in the range between
num_bins is the number of bins per duplicate. It is fixed to
The output files are in CSV format, with columns giving some of the experiment parameters, the corresponding CPU time used, and the number of comparisons done.
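Results in this form are easy to summarize with the standard csv module. The sketch below averages one column per input size. The column names num_eles, cpu_time, and num_comparisons are assumptions, and the rows are made-up placeholders rather than measured results; check the header of the actual output file before reusing this.

```python
import csv
import io
from collections import defaultdict


def mean_by_key(rows, key, value):
    # Group rows by the `key` column and average the `value` column.
    sums = defaultdict(lambda: [0.0, 0])
    for row in rows:
        acc = sums[row[key]]
        acc[0] += float(row[value])
        acc[1] += 1
    return {k: total / count for k, (total, count) in sums.items()}


# Placeholder data with hypothetical column names, not real measurements.
sample = io.StringIO(
    "num_eles,cpu_time,num_comparisons\n"
    "100,1.0,10\n"
    "100,3.0,10\n"
    "200,4.0,20\n"
)
rows = list(csv.DictReader(sample))
print(mean_by_key(rows, "num_eles", "cpu_time"))
```

To read a real output file, replace the StringIO object with open(path, newline="").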