LSH Ensemble

This is an assignment for the Big Data course at Roma Tre University.

This repo is based on the work reported in the paper "LSH Ensemble: Internet-Scale Domain Search".

Requirements

To run this project you need:

Python 3.6.9
Hadoop 3.2.1
Spark 3.0.0
pip3 installed on your machine. To install pip3, run the following commands in a shell:

sudo apt update
sudo apt install python3-pip
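
Before continuing, it may help to confirm that the required tools are reachable from your shell. The following is a minimal sketch, not part of this repo: the helper name check_tool is hypothetical, and the command names (python3, pip3, hadoop, spark-submit) assume a typical installation where each tool is on PATH.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Command names are assumptions about a typical installation.
for tool in python3 pip3 hadoop spark-submit; do
    check_tool "$tool" || true
done
```

For any tool reported as found, `python3 --version`, `hadoop version` and `spark-submit --version` print the installed version, which you can compare against the list above.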

To start Hadoop, open a shell and run

$HADOOP_HOME/sbin/start-dfs.sh
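
To check that HDFS actually started, you can look for its daemons in the output of jps, the JVM process listing tool shipped with the JDK. The sketch below is an illustration only: the helper name hdfs_daemons is hypothetical, and the daemon names it looks for (NameNode, DataNode, SecondaryNameNode) assume a default single-node setup.

```shell
# Hypothetical helper: read jps-style output ("<pid> <name>") from stdin
# and print the HDFS daemon names found in it, one per line, sorted.
hdfs_daemons() {
    awk '$2 ~ /^(NameNode|DataNode|SecondaryNameNode)$/ { print $2 }' | sort -u
}

# Usage, assuming jps is on PATH:
if command -v jps >/dev/null 2>&1; then
    jps | hdfs_daemons || true
fi
```

If a daemon you expect is missing from the output, the Hadoop log files are usually the next place to look.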

Download this repo or clone it by running

git clone https://github.com/ebtelmarz/big_data_lsh_ensemble.git

Move inside the downloaded directory

cd big_data_lsh_ensemble/

Create a virtual environment

python3 -m venv my_env
source my_env/bin/activate
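
To confirm that the virtual environment is active before running the script, you can check the VIRTUAL_ENV variable, which the venv activate script exports. This is a minimal sketch; the helper name venv_active is hypothetical and not part of this repo.

```shell
# Hypothetical helper: succeed if a virtual environment appears to be active.
# Relies on VIRTUAL_ENV, which the venv activate script exports.
venv_active() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if venv_active; then
    echo "virtual environment active: $VIRTUAL_ENV"
else
    echo "no virtual environment active"
fi
```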

Execute the run.sh script by running

sh run.sh