
MapReduce implementation of LSH Ensemble

Primary language: Python

LSH Ensemble

This is an assignment for the Big Data course at Roma Tre University.

This repo is based on the work reported in the paper "LSH Ensemble: Internet-Scale Domain Search".
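
For readers new to the paper: it frames domain search in terms of set containment (how much of a query set is covered by a candidate domain) rather than Jaccard similarity, and relates the two through the sizes of the sets. The snippet below is only a small illustration of that relationship on exact sets. It is not taken from this repo, it does not use MinHash or MapReduce, and the function names are hypothetical.

```python
def containment(query, domain):
    """Fraction of the query's values that also appear in the domain."""
    query, domain = set(query), set(domain)
    if not query:
        raise ValueError("query must be non-empty")
    return len(query & domain) / len(query)


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        raise ValueError("at least one set must be non-empty")
    return len(a & b) / len(a | b)


def containment_to_jaccard(t, query_size, domain_size):
    """Jaccard similarity implied by containment t for the given set sizes.

    Follows from |Q & X| = t * |Q| and |Q | X| = |Q| + |X| - t * |Q|.
    """
    return t / (domain_size / query_size + 1 - t)


# Toy data, made up for illustration only.
query = {1, 2, 3, 4}
domain = {3, 4, 5, 6, 7, 8}

t = containment(query, domain)
s = containment_to_jaccard(t, len(query), len(domain))
print(t, jaccard(query, domain), s)
```

Because the conversion depends on the domain size, a single containment threshold maps to different Jaccard thresholds for domains of different sizes.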


To run this project you need:

  • Python 3.6.9
  • Hadoop 3.2.1
  • Spark 3.0.0
  • pip3 installed on your machine. To install pip3, run the following commands in a shell:
sudo apt update
sudo apt install python3-pip
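
Before going further, it can help to confirm that the tools listed above are reachable from your shell. The following is a minimal sketch, not part of this repo. It assumes that the executables are named pip3, hadoop and spark-submit and are on your PATH, and it treats the Python version above as a minimum rather than an exact requirement. It does not check the Hadoop or Spark versions.

```python
import shutil
import sys

# Minimum interpreter version, taken from the list above.
# Treating it as a lower bound is an assumption.
REQUIRED_PYTHON = (3, 6)

# Executable names assumed to be on PATH after a standard install.
REQUIRED_TOOLS = ("pip3", "hadoop", "spark-submit")


def python_ok(version_info=sys.version_info, minimum=REQUIRED_PYTHON):
    """Return True if the interpreter is at least the minimum (major, minor)."""
    return tuple(version_info[:2]) >= tuple(minimum)


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its resolved path, or None if not found."""
    return {name: shutil.which(name) for name in names}


def report():
    """Print a summary and return True only if every check passed."""
    ok = python_ok()
    print("python >= {}.{}: {}".format(*REQUIRED_PYTHON, "ok" if ok else "too old"))
    for name, path in find_tools().items():
        print("{}: {}".format(name, path or "NOT FOUND"))
        ok = ok and path is not None
    return ok
```

Calling `report()` prints one line per check. Anything reported as NOT FOUND needs to be installed or added to PATH first.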


To run the project locally

Start Hadoop, open a shell and run


Download this repo or clone it by running

git clone https://github.com/ebtelmarz/big_data_lsh_ensemble.git

Move inside the downloaded directory

cd big_data_lsh_ensemble/

Execute the run.sh script by running in a shell

sh run.sh


To run the project on a cluster

Create a virtual environment

python3 -m venv my_env
source my_env/bin/activate
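
If you prefer to create the environment from a script, the standard library `venv` module can do the same thing as `python3 -m venv`. This is a minimal sketch under that assumption. The helper names `create_env` and `activate_script` are hypothetical and are not part of this repo.

```python
import os
import venv
from pathlib import Path


def activate_script(env_dir):
    """Path of the shell activation script inside a venv directory."""
    # venv uses "Scripts" on Windows and "bin" elsewhere.
    sub = "Scripts" if os.name == "nt" else "bin"
    return Path(env_dir) / sub / "activate"


def create_env(env_dir="my_env", with_pip=True):
    """Create the virtual environment and return its activation script path."""
    venv.create(env_dir, with_pip=with_pip)
    return activate_script(env_dir)
```

Activation still has to happen in the shell, because `source` changes the environment of the shell that runs it. The path returned by `create_env()` is the file to pass to `source`.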

Execute the run.sh script by running

sh run.sh