/document-similarity

Detecting similarity between documents using hashing.

Primary LanguageC++

Document Similarity

This project is part of the Algorithms course (fall 2016) at Facultat d'Informàtica de Barcelona (UPC - BarcelonaTech). Done together with Víctor Massagué & Rubén Marías.

The project full documentation is on docs/ folder, with both our explanations and conclusions and the original assignment from course professors.

Intro

This project implements Jaccard Similarity, Mihash Similarity and Locally Sensitive Hashing to detect and grade similarity between documents.

This README is purely a howto guide to compile and execute the program. For more detailed information on the actual implementation, please refer to the docs/ folder.

Requirements

This project has been developed in C++, thus, a C++ compiler is required.

The Boost C++ library has also been used, therefore Boost must be installed on the system.

Finally, CMake is the system used to manage the building process, therefore it's also needed.

All these requirements are available and easy to install on all major desktop platforms.

Comipiling

To compile, browse to the root directory of the project and execute cmake CMakeLists.txt. This will create the actual makefile and required files adapted to the current environment.

Then, run make and executables will be generated inside bin/ folder.

Running

To execute both the main script and the tests, a directory with .txt input files must be provided. The path can be relative or absolute.

To execute main program:

bin/comparator test/set1

To execute experiments:

bin/experiments test/set1