The goal of this project is to develop a program that finds duplicate videos.
Python
Data-Mining-from-Large-Data-Sets
The goal of this project is to develop a program that finds duplicate videos. The videos were provided as shingles in a text file. There was also a training set for the video data. The program needed to run on a Map Reduce environment and should be developed with a Locality Sensitive Hashing (LSH) algorithm.
Environment
The running environment with a Map Reduce framework was provided. So the Mapper and Reducer were developed independently and then submitted to the running environment.
This project shows that LSH is a nice approach to find near duplicates in a large set of videos. It can be simply developed in a Map Reduce environment.