/Data-Mining-from-Large-Data-Sets

The goal of this project is to develop a program that finds duplicate videos.

Primary LanguagePython

Data-Mining-from-Large-Data-Sets

The goal of this project is to develop a program that finds duplicate videos. The videos were provided as shingles in a text file. There was also a training set for the video data. The program needed to run on a Map Reduce environment and should be developed with a Locality Sensitive Hashing (LSH) algorithm.

Environment

The running environment with a Map Reduce framework was provided. So the Mapper and Reducer were developed independently and then submitted to the running environment.

This project shows that LSH is a nice approach to find near duplicates in a large set of videos. It can be simply developed in a Map Reduce environment.