MinHash lyric deduplication

Music lyrics deduplication using MinHash Locality Sensitive Hashing

This project implements a deduplication technique using Locality Sensitive Hashing.

Requirements

This project requires python 3, plus the additional packages listed below:

  • package dataskecth
  • package editdistance
  • package pickle

These packages may be installed using your OS package manager or pip3

pip3 install datasketch editdistance pickle -U

Order of the scripts

First the crawler scripts must be run in order to obtain some data to process.

References

Crawler

Python MinHash LSH