An awesome list of data deduplication use cases, papers, tools, and methods.
- Fork this repository;
- Install the dependencies with `pip install -r requirements.txt` and `pre-commit install`;
- Add your data to the corresponding folder by copying the `template.json` file;
- Run `pre-commit run --all-files` to format the data;
- Commit your changes and open a pull request to this repository.
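Taken together, the steps above amount to a workflow like the following sketch; the fork URL, branch name, and destination file path are placeholders, not conventions fixed by this repository:

```bash
# Clone your fork (replace YOUR_USERNAME/awesome-data-deduplication with your fork's path).
git clone https://github.com/YOUR_USERNAME/awesome-data-deduplication.git
cd awesome-data-deduplication

# Install the dependencies and the pre-commit hooks.
pip install -r requirements.txt
pre-commit install

# Copy the template into the corresponding folder, then fill it in.
cp template.json data/my_entry.json

# Format the data before committing.
pre-commit run --all-files

# Commit on a branch and push, then open a pull request against this repository.
git checkout -b add-my-entry
git add data/my_entry.json
git commit -m "Add new deduplication entry"
git push origin add-my-entry
```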
Paper | Dataset | Final Data Size | Method | Hardware | License | Comments |
---|---|---|---|---|---|---|
NA | RedPajama | 1.2T Tokens | SimHash (partial) | NA | Apache 2.0 | |
NA | SlimPajama | 627B Tokens | MinHash + LSH | NA | Apache 2.0 | |
arXiv | Multiple Sources | 200B–400B Tokens | MinHash | 200GB w/ 64 cores | Apache 2.0 | [^1] |
arXiv | CulturaX | 6.3T Tokens | MinHashLSH (per language) | 600 AWS c5.24xlarge (96 vCPUs / 192GB each) | | [^1] |
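Most rows above rely on MinHash signatures with locality-sensitive hashing (LSH) banding to find near-duplicate documents without comparing every pair. Below is a minimal, self-contained Python sketch of that idea; the shingle size, permutation count, and band settings are illustrative choices, not the parameters used by any dataset in the table:

```python
import hashlib
import re
from collections import defaultdict

NUM_PERM = 64    # hash functions per MinHash signature
NUM_BANDS = 32   # LSH bands; NUM_PERM // NUM_BANDS = 2 rows per band

def shingles(text, n=3):
    """Lowercase word n-grams: the set representation MinHash operates on."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i : i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(items):
    """For each seeded hash function, keep the minimum hash over all shingles.
    Two sets agree on a given minimum with probability equal to their Jaccard
    similarity, so the signature is a compact similarity estimator."""
    sig = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return sig

def candidate_pairs(docs):
    """LSH banding: split each signature into bands; documents sharing any
    whole band land in the same bucket and become candidate duplicates."""
    rows = NUM_PERM // NUM_BANDS
    buckets = defaultdict(list)
    pairs = set()
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text))
        for band in range(NUM_BANDS):
            key = (band, tuple(sig[band * rows : (band + 1) * rows]))
            for other in buckets[key]:
                pairs.add((other, doc_id))
            buckets[key].append(doc_id)
    return sorted(pairs)

if __name__ == "__main__":
    corpus = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumped over the lazy dog",
        "c": "an entirely different sentence about large scale deduplication",
    }
    # The near-duplicates "a" and "b" should collide in at least one band;
    # "c" shares no shingles with them and should not.
    print(candidate_pairs(corpus))
```

At the corpus sizes listed in the table this step is distributed rather than run single-process, e.g. the Spark variant mentioned in the footnote below.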
[^1]: This uses a variant of the Spark script from text-dedup 🎉