An awesome list of data deduplication use cases, papers, tools, and methods.

Primary language: Python. License: MIT.

Awesome Data Deduplication

How to contribute

  1. Fork this repository;
  2. Install the dependencies with pip install -r requirements.txt, then run pre-commit install;
  3. Add your data to the corresponding folder by copying the template.json file;
  4. Run pre-commit run --all-files to format the data;
  5. Commit your changes and open a pull request to this repository.

Textual Data

| Paper | Dataset | Final Data Size | Method | Hardware | License | Comments |
| --- | --- | --- | --- | --- | --- | --- |
| NA | RedPajama | 1.2T tokens | SimHash (partial) | NA | Apache 2.0 | |
| NA | SlimPajama | 627B tokens | MinHash + LSH | NA | Apache 2.0 | |
| arXiv | Multiple Sources | 200B ~ 400B tokens | MinHash | 200GB w/ 64 cores | Apache 2.0 | See footnote 1 |
| arXiv | CulturaX | 6.3T tokens | MinHashLSH (per language) | 600 AWS c5.24xlarge instances (96 vCPUs / 192 GB each) | | See footnote 1 |
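
For readers new to the MinHash method named in several rows above, here is a minimal, standard-library-only sketch of the idea. It is not the pipeline used by any of the listed datasets: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `find_near_duplicates`), the 128 hash functions, the word 3-gram shingles, and the 0.7 threshold are all illustrative assumptions. It also compares every pair of documents directly, whereas the large-scale runs in the table use LSH to avoid that quadratic step.

```python
import hashlib
from itertools import combinations

NUM_PERM = 128  # number of seeded hash functions (assumed value)


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) for a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(token, seed):
    """64-bit hash of a token; the seed is passed as the BLAKE2b salt."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_perm=NUM_PERM):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def find_near_duplicates(docs, threshold=0.7):
    """Return index pairs whose estimated similarity is at least the threshold.

    All pairs are compared here; a real pipeline would bucket signatures
    with LSH instead of doing this quadratic scan.
    """
    sigs = [minhash_signature(shingles(d)) for d in docs]
    return [
        (i, j)
        for i, j in combinations(range(len(docs)), 2)
        if estimated_jaccard(sigs[i], sigs[j]) >= threshold
    ]
```

Calling `find_near_duplicates(list_of_texts)` returns candidate pairs to review or drop. Raising the number of hash functions tightens the similarity estimate at the cost of more hashing work.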

Image Data

Multi-modal Data

Footnotes

  1. This uses a variant of the Spark script from text-dedup 🎉