An awesome list of data deduplication use cases, papers, tools, and methods.
- Fork this repository;
- Install the dependencies with `pip install -r requirements.txt` and `pre-commit install`;
- Add your data to the corresponding folder by copying the `template.json` file;
- Run `pre-commit run --all-files` to format the data;
- Commit your changes and open a pull request to this repository.
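Taken together, the steps above amount to a workflow like the following sketch; the fork URL, branch name, and destination file path are placeholders, not conventions fixed by this repository:

```bash
# Clone your fork (replace YOUR_USERNAME/awesome-data-deduplication with your fork's path).
git clone https://github.com/YOUR_USERNAME/awesome-data-deduplication.git
cd awesome-data-deduplication

# Install the dependencies and the pre-commit hooks.
pip install -r requirements.txt
pre-commit install

# Copy the template into the corresponding folder, then fill it in.
cp template.json data/my_entry.json

# Format the data before committing.
pre-commit run --all-files

# Commit on a branch and push, then open a pull request against this repository.
git checkout -b add-my-entry
git add data/my_entry.json
git commit -m "Add new deduplication entry"
git push origin add-my-entry
```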
Paper | Dataset | Final Data Size | Method | Hardware | License | Comments |
---|---|---|---|---|---|---|
NA | RedPajama | 1.2T Tokens | SimHash (partial) | NA | Apache 2.0 | |
NA | SlimPajama | 627B Tokens | MinHash + LSH | NA | Apache 2.0 | |
arXiv | Multiple Sources | 200B–400B Tokens | MinHash | 200GB w/ 64 cores | Apache 2.0 | [^1] |
arXiv | CulturaX | 6.3T Tokens | MinHashLSH (per language) | 600 AWS c5.24xlarge (96 vCPUs / 192GB each) | | [^1] |
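Most rows above rely on MinHash signatures with locality-sensitive hashing (LSH) banding to find near-duplicate documents without comparing every pair. Below is a minimal, self-contained Python sketch of that idea; the shingle size, permutation count, and band settings are illustrative choices, not the parameters used by any dataset in the table:

```python
import hashlib
import re
from collections import defaultdict

NUM_PERM = 64    # hash functions per MinHash signature
NUM_BANDS = 32   # LSH bands; NUM_PERM // NUM_BANDS = 2 rows per band

def shingles(text, n=3):
    """Lowercase word n-grams: the set representation MinHash operates on."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i : i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(items):
    """For each seeded hash function, keep the minimum hash over all shingles.
    Two sets agree on a given minimum with probability equal to their Jaccard
    similarity, so the signature is a compact similarity estimator."""
    sig = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return sig

def candidate_pairs(docs):
    """LSH banding: split each signature into bands; documents sharing any
    whole band land in the same bucket and become candidate duplicates."""
    rows = NUM_PERM // NUM_BANDS
    buckets = defaultdict(list)
    pairs = set()
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text))
        for band in range(NUM_BANDS):
            key = (band, tuple(sig[band * rows : (band + 1) * rows]))
            for other in buckets[key]:
                pairs.add((other, doc_id))
            buckets[key].append(doc_id)
    return sorted(pairs)

if __name__ == "__main__":
    corpus = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumped over the lazy dog",
        "c": "an entirely different sentence about large scale deduplication",
    }
    # The near-duplicates "a" and "b" should collide in at least one band;
    # "c" shares no shingles with them and should not.
    print(candidate_pairs(corpus))
```

At the corpus sizes listed in the table this step is distributed rather than run single-process, e.g. the Spark variant mentioned in the footnote below.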
[^1]: This uses a variant of the Spark script from text-dedup 🎉