bigscience-workshop/data-preparation
Code used for sourcing and cleaning the BigScience ROOTS corpus
Jupyter Notebook, Apache-2.0
Issues
Changing parameter values to extremes in parameters_filtering.py doesn't change the number of documents being removed
#42 opened by dk-github-acc - 0
author = "bigscience-catalogue-lm-data": this data does not exist on Hugging Face.
#41 opened by belle9217 - 0
Can't find the Deduplication Report
#40 opened by longxudou - 0
Extending this codebase
#39 opened by chris-ha458 - 0
the version of simhash
#38 opened by wang9702 - 0
Add citation
#14 opened by HugoLaurencon - 0
Check links
#9 opened by HugoLaurencon - 0
Write READMEs
#6 opened by thomasw21