NVIDIA/NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
Jupyter Notebook
Apache-2.0
Issues
- 3
Graceful handling when no LSH duplicates found.
#381 opened by davzoku - 4
- 0
- 0
Add CI tests for Hugging Face classifiers
#405 opened by sarahyurick - 5
Translation example with ctranslate2's Translator.
#246 opened by uahmed93 - 0
Add Trafilatura text extraction
#400 opened by sarahyurick - 1
- 2
Connected Components Speedup breaks tutorial examples
#380 opened by davzoku - 1
Improve Pytorch Model Performance
#329 opened by VibhuJawa - 0
Update `columns` documentation
#378 opened by sarahyurick - 0
Use CrossFit for `TokenizerFertilityFilter`
#377 opened by sarahyurick - 0
Add GPU test with NeMo 2.0
#376 opened by sarahyurick - 0
- 0
Add more comprehensive `DocumentDataset` PyTests
#371 opened by sarahyurick - 4
Data Curation Failure in NVIDIA NeMo Docker Container (nvcr.io/nvidia/nemo:24.07) Using DataCurator with JSONL Files
#348 opened by MalekSolta - 1
Cannot run `gpu_exact_dups` due to deprecated flag
#350 opened by davzoku - 2
Regarding the issue of AEGIS model labels
#349 opened by xiafeng-nb - 5
GitHub workflows improvements
#259 opened by sarahyurick - 0
- 0
- 0
- 2
Japanese support for get_word_splitter
#257 opened by oumugai - 1
- 1
- 0
- 1
Deprecate `max_text_bytes_per_part`
#331 opened by sarahyurick - 1
Make `max_text_bytes_per_part` configurable
#233 opened by ayushdg - 2
Check Pytorch cuda context is valid across GPUs
#284 opened by VibhuJawa - 1
- 1
Add `memory_limit` to `get_client` function
#324 opened by sarahyurick - 1
ADD DCO signing as a pre-commit check
#290 opened by VibhuJawa - 0
Write to file w/o including `filename` column
#293 opened by joshwyatt - 0
- 0
- 1
Curator should support numpy > 2
#287 opened by praateekmahajan - 0
- 3
Unmanaged memory is high and frozen execution
#295 opened by pappagari - 4
- 1
Cleanup Github Branches
#289 opened by VibhuJawa - 0
[DOC] Add documentation for fineweb edu classifier
#248 opened by VibhuJawa - 0
Resuming the job on slurm after it gets cancelled.
#297 opened by uahmed93 - 1
Expand RMM options for Python API
#260 opened by sarahyurick - 1
Semantic Dedup doesn't work with UCX
#283 opened by praateekmahajan - 3
Installation error: pip._vendor.resolvelib.resolvers.ResolutionTooDeep: 200000
#278 opened by pappagari - 0
[FEA] Improve separate_by_metadata performance when dealing with jsonl files
#255 opened by miguelusque - 0
- 1
Exact/fuzzy deduplication bug
#252 opened by yyu22 - 3
Grammar and punctuation nits in Jupyter Notebooks
#228 opened by sarahyurick - 0
Re-add `test_uneven_common_crawl_range` PyTest
#236 opened by sarahyurick - 1