togethercomputer/RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Python · Apache-2.0
Issues
- 0
Estimated Cost for Arxiv Download
#120 opened by amy-hyunji - 0
Filtering on Document Length
#118 opened by karan-dalal - 1
We then run the same cc-net pipeline on warc_wikipedia.warc, which produces warc_wikipedia.warc.wet
#64 opened by shawn0wang - 1
Exact dedup details
#115 opened by jordane95 - 6
regarding to deduplication
#79 opened by kimcando - 4
Other language data
#93 opened by Dzg0309 - 2
Thresholds for all quality signals
#92 opened by torshie - 2
what does the prefix "rps_" mean?
#114 opened by bpwl0121 - 5
slow transfer speeds from URL sources
#113 opened by axelmagn - 1
Difference between RedPajama-Data-1T, RedPajama-Data-V2, RedPajama-Data-V1
#112 opened by konradipipan - 1
Inconsistent IDs lead to distributed computing woes.
#111 opened by axelmagn - 2
Spanish artifact building error
#110 opened by hicotton02 - 1
Step 2) "Invalid option: ---input_base_uri"
#107 opened by timpal0l - 1
Potential Language Contamination Inquiry
#108 opened by iBibek - 2
About the final result
#105 opened by Jdemon233 - 2
What is the output of `run_lsh.py`?
#96 opened by virendrakabra14 - 1
Running the pipeline on cloud or a big data platform
#104 opened by zllai - 0
Running full pipeline on a small part of CC
#103 opened by zhentingqi - 0
Unavailable Parameters
#102 opened by zhentingqi - 0
Invalid uri: ParseResult(...) must be of the form s3://<bucket>/<key> or file://<path>
#101 opened by timpal0l - 4
what's the specific meaning of dsir?
#99 opened by BBetteroff - 1
Is there a specific meaning of the snapshot id?
#98 opened by zijwang - 2
possibly missing shard from host
#97 opened by sagnak - 1
Impossible unpack tail data... took time to download, but impossible to unpack dataset without quality signals with broken link.
#94 opened by RuslanKovalyov - 1
Are shards randomly created?
#95 opened by virendrakabra14 - 1
Low Data Downloading Speed
#89 opened by lipingtang17 - 1
Train a new wikiref model
#91 opened by torshie - 4
Deduplicated version of RedPajama-v2
#84 opened by joao-alves97 - 1
Request: Enable artifact prep on local data
#83 opened by hicotton02 - 2
Token counts
#88 opened by timsueberkrueb - 2
regarding to quality classifier
#86 opened by kimcando - 0
How is the SHA1 digest computed?
#81 opened by RicardoDominguez - 2
Invalid argument when running cc_net
#82 opened by Practicinginhell - 6
Executing V2 issues
#80 opened by hicotton02 - 2
Issue on book datasets download
#74 opened by beccabai - 2
Failed building wheel for cc-net
#67 opened by hicotton02 - 1
ArXiv cleaning issue
#68 opened by hicotton02 - 0
cc-net failure on slurm cluster
#72 opened by hicotton02 - 1
cc_net processing local wet file
#78 opened by hicotton02 - 0
New Features
#76 opened by zhangce - 3
What does "default" do in `load_dataset('togethercomputer/RedPajama-Data-1T', "default")`?
#70 opened by brando90 - 1
Specifying arxiv dates
#71 opened by matthieumeeus - 1
Unlock open science for dataset generation
#66 opened by AbcSxyZ