togethercomputer/RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

Python · Apache-2.0

Issues

Estimated Cost for Arxiv Download
#120 opened 2 months ago by amy-hyunji
0
Filtering on Document Length
#118 opened 5 months ago by karan-dalal
0
We then run the same cc-net pipeline on warc_wikipedia.warc, which produces warc_wikipedia.warc.wet
#64 opened 5 months ago by shawn0wang
1
Inquiry About Character-Level Basis of Duplication Calculation
#116 opened 6 months ago by luc1fer3
1
Exact dedup details
#115 opened 7 months ago by jordane95
1
regarding to deduplication
#79 opened a year ago by kimcando
6
Other language data
#93 opened a year ago by Dzg0309
4
Thresholds for all quality signals
#92 opened a year ago by torshie
2
what does the prefix "rps_" mean?
#114 opened 8 months ago by bpwl0121
2
slow transfer speeds from URL sources
#113 opened 9 months ago by axelmagn
5
Difference between RedPajama-Data-1T, RedPajama-Data-V2, RedPajama-Data-V1
#112 opened 10 months ago by konradipipan
1
Inconsistent IDs lead to distributed computing woes.
#111 opened 10 months ago by axelmagn
1
Spanish artifact building error
#110 opened 10 months ago by hicotton02
2
Step 2) "Invalid option: ---input_base_uri"
#107 opened 10 months ago by timpal0l
1
Potential Language Contamination Inquiry
#108 opened 10 months ago by iBibek
1
About the final result
#105 opened a year ago by Jdemon233
2
What purpose cutoff.csv used in the cc_net pipeline?
#106 opened a year ago by kemalbastak
2
What is the output of `run_lsh.py`?
#96 opened a year ago by virendrakabra14
8
Running the pipeline on cloud or a big data platform
#104 opened a year ago by zllai
1
Running full pipeline on a small part of CC
#103 opened a year ago by zhentingqi
0
Unavailable Parameters
#102 opened a year ago by zhentingqi
0
Invalid uri: ParseResult(...) must be of the form s3://<bucket>/<key> or file://<path>
#101 opened a year ago by timpal0l
0
what's the specific meaning of dsir?
#99 opened a year ago by BBetteroff
4
Recommended way to load wget-downloaded data using HF datasets API?
#100 opened a year ago by zijwang
1
Is there a specific meaning of the snapshot id?
#98 opened a year ago by zijwang
2
possibly missing shard from host
#97 opened a year ago by sagnak
2
Impossible unpack tail data... took time to download, but impossible to unpack dataset without quality signals with broken link.
#94 opened a year ago by RuslanKovalyov
1
Are shards randomly created?
#95 opened a year ago by virendrakabra14
1
Low Data Downloading Speed
#89 opened a year ago by lipingtang17
1
Train a new wikiref model
#91 opened a year ago by torshie
1
Deduplicated version of RedPajama-v2
#84 opened a year ago by joao-alves97
4
Request: Enable artifact prep on local data
#83 opened a year ago by hicotton02
1
Token counts
#88 opened a year ago by timsueberkrueb
2
regarding to quality classifier
#86 opened a year ago by kimcando
2
where should I go to get the file about "domain_to_category_id.json"?
#87 opened a year ago by suolyer
0
quality_signals, minhash and duplicates missing for tail
#77 opened a year ago by Sheshansh
1
How is the SHA1 digest computed?
#81 opened a year ago by RicardoDominguez
2
Invalid argument when running cc_net
#82 opened a year ago by Practicinginhell
2
Executing V2 issues
#80 opened a year ago by hicotton02
6
Issue on book datasets download
#74 opened a year ago by beccabai
2
Failed building wheel for cc-net
#67 opened a year ago by hicotton02
2
ArXiv cleaning issue
#68 opened a year ago by hicotton02
1
cc-net failure on slurm cluster
#72 opened a year ago by hicotton02
0
cc_net processing local wet file
#78 opened a year ago by hicotton02
1
New Features
#76 opened a year ago by zhangce
0
What does "default" do in `load_dataset('togethercomputer/RedPajama-Data-1T', "default")`?
#70 opened a year ago by brando90
3
Specifying arxiv dates
#71 opened a year ago by matthieumeeus
1
Q: Why does RePajama exist? what problem are you solving?
#69 opened a year ago by brando90
1
I got an issue when I use fasttext doing arxiv cleaning.
#65 opened a year ago by tangtianyi1998
1
Unlock open science for dataset generation
#66 opened a year ago by AbcSxyZ
0