mlfoundations/dclm

DataComp for Language Models

HTML · MIT

Issues

denied access while copying shards from aws s3 bucket
#103 opened a month ago by emirkaan5
3
Searching DCLM-baseline
#101 opened a month ago by chtmp223
0
Could you please release the 8.2B token for the 400M-1x setting?
#100 opened a month ago by xszheng2020
1
bloom filter occupied 90 % of memory on server with 836Gb available ram
#99 opened a month ago by ethany21
0
Cannot train
#97 opened 2 months ago by camilobrownpinilla
1
Understanding the pool sizes at each scale and the DCLM baseline
#91 opened 2 months ago by ameyagodbole
4
Build my own model and use DCLM-1B training script and dataset.
#95 opened 2 months ago by windbar778
0
Can bff read file formats other than jsonl?
#93 opened 2 months ago by ethany21
2
The dataset for training fastText OH-2.5 +ELI5 text classifier
#75 opened 4 months ago by yqy2001
3
Training data of model-based filtering
#74 opened 4 months ago by Yu-Shi
3
tokenization memory usage
#88 opened 3 months ago by brian-ham
1
How to download pools for smaller scale tracks
#87 opened 2 months ago by arnavmdas
1
Using Evaluation Prompts to Inform Data Selection
#86 opened 2 months ago by arnavmdas
1
bff deduplication removes >90% data with NaiveBoth remove type
#83 opened 2 months ago by XirenZhou
2
How can I calculate expected-ngram-count?
#90 opened 3 months ago by ethany21
1
Reproducing experiments in the paper
#85 opened 3 months ago by normster
2
buffer write is so slow
#67 opened 3 months ago by Yu-Shi
1
Cannot Interpret result of bff deduplication
#80 opened 3 months ago by ethany21
2
Availability of DCLM data mixes used in figure 3
#77 opened 4 months ago by IanMagnusson
4
Unable to ray up (part 2)
#79 opened 3 months ago by tonychenxyz
4
fasttext cannot be found
#78 opened 4 months ago by tonychenxyz
7
Unable to ray up
#69 opened 4 months ago by tonychenxyz
8
How do we just download the data necessary to enter competition?
#76 opened 4 months ago by davidbrandfonbrener
2
deduplication removes 98% of my data
#71 opened 4 months ago by Yu-Shi
2
Any plans to release pools after refinedweb heuristic filtering + dedup?
#59 opened 5 months ago by CodeCreator
9
Missing train_fasttext_classifier.py
#72 opened 4 months ago by yuzc19
2
Dedup methods
#54 opened 5 months ago by ch-shin
1
TypeError: Couldn't cast array of type
#66 opened 4 months ago by shizhediao
2
What is the pretrain scripts?
#68 opened 4 months ago by mathfinder
12
Need multi-node training script example
#70 opened 4 months ago by LeoXinhaoLee
2
Training on data with a fixed order
#65 opened 4 months ago by Yu-Shi
2
CommonCrawl WARC files for building mlfoundations/dclm-pool-400m-1x
#64 opened 5 months ago by Pab1x
2
Training crashes after some steps
#62 opened 5 months ago by Yu-Shi
8
Missing "default_dataset_yaml" for tokenization
#63 opened 5 months ago by chenweize1998
2
Instruction on DCLM-Baseline reproduction and Filtering track
#44 opened 5 months ago by chenweize1998
5
About the `--num_checkpoints` argument in pretraining
#61 opened 5 months ago by Yu-Shi
2
Missing baselines/mappers/banlists/refinedweb_banned_domains_curated.txt
#55 opened 5 months ago by yuzc19
2
Example command using ray_processing/process.py
#47 opened 5 months ago by tonychenxyz
4
Does a Higher `fasttext_oh_eli5_vs_rw_v2_prob` Indicate Better Data Quality?
#53 opened 5 months ago by huyiwen
1
Local data processing (non-AWS)
#52 opened 5 months ago by ryoungj
1
How to train and fine-tuning model
#34 opened 5 months ago by Jackjiayou
1
Getting path issue when trying to load language model
#37 opened 5 months ago by humzaiqbal
2
Training variance
#48 opened 5 months ago by ttccxx
6
The lack of the experimental environment introduction in the paper
#49 opened 5 months ago by flyflypeng
4
What is the pearson correlation in lighteval scores between 1B/400M model and 7B model?
#46 opened 5 months ago by ZefanW
1
Missing training model configs
#41 opened 5 months ago by ch-shin
1
Release of Trained Models on DCLM-Baseline
#50 opened 5 months ago by m1k2zoo
1
Is the model architecture of DCLM different from LLaMA?
#45 opened 5 months ago by czczup
2
BFF code?
#33 opened 5 months ago by luludus
1
Missing FastText Config File
#40 opened 5 months ago by purefall
1