Issues
Multi-node training
#305 opened by LeoXinhaoLee - 1
NotImplementedError running HF model "mlfoundations/dclm-7b-it" for inference
#303 opened by neginraoof - 0
How to pretrain on DCLM-BASELINE
#304 opened by mathfinder - 0
Webdataset version issue
#301 opened by GeorgiosSmyrnis - 2
I got an error in open lm installation
#297 opened by orhanerday - 0
Fine-Tuned Models for open_lm
#296 opened by OLMResearch - 1
composer ICL metrics deprecated
#288 opened by ysharma1126 - 0
Remote Sync FSSPEC cannot upload large checkpoints
#279 opened by Skylion007 - 4
"Number of shards requested for a single epoch is more than the number of shards available" in the middle of a training run
#189 opened by afang-story - 6
xfomers installation failed
#267 opened by stevensf1998 - 0
Reduce logging when --torchcompile is passed
#261 opened by achalddave - 1
MoE performs worse than equivalent dense model?
#253 opened by Muennighoff - 6
Make torch.compile work with fsdp and xformers
#72 opened by sagadre - 1
Fix tokenize shuffle issues (speed + correctness)
#212 opened by Vaishaal - 0
MoE Expert parallelism config
#251 opened by Muennighoff - 1
Someone is using your project to sell it as a token
#247 opened by yzthink - 3
Import from attention.py error
#202 opened by sedrick-keh-tri - 0
Support user specified token pre-processing functions
#194 opened by sagadre - 0
Factorize helper function for all model loading
#181 opened by sagadre - 0
Use distributed when world_size=1 if requested
#170 opened by achalddave - 0
grad accum tests failing on gpu w/ amp_bf16 precision
#171 opened by sagadre - 0
`--delete-previous-checkpoint` should delete prev checkpoints in `--remote-sync` bucket
#166 opened by sagadre - 0
Error early if we don't have enough disk space
#154 opened by achalddave - 1
Deduplicate argparse namespace creation for tests
#156 opened by achalddave - 0
Factor out parameter error checking
#107 opened by sagadre - 3
HF Integration
#89 opened by sedrick-keh-tri - 3
Add test for checkpoint loading after save
#145 opened by achalddave - 6
Figure out why AdamW + gradient accumulation leads to different results for test case
#126 opened by achalddave - 1
Minimize how often we load args.resume
#71 opened by achalddave - 0
Investigate effect of FSDP policies on mamba speed
#144 opened by sagadre - 0
Improve dataloading.
#70 opened by GeorgiosSmyrnis - 0
Move dummy cred download into test
#121 opened by achalddave - 0
clean up model_configs directory
#116 opened by kernelmachine - 1
error checking params.py
#95 opened by sagadre - 0
Dataloading Epoch Update Bug
#93 opened by sedrick-keh-tri - 0
open_lm chronicles
#90 opened by iejMac - 0
Use no_sync when doing gradient accumulation
#48 opened by achalddave - 0
Tokenization on-the-fly without slowdown
#55 opened by sagadre - 0
llama2 unit tests
#52 opened by sagadre