sangmichaelxie/doremi
PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
HTML · MIT
Issues
Question about Group DRO implementation
#33 opened by NicholasCorrado - 0
Request for Redpajama Dataset Weights
#32 opened by desomeboy - 2
Question about 8B model architecture
#28 opened by Qinghao-Hu - 11
Cuda version problem
#27 opened by RRaphaell - 3
Question about model initialization
#30 opened by MAxx8371 - 17
Cannot reproduce the results shown in Github repo with the 120M reference model on A800 (8*80G).
#20 opened by kiseliu - 3
List of pinned requirements / Dockerfile?
#19 opened by filipg7777 - 1
Speed decrease during training
#24 opened by ljb121002 - 2
Questions about directly applying the weights from paper or the repo to train main model
#23 opened by clarkkent0618 - 1
Edge Case Discussion
#21 opened by thangld201 - 4
Question about optimized weights in the paper
#18 opened by yuzc19 - 1
Training time for baseline model and proxy model
#17 opened by yuzc19 - 4
easy HF dataset doremi?
#10 opened by brando90 - 1
loss computation wrong?
#9 opened by tt6746690 - 1
Question about Flash-attention version.
#12 opened by kiseliu - 3
Domain weights are mostly near one-hot
#5 opened by xiamengzhou - 1
Multi-nodes support
#6 opened by binxuan - 1
about loss
#3 opened by ywb2018 - 5
step 1 baseline_280M loss large
#1 opened by gawei1995 - 1
Adding a license
#2 opened by virtualzx-nad