Issues
How Can I extract Last Layer Representation?
#86 opened by shantanu778 - 2
How is the autoregressive loss handled?
#82 opened by BabyCNM - 2
NO dropout in MLP and CausalSelfAttention
#29 opened by peter-ni-noob - 0
Avoid tiktoken.decode panic on unknown tokens.
#81 opened by IggShaman - 0
torch.compile-d models do not work with example generation and hellaswag eval
#79 opened by IggShaman - 8
Cannot get the log file "log124M_40B/log.txt"?
#47 opened by dtdo90 - 0
Running codes on Windows issues
#45 opened by gerardaristizabalpla4 - 7
Sharding the dataset not completing?
#25 opened by dustinwloring1988 - 2
Different inference results between flash attention and manually implemented attention appeared.
#50 opened by Jaeckel-d - 1
Executing with 1 GPU raises "OutOfMemory Exception", executing with 2 GPUs "RuntimeError: CUDA error: invalid device ordinal"
#41 opened by nmerkle - 1
Is dataloader making optimal batches?
#31 opened by paraschopra - 4
Implement tensor parallelism
#17 opened by marib00 - 11
Embeddings are initialized with std of 0.02
#18 opened by eryk-mazus