Issues
This work doesn't change the kernel, but utilizes dependencies to compute a whole line?
#20 opened by ziyuhuang123 - 0
Could you provide GPU code like A100?
#19 opened by ziyuhuang123 - 1
Incorrect project requirements
#16 opened by hadipash - 2
vmem OOM on TPU
#11 opened by hxssgaa - 1
Pretrained models?
#10 opened by matteoguarrera - 10
Question: Has this been tested against the Triton Flash Attention version?
#2 opened by casper-hansen - 1
scripts/jax2hf.py error
#17 opened by liuxpro - 2
Questions about the paper
#14 opened by hiroshinoji - 10
PyTorch Implementation
#4 opened by conceptofmind - 0
Test Script Issues
#15 opened by djbyrne - 4
fine-tuning model mismatch - KeyError
#13 opened by chenwuperth - 0
JAX partitioning error when attempting to run with sequence parallelism factor not a power of 2
#9 opened by exists-forall - 1
train_dataset.download
#5 opened by lljjgg - 2
How to combine BPT with sequence parallel?
#1 opened by fanghgit