pytorch-labs/gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
Python
BSD-3-Clause
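For orientation, the repo's documented entry point is `generate.py`. A minimal sketch of an invocation (the checkpoint path and model repo below are placeholders; this assumes you have already downloaded and converted a checkpoint per the README):

```bash
# Sketch only: checkpoint path and model repo are assumptions; adjust to your setup.
python generate.py --compile \
  --checkpoint_path checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth \
  --prompt "Hello, my name is"
```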
Issues
Has anyone run this code with bs>1 and speculative decoding?
#214 opened by deafTim - 1
Error with meta-llama/Llama-3.2-1B
#211 opened by deafTim - 8
Error with stories15M and stories110M
#209 opened by deafTim - 4
The Actual Throughput of int8 Quantization is Significantly Lower than Baseline on A100
#207 opened by crhcrhcrhcrh - 0
Request for Smaller Model Options (~1B Parameters)
#210 opened by deafTim - 2
Int4 perplexity
#125 opened by SinanAkkoyun - 2
Missing Keys in state_dict
#172 opened by bjohn22 - 4
int4/int4-gptq support in Mixtral 8x7B
#129 opened by yanbing-j - 3
Tensor Parallel Inside notebook
#167 opened by nivibilla - 0
Hard-coded Llama-3 model name pattern matching breaks scripts/convert_hf_checkpoint.py
#177 opened by ephremw - 1
Activation quantization support
#194 opened by ayyoobimani - 1
mmap issue with bf16 in gpt-fast
#165 opened by yanbing-j - 1
It doesn't accelerate very well on L4
#185 opened by songh11 - 1
tokenizer.model
#186 opened by hasakikiki - 1
Reasons for the poor performance of Speculative Sampling
#198 opened by JoeNan1 - 1
permute function in `convert_hf_checkpoint.py`
#190 opened by Sohaib9920 - 2
batching/dynamic batching
#112 opened by nivibilla - 9
Support of FlashDecoding
#188 opened by jianc99 - 8
INT4 quantization not working on MI210
#154 opened by yafehlis - 0
getting different acceptance prob when using `torch.compile` after making a small change.
#184 opened by kalradivyanshu - 0
GGUF support?
#182 opened by yukiarimo - 5
Throughput Benchmark Scripts
#173 opened by HanGuo97 - 2
Input token length question
#160 opened by kaizizzzzzz - 0
Naming: n_local_heads -> n_kv_heads
#162 opened by ad8e - 2
Tiny Llamas Not Found
#150 opened by guihao-liang - 4
On the memory usage of `ConditionalFeedForward`
#149 opened by carmocca - 2
AMD RX 7900 XTX Wrong outputs
#120 opened by makaveli10 - 3
What happens to bias during int8 quantization?
#108 opened by gchhablani - 1
Reducing Latency in Application with Torch Compilation: Initialization and Inference Optimization
#127 opened by daniyal214 - 0
Can GPT-Fast support larger batch sizes?
#90 opened by yetingqiaqia - 2
pass@1 score extremely low using GPT-fast API
#94 opened by yafehlis - 2
Tried Tensor Parallel on a server with two V100s linked by NVLink, but got a performance degradation
#111 opened by duanzhaol - 3
`eval.py` uses older version of lm_eval
#89 opened by nairbv