pytorch-labs/gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
Python
BSD-3-Clause
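For orientation, the repo's documented entry point is `generate.py`. A minimal sketch of an invocation (the checkpoint path and model repo below are placeholders; this assumes you have already downloaded and converted a checkpoint per the README):

```bash
# Sketch only: checkpoint path and model repo are assumptions; adjust to your setup.
python generate.py --compile \
  --checkpoint_path checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth \
  --prompt "Hello, my name is"
```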
Issues
Has anyone run this code with bs>1 and speculative decoding?
#214 opened by deafTim - 1
Error with meta-llama/Llama-3.2-1B
#211 opened by deafTim - 8
Error with stories15M and stories110M
#209 opened by deafTim - 4
The Actual Throughput of int8 Quantization is Significantly Lower than Baseline on A100
#207 opened by crhcrhcrhcrh - 0
Request for Smaller Model Options (~1B Parameters)
#210 opened by deafTim - 2
Int4 perplexity
#125 opened by SinanAkkoyun - 2
Missing Keys in state_dict
#172 opened by bjohn22 - 4
int4/int4-gptq support in Mixtral 8x7B
#129 opened by yanbing-j - 3
Tensor Parallel Inside notebook
#167 opened by nivibilla - 0
Hard-coded Llama-3 model name pattern matching breaks scripts/convert_hf_checkpoint.py
#177 opened by ephremw - 1
Activation quantization support
#194 opened by ayyoobimani - 1
mmap issue with bf16 in gpt-fast
#165 opened by yanbing-j - 1
It doesn't accelerate very well on L4
#185 opened by songh11 - 1
tokenizer.model
#186 opened by hasakikiki - 1
Reasons for the poor performance of Speculative Sampling
#198 opened by JoeNan1 - 1
permute function in `convert_hf_checkpoint.py`
#190 opened by Sohaib9920 - 2
batching/dynamic batching
#112 opened by nivibilla - 9
Support of FlashDecoding
#188 opened by jianc99 - 8
INT4 quantization not working on MI210
#154 opened by yafehlis - 0
getting different acceptance prob when using `torch.compile` after making a small change.
#184 opened by kalradivyanshu - 0
GGUF support?
#182 opened by yukiarimo - 5
Throughput Benchmark Scripts
#173 opened by HanGuo97 - 2
Input token length question
#160 opened by kaizizzzzzz - 0
Naming: n_local_heads -> n_kv_heads
#162 opened by ad8e - 2
Tiny Llamas Not Found
#150 opened by guihao-liang - 4
On the memory usage of `ConditionalFeedForward`
#149 opened by carmocca - 2
AMD RX 7900 XTX Wrong outputs
#120 opened by makaveli10 - 3
What happens to bias during int8 quantization?
#108 opened by gchhablani - 1
Reducing Latency in Application with Torch Compilation: Initialization and Inference Optimization
#127 opened by daniyal214 - 0
Can GPT-Fast support larger batch sizes?
#90 opened by yetingqiaqia - 2
pass@1 score extremely low using GPT-fast API
#94 opened by yafehlis - 2
Tried Tensor Parallel on a server with two V100s linked by NVLink, but got a performance degradation
#111 opened by duanzhaol - 3
`eval.py` uses older version of lm_eval
#89 opened by nairbv