AnswerDotAI/cold-compress
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods, built on top of GPT-Fast, a simple, PyTorch-native generation codebase.
Python | BSD-3-Clause license
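For orientation, the sketch below shows what a KV cache compression policy looks like in plain PyTorch: a recency-plus-"attention sink" eviction rule that keeps the first few tokens and a recent window. The function name, tensor shapes, and heuristic are illustrative assumptions for this sketch, not Cold Compress's actual API; the methods tracked in the issues below (ThinK, PyramidInfer, InfLLM, MInference) use more sophisticated criteria such as attention statistics.

```python
import torch

def compress_kv_cache(keys, values, window=256, n_sink=4):
    """Hypothetical eviction policy: keep the first `n_sink` tokens plus the
    most recent `window` tokens; drop everything in between.

    keys, values: [batch, n_heads, seq_len, head_dim]
    """
    seq_len = keys.shape[2]
    if seq_len <= n_sink + window:
        # Cache is still small enough; nothing to evict.
        return keys, values
    keep = torch.cat([
        torch.arange(n_sink, device=keys.device),                    # global "sink" tokens
        torch.arange(seq_len - window, seq_len, device=keys.device), # recent local window
    ])
    return keys[:, :, keep, :], values[:, :, keep, :]
```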
Issues
- How to get attention scores (#44, opened by wiluen)
- Question of evaluation (#43, opened by freeSoul-SNU)
- Does the repo support quantization methods? Does the repo support kv merge methods? (#40, opened by foreverpiano)
- torch dependency results in error (#41, opened by maxjeblick)
- Implement ThinK (#36, opened by griff4692)
- Implement PyramidInfer (#35, opened by griff4692)
- Implement InfLLM (#30, opened by griff4692)
- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (#29, opened by griff4692)
- Llama3 GIST (#2, opened by griff4692)
- Profile Llama3 Attention Heads (#4, opened by griff4692)
- Record Model Speed in evals (#6, opened by griff4692)
- Record memory consumption (#7, opened by griff4692)
- Experiment with Fixed Global Tokens (#5, opened by griff4692)