A large-scale RWKV v6 inference engine using the CUDA backend, supporting multi-batch generation and dynamic state switching. This project aims to simplify the deployment of RWKV model inference in a Docker container.
- Support for multi-batch generation and stream delivery
- State switching for each batch
- OpenAI-compatible API
- Dynamic RNN State Cache (2024-06-10): by dynamically caching RNN states, we reduce how often states must be regenerated and accelerate inference.
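The dynamic state cache can be pictured as a small LRU map from state name to cached RNN state. The sketch below is illustrative only (class and method names are not the project's actual implementation); it shows the eviction policy implied by the `--dynamic_state_cache_size` option:

```python
from collections import OrderedDict


class DynamicStateCache:
    """Illustrative LRU cache mapping a state name to its RNN state.

    The real engine caches per-batch RWKV states; this sketch only
    demonstrates the least-recently-used eviction behavior.
    """

    def __init__(self, max_size=64):
        self.max_size = max_size
        self._cache = OrderedDict()

    def get(self, name):
        # A hit moves the entry to the most-recently-used position,
        # so frequently used states avoid regeneration.
        if name in self._cache:
            self._cache.move_to_end(name)
            return self._cache[name]
        return None

    def put(self, name, state):
        self._cache[name] = state
        self._cache.move_to_end(name)
        # Evict the least-recently-used state when over capacity.
        while len(self._cache) > self.max_size:
            self._cache.popitem(last=False)
```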
- Install the latest PyTorch with CUDA (2.2+ tested)
- Install the requirements:
pip install -r requirements.txt
- Prepare models in the models folder
- Prepare states in the states folder
- Run the server
python rwkv_server.py --localhost 0.0.0.0 --port 8000 --debug False --workers 16 --dynamic_state_cache_size 64
- Load Model
curl http://127.0.0.1:8000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth","model_viewname":"RWKV x060 1B6 Base","model_strategy":"cuda fp16"}'
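The same load request can be sent from Python using only the standard library. This is a sketch: the payload fields mirror the curl call above, while the helper names are illustrative:

```python
import json
from urllib.request import Request, urlopen

SERVER = "http://127.0.0.1:8000"  # adjust to your deployment


def build_loadmodel_payload(model_filename, model_viewname,
                            model_strategy="cuda fp16"):
    # Field names mirror the curl example above.
    return {
        "model_filename": model_filename,
        "model_viewname": model_viewname,
        "model_strategy": model_strategy,
    }


def load_model(payload, server=SERVER):
    # POST the JSON payload to the /loadmodel endpoint.
    req = Request(server + "/loadmodel",
                  data=json.dumps(payload).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```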
- Enjoy inference via the OpenAI-compatible API!
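A minimal chat request from Python, assuming the server mounts the usual OpenAI-compatible path `/v1/chat/completions` (verify against your deployment) and that the model name matches one returned by GET /models:

```python
import json
from urllib.request import Request, urlopen


def build_chat_request(prompt, model):
    # Standard OpenAI chat-completions body; setting "model" to a
    # state name returned by /models triggers dynamic state loading.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def chat(prompt, model, server="http://127.0.0.1:8000"):
    # /v1/chat/completions is an assumed path; adjust if needed.
    body = json.dumps(build_chat_request(prompt, model)).encode("utf-8")
    req = Request(server + "/v1/chat/completions", data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        out = json.loads(resp.read().decode("utf-8"))
    return out["choices"][0]["message"]["content"]
```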
- Model Load
curl http://127.0.0.1:8000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth","model_viewname":"RWKV x060 1B6 Base","model_strategy":"cuda fp16"}'
- Add State
curl http://127.0.0.1:8000/loadstatemodel -X POST -H "Content-Type: application/json" -d '{"state_filename":"state.pth","state_viewname":"State Test"}'
- Remove All State
curl http://127.0.0.1:8000/removestatemodel -X POST -H "Content-Type: application/json" -d '{"dummy":"dummy"}'
- Get Model Names (during inference, setting the model name to one of these IDs enables dynamic state loading)
curl http://127.0.0.1:8000/models -X GET
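Listing the available names from Python, assuming an OpenAI-style response shape `{"data": [{"id": "..."}]}` (check your server's actual /models output; the helper names are illustrative):

```python
import json
from urllib.request import urlopen


def model_ids(models_response):
    # Extract the usable model/state names from an OpenAI-style
    # listing; any of these can be passed as the "model" field.
    return [entry["id"] for entry in models_response.get("data", [])]


def list_models(server="http://127.0.0.1:8000"):
    # GET /models and return just the names.
    with urlopen(server + "/models") as resp:
        return model_ids(json.loads(resp.read().decode("utf-8")))
```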
- (Done) Dynamic state cache for faster inference
- Dynamic LoRA swapping (Torch Compile...)
- RAG (Cold RAG)
- Research 4-bit inference with 4-bit matmul