A large-scale RWKV v6 inference engine using the CUDA backend, supporting multi-batch generation and dynamic state switching. This project aims to simplify the deployment of RWKV model inference in a Docker container.
- Support for multi-batch generation and stream delivery
- State switching for each batch
- OpenAI-compatible API
- Dynamic RNN State Cache (2024-06-10): by dynamically caching RNN states, we reduce how often states must be regenerated and accelerate inference.
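The dynamic state cache can be pictured as a small LRU map from state name to cached RNN state. The sketch below is illustrative only (class and method names are not the project's actual implementation); it shows the eviction policy implied by the `--dynamic_state_cache_size` option:

```python
from collections import OrderedDict


class DynamicStateCache:
    """Illustrative LRU cache mapping a state name to its RNN state.

    The real engine caches per-batch RWKV states; this sketch only
    demonstrates the least-recently-used eviction behavior.
    """

    def __init__(self, max_size=64):
        self.max_size = max_size
        self._cache = OrderedDict()

    def get(self, name):
        # A hit moves the entry to the most-recently-used position,
        # so frequently used states avoid regeneration.
        if name in self._cache:
            self._cache.move_to_end(name)
            return self._cache[name]
        return None

    def put(self, name, state):
        self._cache[name] = state
        self._cache.move_to_end(name)
        # Evict the least-recently-used state when over capacity.
        while len(self._cache) > self.max_size:
            self._cache.popitem(last=False)
```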
- Install the latest PyTorch with CUDA (2.2+ tested)
- Install the requirements:
pip install -r requirements.txt
- Prepare models in the models folder
- Prepare states in the states folder
- Run the server
python rwkv_server.py --localhost 0.0.0.0 --port 8000 --debug False --workers 16 --dynamic_state_cache_size 64
- Load Model
curl http://127.0.0.1:8000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth","model_viewname":"RWKV x060 1B6 Base","model_strategy":"cuda fp16"}'
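The same load request can be sent from Python using only the standard library. This is a sketch: the payload fields mirror the curl call above, while the helper names are illustrative:

```python
import json
from urllib.request import Request, urlopen

SERVER = "http://127.0.0.1:8000"  # adjust to your deployment


def build_loadmodel_payload(model_filename, model_viewname,
                            model_strategy="cuda fp16"):
    # Field names mirror the curl example above.
    return {
        "model_filename": model_filename,
        "model_viewname": model_viewname,
        "model_strategy": model_strategy,
    }


def load_model(payload, server=SERVER):
    # POST the JSON payload to the /loadmodel endpoint.
    req = Request(server + "/loadmodel",
                  data=json.dumps(payload).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```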
- Enjoy inference via the OpenAI-compatible API!
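A minimal chat request from Python, assuming the server mounts the usual OpenAI-compatible path `/v1/chat/completions` (verify against your deployment) and that the model name matches one returned by GET /models:

```python
import json
from urllib.request import Request, urlopen


def build_chat_request(prompt, model):
    # Standard OpenAI chat-completions body; setting "model" to a
    # state name returned by /models triggers dynamic state loading.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def chat(prompt, model, server="http://127.0.0.1:8000"):
    # /v1/chat/completions is an assumed path; adjust if needed.
    body = json.dumps(build_chat_request(prompt, model)).encode("utf-8")
    req = Request(server + "/v1/chat/completions", data=body,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        out = json.loads(resp.read().decode("utf-8"))
    return out["choices"][0]["message"]["content"]
```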
- Model Load
curl http://127.0.0.1:8000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth","model_viewname":"RWKV x060 1B6 Base","model_strategy":"cuda fp16"}'
- Add State
curl http://127.0.0.1:8000/loadstatemodel -X POST -H "Content-Type: application/json" -d '{"state_filename":"state.pth","state_viewname":"State Test"}'
- Remove All State
curl http://127.0.0.1:8000/removestatemodel -X POST -H "Content-Type: application/json" -d '{"dummy":"dummy"}'
- Get Model Names (during inference, setting the model name to one of these IDs enables dynamic state loading)
curl http://127.0.0.1:8000/models -X GET
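Listing the available names from Python, assuming an OpenAI-style response shape `{"data": [{"id": "..."}]}` (check your server's actual /models output; the helper names are illustrative):

```python
import json
from urllib.request import urlopen


def model_ids(models_response):
    # Extract the usable model/state names from an OpenAI-style
    # listing; any of these can be passed as the "model" field.
    return [entry["id"] for entry in models_response.get("data", [])]


def list_models(server="http://127.0.0.1:8000"):
    # GET /models and return just the names.
    with urlopen(server + "/models") as resp:
        return model_ids(json.loads(resp.read().decode("utf-8")))
```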
- (Done) Dynamic state cache for faster inference
- Dynamic LoRA swapping (Torch Compile...)
- RAG (Cold RAG)
- Research 4-bit inference with 4-bit matmul