An example of the RWKV approach to language models written in Rust by someone that knows very little about math or neural networks. The initial version was very, very heavily based on the amazing information and Python example here: https://johanwind.github.io/2023/03/23/rwkv_details.html
Also see the RWKV creator's repository: https://github.com/BlinkDL/ChatRWKV/
- Written in Rust. Static typing can really help when trying to understand something, since it's clear what type of thing every object is.
- Doesn't depend on massive frameworks like Torch or Cuda.
- Can use all threads/cores for inference.
- Supports float32 and 8bit inference as well as 4bit inference with GGML.
- Not really optimized for performance.
- Can only quantize on the fly (can be slow for big models).
- Can only run inference on CPU.
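For a rough idea of what quantizing on the fly means, here is a minimal sketch of a generic min/max 8-bit scheme. This is an assumption for illustration only: the `QuantizedRow`, `quantize_row` and `dequantize_row` names are hypothetical and this is not necessarily the scheme the project uses.

```rust
/// A row of weights stored as 8-bit codes plus the range needed to decode them.
/// (Hypothetical type for illustration; not taken from the project source.)
struct QuantizedRow {
    codes: Vec<u8>,
    min: f32,
    scale: f32,
}

/// Map each f32 onto one of 256 evenly spaced levels between the row's min and max.
fn quantize_row(row: &[f32]) -> QuantizedRow {
    let min = row.iter().copied().fold(f32::INFINITY, f32::min);
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Guard against a zero range (all values equal) to avoid dividing by zero.
    let scale = if max > min { (max - min) / 255.0 } else { 1.0 };
    let codes = row
        .iter()
        .map(|v| ((v - min) / scale).round().clamp(0.0, 255.0) as u8)
        .collect();
    QuantizedRow { codes, min, scale }
}

/// Recover approximate f32 values from the 8-bit codes.
fn dequantize_row(q: &QuantizedRow) -> Vec<f32> {
    q.codes.iter().map(|&c| c as f32 * q.scale + q.min).collect()
}
```

Every tensor has to go through something like `quantize_row` at load time, which is why loading gets slower as the model gets bigger.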
If loading in 32bit mode it uses a lot of memory. The 3B model uses around 11GB RAM and the 7B one might just fit on a 32GB machine if you're willing to close other applications or deal with some swapping. Even loading in 8bit mode uses a fair amount of memory, but it will drop down once loading has completed.
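The numbers above line up with simple arithmetic: parameters times bytes per parameter. A small sketch, assuming round parameter counts (exactly 3 and 7 billion) and counting only the weights, not the state or temporary buffers. The `weights_gib` helper is hypothetical.

```rust
/// Approximate size of the weights alone, in GiB.
/// Ignores model state, temporary buffers and allocator overhead.
fn weights_gib(params: u64, bytes_per_param: f64) -> f64 {
    params as f64 * bytes_per_param / (1024.0 * 1024.0 * 1024.0)
}
```

With those assumptions, `weights_gib(3_000_000_000, 4.0)` is about 11.2 and `weights_gib(7_000_000_000, 4.0)` is about 26.1, while one byte per parameter (8bit) brings the 3B figure down to about 2.8.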
You'll need Rust and Cargo set up: https://www.rust-lang.org/learn/get-started
You will need to download a RWKV model. Here's a link to get you started (about 820MB): https://huggingface.co/BlinkDL/rwkv-4-pile-430m/resolve/main/RWKV-4-Pile-430M-20220808-8066.pth
Also the tokenizer here: https://github.com/BlinkDL/ChatRWKV/blob/main/20B_tokenizer.json
PyTorch model files can be loaded directly. If the file name ends with .pt or .pth it will be loaded as a PyTorch model. If it ends with .st or .safetensors then it will be loaded as SafeTensors. Note: The PyTorch support is currently experimental and may not function correctly. You will likely just immediately get an error if there is a problem, so it shouldn't be dangerous to try that approach. If you want, you can disable the torch feature and only build support for SafeTensors format files.
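The extension rule can be sketched like this. The `ModelFormat` enum and `detect_format` function are hypothetical names, not the project's actual API, and the match is case sensitive here, which is an assumption.

```rust
use std::path::Path;

/// The two on-disk formats described above (hypothetical enum for illustration).
#[derive(Debug, PartialEq)]
enum ModelFormat {
    PyTorch,
    SafeTensors,
}

/// Pick a loader based only on the file extension.
/// Returns None when there is no extension or it isn't recognized.
fn detect_format(path: &Path) -> Option<ModelFormat> {
    match path.extension()?.to_str()? {
        "pt" | "pth" => Some(ModelFormat::PyTorch),
        "st" | "safetensors" => Some(ModelFormat::SafeTensors),
        _ => None,
    }
}
```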
After that, you should just be able to cargo run --release. You can try compiling without --release but it's likely everything will be insanely slow. Also try cargo run --release -- --help to see commandline options.
Note: The default is to use all logical cores, see the commandline options.
You can optionally convert the .pth model file to SafeTensors format. Look at utils/pth_to_safetensors.py for an example. To do this, you'll need the safetensors and torch Python packages set up. I suggest doing this in a virtual environment. Currently there isn't much of an advantage to this step as the Torch files can be loaded directly in the current version.
GGML support currently needs a patched version of ggml and ggml-sys from the llama-rs project. The Cargo.toml is set up to point to the correct branch in my fork, but this will go away once the necessary changes are merged into GGML. Naturally this repo will be updated, but bear in mind that compiles of an older version may start failing, since that branch will eventually be removed.
Note: This part is kind of outdated now. I still recommend reading the links below though. Note also that the description is based on a simpler version of the RWKV model with only four states per layer. The full version has five.
Here is a (possibly wrong) high level description of the steps involved in evaluating the model.
You will need to refer to the source in smolrwkv/src/simple/model.rs for this to make sense.
Also, strongly consider reading these first:
- https://johanwind.github.io/2023/03/23/rwkv_overview.html — High level explanation.
- https://johanwind.github.io/2023/03/23/rwkv_details.html — More detailed explanation with a Python example.
By the way, fun fact: "Tensor" sounds real fancy but it's basically just an array. A one dimensional tensor is just a one dimensional array, a two dimensional tensor is a two dimensional array. They can have special properties (like being immutable) but that doesn't matter for understanding the concept in general. If you know arrays, you have the general idea of tensors already.
To evaluate a token:
- Calculate an initial value for x from ln0.
- Feed this x to each layer sequentially, using the x the layer generated for the next one.
  - Take x that got fed in.
  - Apply ln1 to x and feed it to time mixing. This uses tensors from the attention part of the model.
    - Take tm_state from the layer state and call it last_x. (Why? Who knows!)
    - Take tm_num and tm_den as last_num, last_den.
    - Do a bunch of fancy math stuff I'm not qualified to explain.
    - The above calculated new values for tm_[state,num,den] so update your layer state with these.
    - Also return x that resulted from the calculations.
  - Add the x from time mixing to x (x += time_mixing_x).
  - Apply ln2 to x and feed it to channel mixing. This uses tensors from the feed forward network part of the model.
    - Take cm_state from the layer state and call it last_x.
    - More fancy math stuff (less involved than time mixing though).
    - As with time mixing, this will calculate a new cm_state so update the layer state.
    - Return x that resulted from the channel mixing calculation.
  - Add the x from channel mixing to x.
- Do fancy math stuff to the x that was the result after evaluating the last layer.
- Return it as the list of probabilities for each token.
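The per-layer part of the steps above can be sketched as follows. This only shows the shape of the data flow: `layer_norm` is ordinary layer normalization (which is what I assume ln0, ln1 and ln2 are), while the time mixing and channel mixing math is passed in as placeholder closures rather than implemented. The `LayerState` and `evaluate_layer` names are hypothetical and don't necessarily match model.rs.

```rust
/// Per-layer state, using the four-state naming from the steps above.
#[derive(Debug, Default, Clone, PartialEq)]
struct LayerState {
    tm_state: Vec<f32>,
    tm_num: Vec<f32>,
    tm_den: Vec<f32>,
    cm_state: Vec<f32>,
}

/// Standard layer normalization: shift to zero mean, scale to unit variance,
/// then apply the learned per-element weight and bias.
fn layer_norm(x: &[f32], weight: &[f32], bias: &[f32]) -> Vec<f32> {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    let var = x.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let denom = (var + 1e-5).sqrt(); // small epsilon avoids dividing by zero
    x.iter()
        .zip(weight)
        .zip(bias)
        .map(|((v, w), b)| (v - mean) / denom * w + b)
        .collect()
}

/// Run one layer: normalize, time mix, add; normalize, channel mix, add.
/// The mixing closures stand in for the "fancy math" and are expected to
/// update their own parts of the layer state.
fn evaluate_layer(
    x: &mut [f32],
    state: &mut LayerState,
    ln1: impl Fn(&[f32]) -> Vec<f32>,
    ln2: impl Fn(&[f32]) -> Vec<f32>,
    time_mixing: impl Fn(&[f32], &mut LayerState) -> Vec<f32>,
    channel_mixing: impl Fn(&[f32], &mut LayerState) -> Vec<f32>,
) {
    // x += time_mixing(ln1(x))
    let normed = ln1(&*x);
    let tm_x = time_mixing(&normed[..], &mut *state);
    for (xi, ti) in x.iter_mut().zip(&tm_x) {
        *xi += ti;
    }
    // x += channel_mixing(ln2(x))
    let normed = ln2(&*x);
    let cm_x = channel_mixing(&normed[..], &mut *state);
    for (xi, ci) in x.iter_mut().zip(&cm_x) {
        *xi += ci;
    }
}
```

Evaluating a token is then just calling `evaluate_layer` once per layer with the same `x`, each layer keeping its own `LayerState` between tokens.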
The model has a list of tokens it "knows". Sometimes a token is equal to a word, sometimes it's just part of a word. There are usually a large number of tokens, in the range of 30,000-60,000. I believe the current RWKV models have 50,277 tokens. Anyway, you'll get a list of 50,277 floating point numbers back after running the model.
The highest value from that list is the token the model predicts is the most likely continuation and so on. If you generate a sorted list of the top 10-40 or so token probabilities and select one randomly, you'll get fairly reasonable output, relatively speaking. Fair to say a tiny 430M model doesn't produce the most reasonable output in general.
Good explanation of how to handle the next step once you have the list of probabilities: https://huggingface.co/blog/how-to-generate
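The "take the top few and pick one" idea described above can be sketched like this. The function names are hypothetical, and to keep it free of dependencies the random number `r` (expected to be in the range 0 to 1) is supplied by the caller instead of coming from an RNG crate.

```rust
/// Return the `k` most likely (token_id, probability) pairs, most likely first.
fn top_k(probs: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    // Sort descending by probability; total_cmp gives f32 a total order.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k);
    indexed
}

/// Pick one candidate uniformly, given a random number `r` in [0, 1).
/// Returns the chosen token id, or None if there are no candidates.
fn pick_uniform(candidates: &[(usize, f32)], r: f32) -> Option<usize> {
    if candidates.is_empty() {
        return None;
    }
    let idx = ((r * candidates.len() as f32) as usize).min(candidates.len() - 1);
    Some(candidates[idx].0)
}
```

Picking uniformly matches the simple description here. Sampling in proportion to each candidate's probability is a common alternative and is covered in the linked article.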
There's various complicated math stuff involved in evaluating the model, but the only thing that really matters is the matrix multiplication (pardot in the source). In the case of RWKV it's matrix-vector multiplication (a 2D array multiplied with a 1D array). >90% of the time spent evaluating the model is in those matrix multiplication calls.
In non-GGML mode, the math/array handling here uses the ndarray crate. It provides a .dot function, however this will never actually calculate a matrix-vector multiplication in parallel even though the crate claims threading support. Because this calculation is so critical for performance, I ended up writing my own function to split the calculation into chunks and run it in parallel. See the functions in the dumdot module in smolrwkv/src/util.rs.
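To illustrate the chunking idea, here is a minimal sketch using only the standard library. It is not the code from the dumdot module: it uses a flat row-major slice instead of ndarray types and std::thread::scope for the threads, and the `par_matvec` name is hypothetical.

```rust
use std::thread;

/// Multiply a row-major `rows` x `cols` matrix by a vector, splitting the
/// rows into roughly equal chunks and computing each chunk on its own thread.
fn par_matvec(
    matrix: &[f32],
    rows: usize,
    cols: usize,
    vector: &[f32],
    n_threads: usize,
) -> Vec<f32> {
    assert_eq!(matrix.len(), rows * cols);
    assert_eq!(vector.len(), cols);
    let mut out = vec![0.0f32; rows];
    if rows == 0 || cols == 0 {
        return out;
    }
    // Rows per thread, rounded up so every row is covered.
    let n = n_threads.max(1);
    let chunk_rows = (rows + n - 1) / n;
    thread::scope(|s| {
        // Each thread gets a disjoint slice of the output and the matching rows.
        for (out_chunk, mat_chunk) in out
            .chunks_mut(chunk_rows)
            .zip(matrix.chunks(chunk_rows * cols))
        {
            s.spawn(move || {
                for (o, row) in out_chunk.iter_mut().zip(mat_chunk.chunks(cols)) {
                    *o = row.iter().zip(vector).map(|(a, b)| a * b).sum();
                }
            });
        }
    });
    out
}
```

Each output element only depends on one row of the matrix, so the chunks never need to talk to each other, which is what makes this so easy to split up.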
The fact that you get a list of probabilities back and no definite "answer" from the model seems like a decent counterargument to the idea that LLMs are or could be conscious in some way. When you look at output from an LLM, a lot of the time you aren't even going to be seeing the most likely token. Also, fun fact: When you feed a prompt to a model, it comes up with a list of probabilities just like when you're asking it for a response. However, those probabilities are just thrown away except for the result after processing the very last prompt token.
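The prompt handling described above looks roughly like this. The `feed_prompt` function is a hypothetical sketch, and `evaluate` stands in for "run the model on one token and return the probabilities" (updating the model state as a side effect).

```rust
/// Feed every prompt token through the model, keeping only the final output.
/// Returns None for an empty prompt.
fn feed_prompt(
    prompt_tokens: &[usize],
    mut evaluate: impl FnMut(usize) -> Vec<f32>,
) -> Option<Vec<f32>> {
    let mut last = None;
    for &token in prompt_tokens {
        // Each call still advances the model state, but the previous
        // probabilities are overwritten, i.e. thrown away.
        last = Some(evaluate(token));
    }
    last
}
```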
Prompt in bold. So, are the dragons tree snakes or dogs? The world may never know!
* Loading tokenizer from: ./20B_tokenizer.json
* Loading model from: ./RWKV-4-Pile-430M-20220808-8066.safetensors
* Discovering model structure.
- Loading layer 1/24
[...]
- Loading layer 24/24
* Loading non-layer tensors.
* Loaded: layers=24, embed=1024, vocab=50277
**In a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese.**
These dragons all spoke different dialects and these dialects didn’t match the dogs' native language.
In an attempt to decipher what these dragons spoke, they called the dragons and found that their language was different from human.
"The Dragons understood human words and more precisely human languages. The dragons spoke the human language. They also understood the rules for Chinese,” the research team told Mongabay.
By conducting the research, they are hoping to shed light on the mysterious history of the dragons in the remote, remote regions of the world, especially in Tibet.
The research project, published in the journal Open Science, also shows that dragons are, in fact, reptiles, or a.k.a. tree snakes.
Dragon, not snake
According to the research team, the dragons found in Tibet are a race of dogs, not a reptile.
While the research team was still unable to come up with any explanation as to why these dragons live in Tibet, it was previously believed that they were most likely present on land near the Tibetan plateau.
"The dragons live there as part of the great Qinghai-Tibet Plateau that is almost completely undisturbed and the entire Qinghai-Tibet plateau was gradually converted to an agricultural state. Therefore, they have a distinctive pattern of chewing on the trees, and probably the animals are not too big to be kept in nature," the researchers explained.