Mistral Transformer
This repository contains minimal code to run our 7B model.
Blog: https://mistral.ai/news/announcing-mistral-7b/
Discord: https://discord.com/invite/mistralai
Installation
pip install -r requirements.txt
Download the model
wget https://files.mistral-7b-v0-1.mistral.ai/mistral-7B-v0.1.tar
tar -xf mistral-7B-v0.1.tar
Run the model
python -m main demo /path/to/mistral-7B-v0.1/
# To give your own prompts
python -m main interactive /path/to/mistral-7B-v0.1/
Change temperature or max_tokens using:
python -m main interactive /path/to/mistral-7B-v0.1/ --max_tokens 256 --temperature 1.0
If you want a self-contained implementation, look at one_file_ref.py, or run it with
python -m one_file_ref /path/to/mistral-7B-v0.1/
This is a test of the emergency broadcast system. This is only a test.
If this were a real emergency, you would be told what to do.
This is a test
=====================
This is another test of the new blogging software. I’m not sure if I’m going to keep it or not. I’m not sure if I’m going to keep
=====================
This is a third test, mistral AI is very good at testing. 🙂
This is a third test, mistral AI is very good at testing. 🙂
This
=====================
To test that logits are equivalent with and without chunking and sliding window, run
python -m test_generate
Sliding window attention
Vanilla attention
Attention is how information is shared between tokens in a sequence. In vanilla transformers, attention follows a causal mask: each token in the sequence can attend to itself and all the tokens in the past. This ensures that the model is causal, i.e. it can only use information from the past to predict the future.
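As a minimal sketch of the causal mask described above (plain Python, no dependencies; the function name `causal_mask` is hypothetical and not part of this repository), position i may attend to position j only when j is at or before i:

```python
def causal_mask(seq_len: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# Print the mask for a 4-token sequence: "x" = allowed, "." = masked out.
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```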
Sliding window to speed up inference and reduce memory pressure
The number of operations in attention is quadratic in the sequence length, and the memory pressure is linear in the sequence length. At inference time, this incurs higher latency and lower throughput due to reduced cache availability. To alleviate this issue, we use sliding window attention [1,2]: each token can attend to at most W tokens in the past (here, W=3).
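The sliding window restriction can be illustrated with a small mask. This is a sketch only: the function names are hypothetical, and it assumes the window of W counts the current token itself (so each token sees itself plus up to W-1 earlier tokens).

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when j is at or before i and within `window` positions of i.

    Assumption: the window includes the current token itself.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


mask = sliding_window_mask(5, 3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# .xxx.
# ..xxx
print(attended_pairs(mask))  # 12, versus 15 for a full causal mask of length 5
```

With a fixed W, the number of attended pairs grows linearly with the sequence length instead of quadratically.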
Note that tokens outside the sliding window still influence next word prediction. At each attention layer, information can move forward by W tokens at most: after two attention layers, information can move forward by 2W tokens, etc. For instance in a sequence of length 16K and a sliding window of 4K, after 4 layers, information has propagated to the full sequence length.
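The arithmetic in this example can be checked with a short helper. It follows the approximation used in the text (information moves forward by at most W tokens per layer); `layers_to_cover` is a hypothetical name, not a function in this repository.

```python
import math


def layers_to_cover(seq_len: int, window: int) -> int:
    """Smallest number of attention layers k such that k * window >= seq_len."""
    return math.ceil(seq_len / window)


# 16K-token sequence with a 4K sliding window.
print(layers_to_cover(16 * 1024, 4 * 1024))  # 4
```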
Empirically, we see that longer contexts do help even outside the sliding window, but when the sequence length becomes too large, the model no longer uses the full context.
Rolling buffer cache
We implement a rolling buffer cache. The cache has a fixed size of W, and we store the (key, value) for position i in cache position i % W. Once the position i reaches W, past values in the cache are overwritten.
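A minimal sketch of the i % W indexing, using strings in place of real key and value tensors. The class `RollingBufferCache` and its methods are hypothetical and only illustrate the idea; they are not the cache implementation in this repository.

```python
class RollingBufferCache:
    """Fixed-size cache: the entry for position i lives in slot i % window."""

    def __init__(self, window: int):
        self.window = window
        # Each slot holds (position, key, value), or None if never written.
        self.slots = [None] * window

    def store(self, position: int, key, value) -> None:
        # Overwrites whatever older position previously mapped to this slot.
        self.slots[position % self.window] = (position, key, value)

    def contents(self) -> list:
        """Return the cached entries ordered by position."""
        filled = [slot for slot in self.slots if slot is not None]
        return sorted(filled, key=lambda slot: slot[0])


cache = RollingBufferCache(window=3)
for i in range(5):
    cache.store(i, f"k{i}", f"v{i}")

# Positions 0 and 1 were overwritten by positions 3 and 4.
print([position for position, _, _ in cache.contents()])  # [2, 3, 4]
```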
Pre-fill and chunking
When generating a sequence, we need to predict tokens one-by-one, as each token is conditioned on the previous ones. However, the prompt is known in advance, and we can pre-fill the (k, v) cache with the prompt. If the prompt is very large, we can chunk it into smaller pieces, and pre-fill the cache with each chunk. For this we can choose as chunk size the window size. For each chunk, we thus need to compute the attention over the cache and over the chunk.
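The sketch below tracks only which positions each prompt token attends to during chunked pre-fill, not the actual tensors. It assumes the chunk size equals the window size and that the window includes the current token; the function names are hypothetical.

```python
def chunk_prompt(tokens: list, chunk_size: int) -> list:
    """Split the prompt into consecutive chunks of at most `chunk_size` tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def prefill_attended_positions(num_tokens: int, window: int) -> list:
    """For each prompt position, list the positions it attends to during pre-fill."""
    cache = []      # positions currently held in the cache (at most `window`)
    attended = []
    for chunk in chunk_prompt(list(range(num_tokens)), window):
        for pos in chunk:
            # Attend over the cache and over the causal part of the current chunk.
            visible = cache + [p for p in chunk if p <= pos]
            # Keep only positions inside the sliding window.
            attended.append([p for p in visible if pos - p < window])
        # After the chunk is processed, the cache keeps the last `window` positions.
        cache = (cache + chunk)[-window:]
    return attended


for pos, seen in enumerate(prefill_attended_positions(7, 3)):
    print(pos, seen)
# 0 [0]
# 1 [0, 1]
# 2 [0, 1, 2]
# 3 [1, 2, 3]
# 4 [2, 3, 4]
# 5 [3, 4, 5]
# 6 [4, 5, 6]
```

Under these assumptions, each token sees the same positions it would see if the whole prompt were processed in one pass with a sliding window mask.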
More Links
Mistral-7B-v0.1 and Mistral-7B-Instruct-v0.1 are also available on HuggingFace.
References
[1] Generating Long Sequences with Sparse Transformers, Child et al. 2019
[2] Longformer: The Long-Document Transformer, Beltagy et al. 2020