Python script to demonstrate how to invoke models such as WizardCoder from the command-line, with bitsandbytes 4-bit quantization.
Intends to support the following models:
- WizardLM/WizardCoder-Python-7B-V1.0
- WizardLM/WizardCoder-Python-13B-V1.0
- WizardLM/WizardCoder-Python-34B-V1.0
- codellama/CodeLlama-7b-Instruct-hf
- codellama/CodeLlama-13b-Instruct-hf
- codellama/CodeLlama-34b-Instruct-hf
CodeLlama models were trained on 16000 token sequences.
WizardCoder was finetuned on 2048 token sequences.
WizardCoder-Python-34B-V1.0 surpasses GPT-4 (the 2023/03/15 version), ChatGPT-3.5 and Claude 2 on the HumanEval benchmark.
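For reference, 4-bit loading with bitsandbytes boils down to passing a quantization config to transformers. The sketch below is illustrative only; the exact arguments (compute dtype, quant type, device_map) are assumptions, not necessarily what scripts/wizard_play.py uses:

```python
# Illustrative sketch: load a model in 4-bit via bitsandbytes + transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = 'WizardLM/WizardCoder-Python-7B-V1.0'
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map='auto',
)
```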
All instructions are written assuming your command-line shell is bash.
Clone repository:
git clone https://github.com/Birch-san/wizardcoder-play.git
cd wizardcoder-play
Create a dedicated Python environment, to avoid interfering with your current one (other Python scripts on your computer might not appreciate it if you update a bunch of packages they were relying on).
Follow the instructions for virtualenv or conda below, or neither (if you don't care what happens to other Python scripts on your computer).
Create environment:
python -m venv venv
Activate environment:
. ./venv/bin/activate
(First-time) update environment's pip:
pip install --upgrade pip
Download conda.
Skip this step if you already have conda.
Install conda:
Skip this step if you already have conda.
Assuming you're using a bash shell:
# Linux installs Anaconda via this shell script. Mac installs by running a .pkg installer.
bash Anaconda-latest-Linux-x86_64.sh
# this step probably works on both Linux and Mac.
eval "$(~/anaconda3/bin/conda shell.bash hook)"
conda config --set auto_activate_base false
conda init
Create environment:
conda create -n p311-llama python=3.11
Activate environment:
conda activate p311-llama
Ensure you have activated the environment you created above.
Install dependencies:
pip install -r requirements.txt
The PyTorch nightlies may be more performant. Until PyTorch 2.1.0 stable comes out (~October 4th), nightlies are the best way to get CUDA 12.1 support:
# CUDA
pip install --upgrade --pre torch --extra-index-url https://download.pytorch.org/whl/nightly/cu121
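If you want to confirm the nightly actually gave you CUDA 12.1 support, a quick check (standard torch APIs, nothing repo-specific):

```python
# Verify the installed torch build and its CUDA version.
import torch

print(torch.__version__)          # expect a 2.1.0 dev/nightly build string
print(torch.version.cuda)         # expect '12.1'
print(torch.cuda.is_available())  # expect True on a CUDA machine
```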
To accelerate inference and reduce memory usage, install flash-attn.
First we install the package itself:
pip install flash-attn --no-build-isolation
Then we build its rotary embeddings kernel from source (there is no officially-distributed wheel):
MAX_JOBS=2 pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
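To sanity-check both installs, you can try importing the package and the rotary extension. This is a rough smoke test under the assumption that the rotary build registers a `rotary_emb` module (which flash-attn's rotary layer imports); module names may differ across versions:

```python
# Rough smoke test; module names are assumptions based on how flash-attn's
# rotary kernel is typically packaged.
import flash_attn
import rotary_emb  # extension built from csrc/rotary above
from flash_attn.layers.rotary import apply_rotary_emb

print(flash_attn.__version__)
```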
[Building rotary from source] error: expected template-name before ‘<’ token:
If you compiled flash-attn source using nvcc 12.x (i.e. CUDA Toolkit 12), you will encounter the following error whilst compiling pybind11's cast.h header:
/home/birch/anaconda3/envs/p311-cu121-bnb-opt/lib/python3.11/site-packages/torch/include/pybind11/detail/../cast.h: In function ‘typename pybind11::detail::type_caster<typename pybind11::detail::intrinsic_type<T>::type>::cast_op_type<T> pybind11::detail::cast_op(make_caster<T>&)’:
/home/birch/anaconda3/envs/p311-cu121-bnb-opt/lib/python3.11/site-packages/torch/include/pybind11/detail/../cast.h:45:120: error: expected template-name before ‘<’ token
45 | return caster.operator typename make_caster<T>::template cast_op_type<T>();
Solution here.
From root of repository:
python -m scripts.wizard_play
Fun command-line options:
- `--model_name_or_path WizardLM/WizardCoder-Python-7B-V1.0 --prompt_style wizardcoder-python`: use WizardCoder 7B with the WizardCoder prompting style
- `--model_name_or_path codellama/CodeLlama-7b-Instruct-hf --prompt_style codellama-instruct`: use CodeLlama-7b-Instruct with the CodeLlama-Instruct prompting style
- `--flash --trust_remote_code`: enables flash attention 2 via the flash-attn library and (my fork of) togethercomputer's modeling_flash_llama.py
- `--max_new_tokens 2048`: modify maximum response length
- `--chat_memory`: enable conversation history, for multi-turn conversations (CodeLlama-Instruct was trained on this, but WizardCoder was not)
- `--initial_input 'Write a function which computes the Fibonacci sequence.'`: buffer a prompt to be submitted as soon as the model has loaded
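For context, the two prompt styles look roughly like the templates below. These are the commonly-published formats (Alpaca-style for WizardCoder, [INST] wrapping for CodeLlama-Instruct) and are not verified against this repo; check scripts/wizard_play.py for the exact strings:

```python
# Approximate prompt templates for the two styles; the repo's exact strings may differ.
WIZARDCODER_PYTHON = (
    'Below is an instruction that describes a task. '
    'Write a response that appropriately completes the request.\n\n'
    '### Instruction:\n{instruction}\n\n### Response:'
)
CODELLAMA_INSTRUCT = '<s>[INST] {instruction} [/INST]'

print(WIZARDCODER_PYTHON.format(instruction='Write a function which computes the Fibonacci sequence.'))
```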
You can press Ctrl+C whilst the model is generating a response, to interrupt it. If `--chat_memory` is enabled, the unfinished message does get persisted into the conversation history.
If the model is not generating a response, then Ctrl+C will exit the software.
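The control flow is roughly the following toy sketch (no model involved, purely to illustrate the interrupt behaviour described above; the real script's handling may differ):

```python
# Toy illustration: Ctrl+C mid-"generation" keeps the partial response;
# Ctrl+C at the prompt exits. No model is loaded here.
import itertools
import time

history: list[tuple[str, str]] = []

while True:
    try:
        user_input = input('$ ')
    except KeyboardInterrupt:
        break  # Ctrl+C while idle: exit

    response = ''
    try:
        for chunk in itertools.repeat('token '):  # stand-in for a token streamer
            print(chunk, end='', flush=True)
            response += chunk
            time.sleep(0.1)
    except KeyboardInterrupt:
        pass  # Ctrl+C mid-generation: keep the partial response
    print()
    history.append((user_input, response))  # unfinished responses persist too
```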
You can seed the conversation history with a previous input and forced response from the model:
python -m scripts.wizard_play --model_name_or_path codellama/CodeLlama-7b-Instruct-hf --prompt_style codellama-instruct --shot0_input "Read user's name from stdin" --shot0_response 'import sys
name = input("Enter your name: ")
print("Your name is:", name)'
This achieves two things:
- creates a memory in the conversation history
- sets an expectation for the style of response you prefer (see the sketch after this list).
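Under the hood, a seeded exchange presumably just becomes an earlier turn in the prompt. A hedged sketch of how such history might fold into a CodeLlama-Instruct multi-turn prompt (standard Llama-2 chat formatting; not verified against this repo's implementation):

```python
# Hedged sketch: fold chat history plus a new instruction into a
# CodeLlama-Instruct-style multi-turn prompt. Exact tokens/spacing may differ
# from what scripts/wizard_play.py emits.
def build_codellama_prompt(history: list[tuple[str, str]], new_instruction: str) -> str:
    prompt = ''
    for instruction, response in history:
        prompt += f'<s>[INST] {instruction} [/INST] {response} </s>'
    prompt += f'<s>[INST] {new_instruction} [/INST]'
    return prompt

seeded_history = [(
    "Read user's name from stdin",
    'import sys\nname = input("Enter your name: ")\nprint("Your name is:", name)',
)]
print(build_codellama_prompt(seeded_history, 'Print their age too.'))
```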
You can see this in action by asking the model to iterate on the solution you placed into its history:
[seed=64]$ Print their age too.
import sys
name = input("Enter your name: ")
age = input("Enter your age: ")
print("Your name is:", name, ",", "and", "your age:", age)
Note: this won't necessarily work so well for WizardCoder, which wasn't trained on multi-turn conversations.
cannot import name 'translate_llvmir_to_hsaco':
You need a triton nightly. The full error looks like this:
Failed to import transformers.models.llama.modeling_llama because of the following error (look up to see its traceback):
Failed to import transformers.generation.utils because of the following error (look up to see its traceback):
cannot import name 'translate_llvmir_to_hsaco' from 'triton._C.libtriton.triton' (unknown location)
Install a triton nightly like so:
pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly
ImportError:
Recent flash-attn releases encounter errors importing rotary embeddings. You may need to copy Dao-AILab's ops/triton directory into the flash-attn distribution you installed to site-packages.
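If you hit this, something along the following lines may work. Treat it as a hypothetical helper: the source path assumes a local clone of Dao-AILab/flash-attention, and the destination layout is an assumption about how the package is installed.

```python
# Hypothetical fix-up: copy ops/triton from a local flash-attention checkout into
# the installed flash_attn package. Paths are assumptions; adjust to your setup.
import shutil
from pathlib import Path

import flash_attn

src = Path('flash-attention/flash_attn/ops/triton')        # local clone of Dao-AILab/flash-attention
dst = Path(flash_attn.__file__).parent / 'ops' / 'triton'  # installed package in site-packages
shutil.copytree(src, dst, dirs_exist_ok=True)
print(f'copied {src} -> {dst}')
```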
This repository is itself MIT-licensed.
Includes:
- MIT-licensed code copied from Artidoro Pagnoni's qlora
- MIT-licensed code copied from Scott Logic's qlora fork (specifically evaluate.py)
- Apache-licensed code copied from MosaicML's mpt-7b-chat Huggingface Space