To set up the environment for this project, please follow the steps below. The commands will install the required dependencies, including specific versions of PyTorch, Transformers, and other libraries.
- Python 3.8 or higher
- CUDA 12.1 for GPU acceleration (recommended)
-
Install PyTorch and related libraries
First, install PyTorch, Torchvision, and Torchaudio with CUDA support using the following command:
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
-
Install other dependencies
Install the required libraries for Transformers, Accelerate, Datasets, and others:
pip install transformers==4.36.2 pip install accelerate==0.26.1 pip install datasets==2.16.1 pip install einops pip install protobuf pip install sentencepiece pip install typing-extensions
Use the following commands to run various experiments with different models and datasets. Make sure to adjust the CUDA_VISIBLE_DEVICES according to your GPU availability.
-
Static Tree Method
Running Greedy Speculative Decoding with JackFram/llama-68m on CNN Dataset
Command Examples
CUDA_VISIBLE_DEVICES=7 python test_greedyS.py \ --model JackFram/llama-68m \ --target meta-llama/Llama-2-13b-hf \ --T 0.6 \ --P 0.9 \ --start 0 \ --end 200 \ --Mode greedy \ --M 1024 \ --growmap ./growmaps/68m_7b-64.pt \ --dataset cnn
Running Greedy Speculative Decoding with OpenWebText Dataset, dataset you can choose others like openwebtext
-
Common-Sense Token Verfication
Running Greedy Speculative Decoding with Common-Sense Token Verfication on CNN Dataset, it is cool let's try it! It can improve the average acceptance length from about 0.3 to 0.5
Command Examples
CUDA_VISIBLE_DEVICES=7 python test_greedyS_commonsense.py \ --model JackFram/llama-68m \ --target meta-llama/Llama-2-13b-hf \ --T 0.6 \ --P 0.9 \ --start 0 \ --end 200 \ --Mode greedy \ --M 1024 \ --growmap ./growmaps/68m_7b-64.pt \ --dataset cnn
-
Dynamic-Tree Algorithm
It is faster than static tree, especially when temperature is high.
Command Examples
CUDA_VISIBLE_DEVICES=7 python test_greedyS_dy.py \ --model JackFram/llama-68m \ --target meta-llama/Llama-2-13b-hf \ --T 0.6 \ --P 0.9 \ --start 0 \ --end 200 \ --Mode greedy \ --M 1024 \ --growmap ./growmaps/68m_7b-64.pt \ --dataset cnn
-
Run kl Divergency test in advance
We perform a method to test draft model potential in advance using KL Divergency.
Command Examples
CUDA_VISIBLE_DEVICES=7 python test_greedyS_logit.py \ --model JackFram/llama-68m \ --target meta-llama/Llama-2-13b-hf \ --T 0.6 \ --P 0.9 \ --start 0 \ --end 3500 \ --Mode baseline \ --M 1024 \ --growmap ./growmaps/68m_7b-64.pt \ --dataset openwebtext
After running it, you can get a potential score, for the same dataset small score means a strong draft model.
-
Spec-LLava
We perform a method to acclerate llava
First, you can try our draft llava model, which very similar to llava original model, llava-160m and llava-68m, their peformance is good, can be used in some small model scenes, like robotics, driving!
Command Examples
processor = AutoProcessor.from_pretrained("anthonycmu/llava-1.5-160m") processor = AutoProcessor.from_pretrained("anthonycmu/llava-1.5-68m")
Next, you can run the following command to test llava acceleration. You need to first download coco test-2015 dataset, or other dataset, modify dataset path in 64 row of the code, which can acclerate vlm in a good performance!
CUDA_VISIBLE_DEVICES=7 python test_llava.py \ --model anthonycmu/llava-1.5-160m\ --target llava-hf/llava-1.5-7b-hf\ --T 0.6 \ --P 0.9 \ --start 0 \ --end 3500 \ --Mode greedy \ --M 1024 \ --growmap ./growmaps/68m_7b-64.pt \ --dataset openwebtext
-
Finetune or Pretrain your own Draft Model
Let's pretrain your own draft model:
Command Examples
python pretrain_draft_model.py
Next, you can use your own draft model by direct use its local path like './outputs/checkpoint-500/' or push it to huggingface.