Why is the inference so slow?
cckao opened this issue · 3 comments
Hi,
Unlimiformer is amazing and can really help me. However, inference is so slow that I suspect I am doing something wrong. Please help me. Thank you.
The task was pretty simple. I asked the LM to optimize the following Python code:
# bad_python_codes.py
total = 0
total += 0
total += 1
total += 2
total += 3
total += 4
I ran vanilla text generation with the following command, and model.generate(...) took 3 seconds to complete:
python run_generation.py \
--model_type llama \
--model_name_or_path /path/to/CodeLlama-13b-Instruct-hf \
--prefix "<s>[INST] <<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n Optimize following Python codes: " \
--prompt bad_python_codes.py \
--suffix " [/INST]" \
--test_unlimiformer False \
--fp16 \
--length 10 \
--use_datastore False
When I enabled Unlimiformer, model.generate(...) took 1 minute and 20 seconds to complete:
python run_generation.py \
--model_type llama \
--model_name_or_path /path/to/CodeLlama-13b-Instruct-hf \
--prefix "<s>[INST] <<SYS>>\n You are a helpful assistant. Answer with detailed responses according to the entire instruction or question. \n<</SYS>>\n\n Optimize following Python codes: " \
--prompt bad_python_codes.py \
--suffix " [/INST]" \
--test_unlimiformer True \
--fp16 \
--length 10 \
--layer_begin 0 \
--index_devices 1 \
--datastore_device 1 \
--use_datastore True
The FAISS retrieval takes a lot of time: it is performed at every head and every layer.
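To get a rough feel for why this adds up, here is a back-of-the-envelope count of retrievals. It is only a sketch: it assumes one k-NN search per head, per Unlimiformer-enabled layer, per generated token, and it assumes a model with 40 layers and 40 attention heads. The function name and the head count are illustrative assumptions, not taken from the Unlimiformer code.

```python
def retrievals_per_generation(num_layers, num_heads, layer_begin, new_tokens):
    """Rough count of k-NN searches for one generate() call.

    Assumption (not from the Unlimiformer source): one search per head,
    per layer at or above layer_begin, per generated token.
    """
    active_layers = num_layers - layer_begin
    return active_layers * num_heads * new_tokens


# Assumed model shape: 40 layers, 40 heads. new_tokens=10 mirrors --length 10.
all_layers = retrievals_per_generation(40, 40, layer_begin=0, new_tokens=10)
upper_half = retrievals_per_generation(40, 40, layer_begin=20, new_tokens=10)
print(all_layers, upper_half)
```

Under these assumptions, starting retrieval at layer 20 instead of layer 0 halves the number of searches.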
Hi @cckao and @AshwinRamachandran2002 ,
Thank you for your interest in our work!
Yes, indeed running Unlimiformer is slower.
We found that using --layer_begin X with a value X that is at least half the number of layers (that is, if the model has 40 layers, X should be at least 20) helps both speed and the quality of the output.
Additionally, if your input is not too long (<10k tokens), using --use_datastore False may speed things up a bit.
Let us know if you have any questions!
Best,
Uri
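The two suggestions above can be turned into a small helper that picks flag values. This is a hypothetical sketch, not part of the Unlimiformer repository: the function name suggest_flags is invented for illustration, and the 10,000-token threshold is simply the "<10k tokens" rule of thumb from the comment above.

```python
def suggest_flags(num_layers, input_tokens, datastore_threshold=10_000):
    """Return run_generation.py flags following the advice in this thread.

    Hypothetical helper: only --layer_begin and --use_datastore are real
    flags; the selection logic is a rule of thumb, not a guarantee.
    """
    # At least half the number of layers, rounding up for odd layer counts.
    layer_begin = (num_layers + 1) // 2
    # Skip the datastore for inputs shorter than the threshold.
    use_datastore = input_tokens >= datastore_threshold
    return [
        "--layer_begin", str(layer_begin),
        "--use_datastore", str(use_datastore),
    ]


# Example: a 40-layer model with a short prompt.
print(" ".join(suggest_flags(num_layers=40, input_tokens=200)))
```

The returned list can be appended to the rest of the run_generation.py arguments, replacing any existing --layer_begin and --use_datastore values.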
Hi, @urialon and @AshwinRamachandran2002 ,
Thanks for your comments. Using --use_datastore False speeds things up a lot.