hayabhay/frogbase

CPU Dynamic Quantization

MiscellaneousStuff opened this issue · 5 comments

Would it be possible for you guys to add an option to enable dynamic quantization of the model when it's run on a CPU? This would greatly improve the run-time performance of the OpenAI Whisper model on CPU-only machines, with minimal to no loss in transcription accuracy.

The benchmarks for this are available here.

The implementation only requires adding a few lines of code using features which are already built into PyTorch.

Implementation

Quantizing the Whisper model requires changing the custom Linear()
layers within the model to plain nn.Linear(). This is because you need
to specify which layer types to dynamically quantize, e.g.:

quantized_model = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)
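As a self-contained sketch of what that call does (using a toy model rather than Whisper itself — the model and layer names here are illustrative, not from the Whisper codebase):

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block: only the nn.Linear layers
# will be dynamically quantized; other layer types are left alone.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 64)
        self.act = nn.ReLU()
        self.fc2 = nn.Linear(64, 8)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

model_fp32 = TinyModel().eval()

# Weights are stored as int8; activations are quantized on the fly
# at inference time, which is what speeds up CPU execution.
quantized_model = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

# The Linear layers have been replaced by dynamic quantized modules.
print(quantized_model.fc1)

# Inference still takes ordinary float inputs.
out = quantized_model(torch.randn(1, 64))
print(out.shape)
```

Note that only the layer types listed in the set (here `torch.nn.Linear`) are converted; everything else runs in fp32 as before.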

However, the Whisper model is designed to be adaptable, i.e.
it can run at different precisions, so its Linear() layer contains
custom code to account for this. That custom code is not required for
the quantized model. You can either change the Linear() layers in
"/whisper/whisper/model.py" yourself (i.e. create a fork of OpenAI-Whisper
which would stay compatible with future merges), or you can use
mine from here.
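To illustrate why the swap is needed: as far as I can tell, quantize_dynamic matches modules by exact type, so a subclass of nn.Linear (like Whisper's custom Linear) is skipped when you pass {torch.nn.Linear}. A minimal sketch — CustomLinear here is a stand-in for Whisper's layer, not its actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative subclass in the spirit of Whisper's custom Linear:
# it casts weights to the input dtype so the model can run at
# different precisions. (Assumed shape, not the real model.py.)
class CustomLinear(nn.Linear):
    def forward(self, x):
        return F.linear(x, self.weight.to(x.dtype),
                        None if self.bias is None else self.bias.to(x.dtype))

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.plain = nn.Linear(16, 16)      # exact type match: quantized
        self.custom = CustomLinear(16, 16)  # subclass: left untouched

model = Block().eval()
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Only the plain nn.Linear was replaced by a dynamic quantized module;
# the subclass stays as an ordinary fp32 layer.
print(qmodel.plain)
print(type(qmodel.custom).__name__)
```

This is why swapping the custom Linear() for plain nn.Linear() in model.py makes the whole model eligible for dynamic quantization.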

Could this be done by swapping the whisper packages underneath?
-- pip install openai-whisper
++ pip install git+https://github.com/MiscellaneousStuff/whisper.git

Yep. That submodule is exactly the same as the original, but swaps the Linear() layer for nn.Linear(). However, it also means that anyone wanting to run the model at half precision on GPU won't be able to, so that custom whisper module should only be used for dynamic quantization on CPU.

Great! In that case, I'll add a note to the Readme telling users to swap out whisper for your fork if they intend to run it on a CPU-only machine. Thanks!

Updated Readme here: 0431dee

Doing what is recommended in the Readme does not work:

Note: If you're using a CPU-only machine, your runtime can be sped up by using the quantization implemented by @MiscellaneousStuff: swap out pip install openai-whisper from requirements.txt and replace it with their fork, pip install git+https://github.com/MiscellaneousStuff/whisper.git (see related discussion here - #20)

What exactly has to be put in the requirements.txt?
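For what it's worth, pip's requirements format accepts VCS URLs directly, so the swap would presumably look like this (using the fork URL from the discussion above):

```text
# requirements.txt — replace the PyPI package line:
#   openai-whisper
# with the fork's Git URL:
git+https://github.com/MiscellaneousStuff/whisper.git
```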