This repository contains code to run faster feature extractors using tools like quantization, optimization, and ONNX. Just run your model much faster while using less memory. There is not much to it!
> Philipp Schmid: "We successfully quantized our vanilla Transformers model with Hugging Face and managed to accelerate our model latency from 25.6ms to 12.3ms or 2.09x while keeping 100% of the accuracy on the stsb dataset. But I have to say that this isn't a plug and play process you can transfer to any Transformers model, task or dataset."
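For background, this is roughly the export-and-quantize pipeline the package automates. A minimal sketch using Hugging Face optimum (assuming `optimum[onnxruntime]` is installed; the output directories and quantization config are illustrative choices, not this package's internals):

```python
# Sketch: export a sentence-transformers model to ONNX, then apply dynamic
# (post-training) quantization for CPU inference. Paths are illustrative.
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the vanilla Transformers model to the ONNX format.
model = ORTModelForFeatureExtraction.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2", export=True
)
model.save_pretrained("onnx/")

# Dynamic quantization turns weights into int8; no calibration data is needed.
quantizer = ORTQuantizer.from_pretrained("onnx/")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="onnx-quantized/", quantization_config=qconfig)
```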
Install with pip:

```bash
pip install fast-sentence-transformers
```

Or, for GPU support:

```bash
pip install fast-sentence-transformers[gpu]
```
```python
from fast_sentence_transformers import FastSentenceTransformer as SentenceTransformer

# Use any model from the sentence-transformers hub.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")

# Encode a single sentence or a batch of sentences.
encoder.encode("Hello hello, hey, hello hello")
encoder.encode(["Life is too short to eat bad food!"] * 2)
```
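Assuming `encode` mirrors sentence-transformers and returns numpy arrays, the embeddings drop straight into downstream code; for example, a cosine similarity between two sentences:

```python
# Sketch: cosine similarity between two embeddings. Assumes encode() on a
# single sentence returns a 1-D numpy array, as in sentence-transformers.
import numpy as np

a = encoder.encode("Hello hello, hey, hello hello")
b = encoder.encode("Life is too short to eat bad food!")
cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cos_sim:.3f}")
```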
A non-exact, indicative benchmark of speed and memory usage for a smaller and a larger model from sentence-transformers:
| model | type | default | ONNX | ONNX+quantized | ONNX+GPU |
|---|---|---|---|---|---|
| paraphrase-albert-small-v2 | memory | 1x | 1x | 1x | 1x |
| | speed | 1x | 2x | 5x | 20x |
| paraphrase-multilingual-mpnet-base-v2 | memory | 1x | 1x | 4x | 4x |
| | speed | 1x | 2x | 5x | 20x |
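To get indicative numbers on your own hardware, a rough timing comparison is easy to set up (a sketch; absolute timings depend heavily on CPU, batch size, and sequence length):

```python
# Sketch: rough latency comparison between the vanilla encoder and the
# ONNX-backed one. Absolute numbers vary heavily across machines.
import time

from sentence_transformers import SentenceTransformer
from fast_sentence_transformers import FastSentenceTransformer

sentences = ["Life is too short to eat bad food!"] * 64
encoders = {
    "default": SentenceTransformer("sentence-transformers/paraphrase-albert-small-v2", device="cpu"),
    "onnx": FastSentenceTransformer("sentence-transformers/paraphrase-albert-small-v2", device="cpu"),
}

for name, encoder in encoders.items():
    encoder.encode(sentences)  # warm-up, excluded from timing
    start = time.perf_counter()
    encoder.encode(sentences)
    print(f"{name}: {time.perf_counter() - start:.3f}s")
```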
This package leans heavily on https://www.philschmid.de/optimize-sentence-transformers.