Join the community | Contribute to the library
How nebullvm works • Benchmarks • Installation • Get started
nebullvm speeds up AI inference by 2-30x in just a few lines of code 🚀
- How nebullvm works
- Benchmarks
- Installation
- Get started
- PyTorch, TensorFlow, Hugging Face and ONNX APIs
This open-source library takes your AI model as input and outputs an optimized version that runs 2-30 times faster on your hardware. Nebullvm tests multiple optimization techniques (deep learning compilers, quantization, sparsity, distillation, and more) to identify the optimal way to execute your AI model on your specific hardware.
nebullvm can speed up your model 2 to 10 times without loss of performance, or up to 30 times if you specify that you are willing to trade off a self-defined amount of accuracy/precision for super-low latency and a lighter model.
The goal of nebullvm is to let any developer benefit from the most advanced inference optimization techniques without having to spend countless hours understanding, installing, testing and debugging these powerful technologies.
Do you want to learn more about how nebullvm optimizes your model? Take a look at the documentation.
🚀 Superfast. nebullvm speeds up the response time of AI models to enable real-time AI applications with reduced computing cost and low power consumption.
☘️ Easy-to-use. It takes a few lines of code to install the library and optimize your models.
💻 Deep learning model agnostic. nebullvm supports all the most popular architectures such as transformers, LSTMs, CNNs and FCNs.
🔥 Framework agnostic. nebullvm supports the most widely used frameworks and provides as output an optimized version of your model with the same interface. At present, nebullvm supports PyTorch, TensorFlow, Hugging Face and ONNX models.
🤖 Hardware agnostic. The library now works on most CPUs and GPUs. If you activate the TVM compiler, nebullvm will also support TPUs and other deep learning-specific ASICs.
✨ Leveraging the best optimization techniques. There are many inference optimization techniques such as deep learning compilers, quantization, half-precision or distillation, which are all meant to optimize the way your AI models run on your hardware. It would take developers countless hours to install and test them on every model deployment. nebullvm does that for you.
Do you like the concept? Leave a ⭐ if you enjoy the project and join the Discord community where we chat about nebullvm and AI optimization. And happy acceleration 🚀🚀
We have tested nebullvm on popular AI models and hardware from leading vendors.
The table below shows the inference speedup provided by nebullvm. The speedup is calculated as the response time of the unoptimized model divided by the response time of the accelerated model, averaged over 100 experiments. For example, if the response time of an unoptimized model was 600 milliseconds on average and only 240 milliseconds after nebullvm optimization, the resulting speedup is 2.5x, meaning 150% faster inference.
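The speedup arithmetic above can be sketched in a few lines of Python (hypothetical helper functions for illustration, not part of nebullvm):

```python
def speedup(unoptimized_ms: float, optimized_ms: float) -> float:
    """Speedup = unoptimized latency / optimized latency."""
    return unoptimized_ms / optimized_ms

def percent_faster(s: float) -> float:
    """A speedup of s corresponds to (s - 1) * 100% faster inference."""
    return (s - 1.0) * 100.0

s = speedup(600, 240)    # 2.5, as in the example above
pct = percent_faster(s)  # 150.0
```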
A complete overview of the experiments and findings can be found in the documentation.
| Model | M1 Pro | Intel Xeon | AMD EPYC | Nvidia T4 |
|---|---|---|---|---|
| EfficientNetB0 | 23.3x | 3.5x | 2.7x | 1.3x |
| EfficientNetB2 | 19.6x | 2.8x | 1.5x | 2.7x |
| EfficientNetB6 | 19.8x | 2.4x | 2.5x | 1.7x |
| ResNet18 | 1.2x | 1.9x | 1.7x | 7.3x |
| ResNet152 | 1.3x | 2.1x | 1.5x | 2.5x |
| SqueezeNet | 1.9x | 2.7x | 2.0x | 1.3x |
| ConvNeXt tiny | 3.2x | 1.3x | 1.8x | 5.0x |
| ConvNeXt large | 3.2x | 1.1x | 1.6x | 4.6x |
| GPT2 - 10 tokens | 2.8x | 3.2x | 2.8x | 3.8x |
| GPT2 - 1024 tokens | - | 1.7x | 1.9x | 1.4x |
| BERT - 8 tokens | 6.4x | 2.9x | 4.8x | 4.1x |
| BERT - 512 tokens | 1.8x | 1.3x | 1.6x | 3.1x |
Overall, the library provides great results, with more than 2x acceleration in most cases and around 20x in a few applications. We can also observe that acceleration varies greatly across different hardware-model couplings, so we suggest you test nebullvm on your model and hardware to assess its full potential on your specific use case.
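A minimal latency-measurement harness like the following (plain framework-agnostic Python, not part of nebullvm) is enough to compare a model before and after optimization on your own hardware:

```python
import time

def average_latency_ms(fn, warmup=10, runs=100):
    """Average wall-clock latency of fn() in milliseconds over `runs` calls."""
    for _ in range(warmup):  # warm-up calls, excluded from the measurement
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs * 1000.0

# usage sketch: wrap your model's forward pass in a zero-argument callable,
# e.g. average_latency_ms(lambda: model(x)), before and after optimization
```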
Moreover, across all scenarios, nebullvm is very helpful for its ease of use, allowing you to take advantage of inference optimization techniques without having to spend hours studying, testing and debugging these powerful technologies.
The installation consists of two steps: first install nebullvm itself, then install the deep learning compilers it leverages.

There are two ways to install nebullvm:

- Using PyPI. We suggest installing the library with pip to get the stable version of nebullvm
- From source code, to get the latest features
The easiest way to install nebullvm is with pip, by running
pip install nebullvm
Alternatively, you can install nebullvm from source code by cloning the repository to your local machine using git.
git clone https://github.com/nebuly-ai/nebullvm.git
Then, enter the repo and install nebullvm with pip.
cd nebullvm
pip install .
Follow the instructions below to automatically install all deep learning compilers leveraged by nebullvm (OpenVINO, TensorRT, ONNX Runtime, Apache TVM, etc.).
To install them, there are three ways:
- Installation at the first optimization run
- Installation before the first optimization run (recommended)
- Download Docker images with preinstalled compilers
Note that:
- Apache TVM is not installed with the below instructions. TVM can be installed separately by following this guide.
- As an alternative to automatic installation of all compilers, they can be selectively installed by following these instructions.
The automatic installation of the deep learning compilers is triggered after you import nebullvm and perform your first optimization. You may run into import errors related to the compiler installation, but these errors/warnings can be safely ignored. It is also recommended to restart the Python kernel between the auto-installation and the first optimization; otherwise, not all compilers will be activated.
To avoid any problems, we strongly recommend running the auto-installation before performing the first optimization by running
python -c "import nebullvm"
At this stage, you should ignore any import warnings resulting from the previous command.
Instead of installing the compilers, which may take a long time, you can simply download the Docker container with all compilers preinstalled and start using nebullvm. To pull the Docker image, run
docker pull nebulydocker/nebullvm:cuda11.2.0-nebullvm0.3.1-allcompilers
and then run and access the Docker container with

docker run -it nebulydocker/nebullvm:cuda11.2.0-nebullvm0.3.1-allcompilers
After you have compiled the model, you may decide to deploy it to production. Note that some of the components used to optimize the model are also needed to run it, so you must have the compiler installed in the production Docker image. For this reason, we have created several versions of our Docker container in the Docker Hub, each containing only one compiler. Pull the image with the compiler that optimized your model!
nebullvm reduces the computation time of deep learning model inference by 2-30 times by testing multiple optimization techniques and identifying the optimal way to execute your AI model on your hardware.
nebullvm can be deployed in two ways:
- Option A: 2-10x acceleration, NO performance loss
- Option B: 2-30x acceleration, supervised performance loss
For a detailed explanation of how nebullvm works and how to use it, refer to the documentation.
If you choose this option, nebullvm will test multiple deep learning compilers (TensorRT, OpenVINO, ONNX Runtime, etc.) and identify the optimal way to compile your model on your hardware, increasing inference speed by 2-10 times without affecting the performance of your model.
As an example, below is code for accelerating a PyTorch model with nebullvm's PyTorch API.
>>> import torch
>>> import torchvision.models as models
>>> from nebullvm import optimize_torch_model
>>> model = models.efficientnet_b0()
>>> save_dir = "."
>>> bs, input_sizes = 1, [(3, 256, 256)]
>>> optimized_model = optimize_torch_model(
... model, batch_size=bs, input_sizes=input_sizes, save_dir=save_dir
... )
>>> x = torch.randn(1, 3, 256, 256)
>>> res = optimized_model(x)
nebullvm is capable of speeding up inference by much more than 10 times if you are willing to sacrifice a fraction of your model's performance. If you specify how much performance loss you can sustain, nebullvm will push your model's response time to its limits by identifying the best possible blend of state-of-the-art inference optimization techniques, such as deep learning compilers, distillation, quantization, half-precision, sparsity, etc.
Check out the documentation for more information on nebullvm APIs, how to use them, and for tutorials. Also find more information on how to contribute to the library and share feedback to support its continuous improvement.
And leave a star ⭐ to support the project 💫