Join the community | Contribute to the library
How nebullvm works • Benchmarks • Installation • Get started
nebullvm speeds up AI inference by 2-30x in just a few lines of code 🚀
- How nebullvm works
- Benchmarks
- Installation
- Get started
- PyTorch, TensorFlow, Hugging Face and ONNX APIs
This open-source library takes your AI model as input and outputs an optimized version that runs 2-30 times faster on your hardware. Nebullvm tests multiple optimization techniques (deep learning compilers, quantization, sparsity, distillation, and more) to identify the optimal way to execute your AI model on your specific hardware.
nebullvm can speed up your model 2 to 10 times without loss of performance, or up to 30 times if you specify that you are willing to trade off a self-defined amount of accuracy/precision for super-low latency and a lighter model.
The goal of nebullvm is to let any developer benefit from the most advanced inference optimization techniques without having to spend countless hours understanding, installing, testing and debugging these powerful technologies.
Do you want to learn more about how nebullvm optimizes your model? Take a look at the documentation.
🚀 Superfast. nebullvm speeds up the response time of AI models to enable real-time AI applications with reduced computing cost and low power consumption.
☘️ Easy-to-use. It takes a few lines of code to install the library and optimize your models.
💻 Deep learning model agnostic. nebullvm supports all the most popular architectures such as transformers, LSTMs, CNNs and FCNs.
🔥 Framework agnostic. nebullvm supports the most widely used frameworks and provides as output an optimized version of your model with the same interface. At present, nebullvm supports PyTorch, TensorFlow, Hugging Face and ONNX models.
🤖 Hardware agnostic. The library now works on most CPUs and GPUs. If you activate the TVM compiler, nebullvm will also support TPUs and other deep learning-specific ASICs.
✨ Leveraging the best optimization techniques. There are many inference optimization techniques such as deep learning compilers, quantization, half-precision or distillation, which are all meant to optimize the way your AI models run on your hardware. It would take developers countless hours to install and test them on every model deployment. nebullvm does that for you.
Do you like the concept? Leave a ⭐ if you enjoy the project and join the Discord community where we chat about nebullvm and AI optimization. And happy acceleration 🚀🚀
We have tested nebullvm on popular AI models and hardware from leading vendors.
The table below shows the inference speedup provided by nebullvm. The speedup is calculated as the response time of the unoptimized model divided by the response time of the accelerated model, averaged over 100 experiments. For example, if the response time of an unoptimized model was 600 milliseconds on average and only 240 milliseconds after nebullvm optimization, the resulting speedup is 2.5x, meaning 150% faster inference.
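The speedup arithmetic above can be sketched in a few lines of Python (hypothetical helper functions for illustration, not part of nebullvm):

```python
def speedup(unoptimized_ms: float, optimized_ms: float) -> float:
    """Speedup = unoptimized latency / optimized latency."""
    return unoptimized_ms / optimized_ms

def percent_faster(s: float) -> float:
    """A speedup of s corresponds to (s - 1) * 100% faster inference."""
    return (s - 1.0) * 100.0

s = speedup(600, 240)    # 2.5, as in the example above
pct = percent_faster(s)  # 150.0
```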
A complete overview of the experiments and findings can be found in the documentation.
| Model | M1 Pro | Intel Xeon | AMD EPYC | Nvidia T4 |
|---|---|---|---|---|
| EfficientNetB0 | 23.3x | 3.5x | 2.7x | 1.3x |
| EfficientNetB2 | 19.6x | 2.8x | 1.5x | 2.7x |
| EfficientNetB6 | 19.8x | 2.4x | 2.5x | 1.7x |
| ResNet18 | 1.2x | 1.9x | 1.7x | 7.3x |
| ResNet152 | 1.3x | 2.1x | 1.5x | 2.5x |
| SqueezeNet | 1.9x | 2.7x | 2.0x | 1.3x |
| ConvNeXt tiny | 3.2x | 1.3x | 1.8x | 5.0x |
| ConvNeXt large | 3.2x | 1.1x | 1.6x | 4.6x |
| GPT2 - 10 tokens | 2.8x | 3.2x | 2.8x | 3.8x |
| GPT2 - 1024 tokens | - | 1.7x | 1.9x | 1.4x |
| BERT - 8 tokens | 6.4x | 2.9x | 4.8x | 4.1x |
| BERT - 512 tokens | 1.8x | 1.3x | 1.6x | 3.1x |
Overall, the library provides great results, with more than 2x acceleration in most cases and around 20x in a few applications. We can also observe that acceleration varies greatly across different hardware-model couplings, so we suggest you test nebullvm on your model and hardware to assess its full potential on your specific use case.
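A minimal latency-measurement harness like the following (plain framework-agnostic Python, not part of nebullvm) is enough to compare a model before and after optimization on your own hardware:

```python
import time

def average_latency_ms(fn, warmup=10, runs=100):
    """Average wall-clock latency of fn() in milliseconds over `runs` calls."""
    for _ in range(warmup):  # warm-up calls, excluded from the measurement
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs * 1000.0

# usage sketch: wrap your model's forward pass in a zero-argument callable,
# e.g. average_latency_ms(lambda: model(x)), before and after optimization
```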
Moreover, across all scenarios, nebullvm is very helpful for its ease of use, allowing you to take advantage of inference optimization techniques without having to spend hours studying, testing and debugging these powerful technologies.
The installation consists of two steps: first install nebullvm itself, then install the deep learning compilers it leverages.

There are two ways to install nebullvm:

- Using PyPI. We suggest installing the library with pip to get the stable version of nebullvm
- From source code, to get the latest features
The easiest way to install nebullvm is with pip, by running
pip install nebullvm
Alternatively, you can install nebullvm from source code by cloning the repository to your local machine using git.
git clone https://github.com/nebuly-ai/nebullvm.git
Then, enter the repo and install nebullvm with pip.
cd nebullvm
pip install .
Follow the instructions below to automatically install all deep learning compilers leveraged by nebullvm (OpenVINO, TensorRT, ONNX Runtime, Apache TVM, etc.).
To install them, there are three ways:
- Installation at the first optimization run
- Installation before the first optimization run (recommended)
- Download Docker images with preinstalled compilers
Note that:
- Apache TVM is not installed with the below instructions. TVM can be installed separately by following this guide.
- As an alternative to automatic installation of all compilers, they can be selectively installed by following these instructions.
The automatic installation of the deep learning compilers is triggered after you import nebullvm and perform your first optimization. You may run into import errors related to the compiler installation, but these errors/warnings can be safely ignored. It is also recommended to restart the Python kernel between the auto-installation and the first optimization; otherwise, not all compilers will be activated.
To avoid any problems, we strongly recommend running the auto-installation before performing the first optimization by running
python -c "import nebullvm"
At this stage, you should ignore any import warnings resulting from the previous command.
Instead of installing the compilers, which may take a long time, you can simply download the Docker container with all compilers preinstalled and start using nebullvm. To pull the Docker image, run
docker pull nebulydocker/nebullvm:cuda11.2.0-nebullvm0.3.1-allcompilers
and then run and access the Docker container with

docker run -it nebulydocker/nebullvm:cuda11.2.0-nebullvm0.3.1-allcompilers
After you have compiled the model, you may decide to deploy it to production. Note that some of the components used to optimize the model are also needed to run it, so you must have the compiler installed in the production Docker image. For this reason, we have created several versions of our Docker container in the Docker Hub, each containing only one compiler. Pull the image with the compiler that optimized your model!
nebullvm reduces the computation time of deep learning model inference by 2-30 times by testing multiple optimization techniques and identifying the optimal way to execute your AI model on your hardware.
nebullvm can be deployed in two ways:
- Option A: 2-10x acceleration, NO performance loss
- Option B: 2-30x acceleration, supervised performance loss
For a detailed explanation of how nebullvm works and how to use it, refer to the documentation.
If you choose this option, nebullvm will test multiple deep learning compilers (TensorRT, OpenVINO, ONNX Runtime, etc.) and identify the optimal way to compile your model on your hardware, increasing inference speed by 2-10 times without affecting the performance of your model.
As an example, below is code for accelerating a PyTorch model with nebullvm's PyTorch API.
>>> import torch
>>> import torchvision.models as models
>>> from nebullvm import optimize_torch_model
>>> model = models.efficientnet_b0()
>>> save_dir = "."
>>> bs, input_sizes = 1, [(3, 256, 256)]
>>> optimized_model = optimize_torch_model(
... model, batch_size=bs, input_sizes=input_sizes, save_dir=save_dir
... )
>>> x = torch.randn(1, 3, 256, 256)
>>> res = optimized_model(x)
nebullvm is capable of speeding up inference by much more than 10 times if you are willing to sacrifice a fraction of your model's performance. If you specify how much performance loss you can sustain, nebullvm will push your model's response time to its limits by identifying the best possible blend of state-of-the-art inference optimization techniques, such as deep learning compilers, distillation, quantization, half-precision, sparsity, etc.
Check out the documentation for more information on nebullvm APIs, how to use them, and for tutorials. Also find more information on how to contribute to the library and share feedback to support its continuous improvement.
And leave a star ⭐ to support the project 💫