Join the community | Contribute to the library
How nebulgym works • Benchmarks • Installation • Get started • Tutorials & examples • Documentation
Nebulgym
Easy-to-use library to accelerate AI training leveraging state-of-the-art optimization techniques
- How nebulgym works
- Benchmarks
- Tutorials and examples
- Installation & get started
- Documentation
- Join the community for AI acceleration
How nebulgym works
nebulgym greatly reduces the training time of AI models without requiring any modification to the training setup. nebulgym optimizes the full training computing stack, from efficient data loading, to faster forward and backward passes, to earlier convergence.
No matter what model, framework, or training recipe you use, with nebulgym you speed up training by simply adding nebulgym class decorators to your code. The decorators will make sure that you use your hardware's computing power to the fullest and achieve the shortest possible training time.
Your code + @nebulgym_class_decorators = superfast training
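As a minimal sketch (the complete example is in the Get started section below), the decorators are the only addition to your existing code; the class names here are just placeholders:

import torch
from torch.utils.data import Dataset
from nebulgym.decorators.torch_decorators import accelerate_model, accelerate_dataset

@accelerate_model()        # the only change to your model code
class MyModel(torch.nn.Module):
    ...

@accelerate_dataset()      # the only change to your dataset code
class MyDataset(Dataset):
    ...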
So why nebulgym?
- Add nebulgym class decorators to your code and continue programming on your favorite training framework. nebulgym will let you achieve awesome training times.
- nebulgym can be coupled with any model, trainer, or other training technique to achieve a compound effect on training performance.
- nebulgym supports all the most popular architectures such as transformers, LSTMs, CNNs and FCNs.

Do you like the library? Leave a star, and join the community where we chat about nebulgym and AI acceleration.
And learn about the technology behind nebulgym in the documentation!
Happy training
Benchmarks
nebulgym has just been launched and has been tested on limited use cases. Early results are remarkably good, and it is expected that nebulgym will further reduce training time in future releases. At the same time, it is expected that nebulgym may fail in untested cases and provide different results, perhaps better or worse than those shown below.
We tested nebulgym on the custom model that you can find in the example section. The test consists of training for 10 epochs with a batch size of 8.
Below are the training times in seconds before nebulgym optimization and after its acceleration, as well as the speedup, which is calculated as the training time of the unoptimized model divided by the training time of the accelerated model.
Training time in seconds
Hardware | Not-optimized | Accelerated | Speedup |
---|---|---|---|
M1 Pro | 632.05 | 347.52 | 1.8x |
Intel Xeon | 788.05 | 381.01 | 2.1x |
AMD EPYC | 1547.35 | 1034.37 | 1.5x |
NVIDIA T4 | 258.88 | 127.32 | 2.0x |
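As a quick check of how the speedup column is computed from the two time columns (taking the M1 Pro row above as an example):

# speedup = training time before optimization / training time after acceleration
speedup = 632.05 / 347.52   # ≈ 1.82, reported as 1.8x in the table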
Hardware setup
- M1 Pro: Apple M1 Pro 16GB of RAM
- Intel Xeon: EC2 Instance on AWS - t2.large
- AMD EPYC: EC2 Instance on AWS - t4a.large
- NVIDIA T4: EC2 instance on AWS - g4dn.xlarge
How does nebulgym perform on your training setup? What do you think about nebulgym, and what are ways to make it even better? Share your ideas and results with us in the community chat.
Installation
Installing and using nebulgym is super easy! You can either:
- install nebulgym from PyPI (with pip), or
- install nebulgym from source code.
We strongly recommend that you install nebulgym in a new environment. You can create and manage your environment using Conda or another virtual environment management application. We tested the installation using virtual environments created with conda.
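For example, with conda (the environment name and Python version below are only illustrative choices, not requirements):

conda create -n nebulgym-env python=3.9
conda activate nebulgym-env
pip install nebulgym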
Installation from PyPI
pip install nebulgym
Source code installation
Clone the nebulgym repository to your local machine.
git clone https://github.com/nebuly-ai/nebulgym.git
Go into the repo and run the setup.py file.
cd nebulgym && python setup.py install
Get started
nebulgym accelerates training by means of class decorators. A class decorator is a very elegant and non-intrusive method that allows nebulgym to tag your model (@accelerate_model) and your dataset (@accelerate_dataset) and add functionalities to their classes. When you run a training session, nebulgym will greatly reduce the training time of your decorated model. As simple as that!
You can find more information about nebulgym class decorators, the parameters they can take as input, and other nebulgym classes that can be used as an alternative to decorators in the documentation.
How to use nebulgym class decorators
Put nebulgym class decorators right before defining your dataset and model classes.
- @accelerate_dataset must be entered before the dataset definition. nebulgym will cache dataset samples in memory, so that reading these samples after the first time becomes much faster. Caching the dataset makes data loading faster and more efficient, solving what could become the main bottleneck of the whole training process.
- @accelerate_model must be entered before the model definition. nebulgym will accelerate both forward and backward propagations by reducing the number of computationally expensive propagation steps and making computations more efficient.
nebulgym use case
Here we show an example of how you can easily use nebulgym class decorators. To achieve awesome training speed, you can simply add nebulgym decorators (@accelerate_model and @accelerate_dataset) before defining your AI model and dataset.
from typing import List, Callable

import torch
from torch.utils.data import Dataset

from nebulgym.decorators.torch_decorators import accelerate_model, accelerate_dataset


# Add nebulgym annotation before defining your model.
# This model takes as input an image of resolution 224x224.
@accelerate_model()
class CustomModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self._avg_pool = torch.nn.AvgPool2d(4)
        self._linear = torch.nn.Linear(3136, 1024)
        self._relu = torch.nn.ReLU()
        self._linears = torch.nn.Sequential(
            torch.nn.BatchNorm1d(1024),
            torch.nn.Linear(1024, 2048),
            torch.nn.ReLU(),
            torch.nn.BatchNorm1d(2048),
            torch.nn.Linear(2048, 1024),
            torch.nn.ReLU(),
            torch.nn.BatchNorm1d(1024),
            torch.nn.Linear(1024, 512),
            torch.nn.ReLU(),
            torch.nn.BatchNorm1d(512),
            torch.nn.Linear(512, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, 2),
        )

    def forward(self, x):
        x = self._avg_pool(x).mean(dim=-3).view(-1, 3136)
        x = self._relu(self._linear(x))
        return self._linears(x)


# Add nebulgym annotation before defining your dataset.
@accelerate_dataset()
class CustomDataset(Dataset):
    def __init__(self, img_paths: List[str], labelling_func: Callable, reading_func: Callable):
        self._images = img_paths
        self._labelling_func = labelling_func
        self._reading_func = reading_func

    def __getitem__(self, item):
        img_path = self._images[item]
        label = self._labelling_func(img_path)
        input_tensor = self._reading_func(img_path)
        return input_tensor, label

    def __len__(self):
        return len(self._images)
And that's it. Now, as soon as you perform a training run, nebulgym will optimize the full training computing stack, from efficient data loading, to faster forward and backward passes, to earlier convergence.
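For completeness, here is a minimal sketch of such a training run using the decorated classes above. The image paths and the labelling/reading helpers are placeholders to replace with your own data pipeline, and the hyperparameters are illustrative rather than tuned values:

from torch.utils.data import DataLoader

def read_image(path: str) -> torch.Tensor:
    # Placeholder reading_func: load the image at `path` and return a 3x224x224 float tensor.
    return torch.rand(3, 224, 224)

def label_image(path: str) -> int:
    # Placeholder labelling_func: derive a class index (0 or 1) from the file path.
    return 0 if "cat" in path else 1

dataset = CustomDataset(
    img_paths=["cat_0.jpg", "dog_0.jpg", "cat_1.jpg", "dog_1.jpg"],  # replace with real paths
    labelling_func=label_image,
    reading_func=read_image,
)
loader = DataLoader(dataset, batch_size=8, shuffle=True)

model = CustomModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(10):
    for inputs, labels in loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()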
Supported tech & roadmap
nebulgym has just been launched, and it is already capable of cutting training time in half. At the same time, it is expected that nebulgym may crash or fail in untested use cases. Moreover, the project is in its early stages and there is a lot of room for improvement for nebulgym to become a new paradigm for artificial intelligence training.
nebulgym aims to support every framework, every model, and every hardware platform, and to make the most of your hardware and software capabilities to train your model in a fraction of the time required now. In addition, nebulgym will always be extremely easy to use, to empower any developer to build powerful AI applications.
nebulgym already embeds many great technologies. Below you can find a list of the features already implemented and those that will be implemented soon. More specific tasks can be found on the issues page.
Any ideas about what could be implemented next? Would you like to contribute to this fantastic library? We welcome any ideas, questions, issues and pull requests! For more info go to the Documentation.
Supported frameworks
- PyTorch
- TensorFlow (open issue)
Supported backends
- PyTorch. Default compiler for models trained in PyTorch.
- Rammer. Compiler that can be used on Nvidia GPUs.
- ONNX Runtime. Training API that leverages some techniques developed for inference optimization. It currently supports only Nvidia GPUs.
Optimization techniques for data loading
- Cached datasets. Data loading is slow in some use cases and can become a major bottleneck of the whole training process. nebulgym provides cached datasets to speed up this process by caching data samples in memory, so that reading these samples after the first time becomes much faster (see the sketch below).
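Conceptually, dataset caching can be pictured as a memoizing wrapper around __getitem__. The sketch below is only an illustration of the idea, not nebulgym's actual implementation:

from torch.utils.data import Dataset

class InMemoryCachedDataset(Dataset):
    """Illustrative sketch: keep every sample in an in-memory dict after the
    first read, so later epochs skip the expensive loading step."""

    def __init__(self, wrapped: Dataset):
        self._wrapped = wrapped
        self._cache = {}

    def __getitem__(self, index):
        if index not in self._cache:
            self._cache[index] = self._wrapped[index]
        return self._cache[index]

    def __len__(self):
        return len(self._wrapped)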
Model optimization techniques
- Sparsified Back Propagation. Traditional neural networks consume a significant amount of computing resources during back propagation, so nebulgym leverages a simple yet effective technique to alleviate this problem: only a small subset of the full gradients is computed to update the model parameters, and the model still achieves the same accuracy as with full back propagation, or even better (see the sketch after this list).
- Selective-Backprop. (open issue)
- Layer Replacement (open issue)
- ModelReshaper (open issue)
- Distributed training (open issue)
- Forward gradients (open issue)
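To give an intuition for the gradient sparsification idea behind this family of techniques, here is a generic illustration (not nebulgym's implementation): an operation that is the identity in the forward pass keeps only the k largest-magnitude entries of the incoming gradient per sample in the backward pass, so the layers preceding it receive a sparse gradient.

import torch

class TopKGrad(torch.autograd.Function):
    """Illustrative sketch: identity in the forward pass; the backward pass
    keeps only the k largest-magnitude gradient entries per sample
    (k is assumed to be at most the number of features)."""

    @staticmethod
    def forward(ctx, x, k):
        ctx.k = k
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        flat = grad_output.flatten(start_dim=1)
        top_idx = flat.abs().topk(ctx.k, dim=1).indices
        mask = torch.zeros_like(flat)
        mask.scatter_(1, top_idx, 1.0)
        # Gradients w.r.t. x (sparsified) and w.r.t. k (an int, so None).
        return (flat * mask).view_as(grad_output), None

# Usage inside a model's forward pass, e.g.: x = TopKGrad.apply(x, 64)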
Library installation methods
- From PyPI
- Source code
Backend installation methods
- From the backend source code
- Automatic installation with an auto-installer (open issue)
Licence
This project is released under the Apache 2.0 licence.