NNL

A low-latency, high-performance inference engine for large models on low-memory GPU platforms.


NeuralNet Logic


NNL is an inference engine for large models on low-memory GPU platforms.

Contents

  • Introduction
  • Build the library
  • GPT2-XL Example
  • Roadmap
  • License
  • Acknowledgements
  • Why NNL?

Introduction

Big models are too large to fit into GPU memory. NNL addresses this problem by trading PCIe bandwidth for GPU memory: weights are streamed from the host to the device node by node as the computation proceeds.
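
As a rough estimate of what this trade-off costs, assuming FP32 weights and a PCIe 3.0 x16 link with a practical bandwidth of about 12 GB/s: GPT2-XL's ~1.6B parameters occupy roughly 6.2 GB, so streaming every weight once adds on the order of half a second per forward pass, while the GPU only ever needs to hold a single node's weights and activations.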

A typical inference pipeline is as follows (a code sketch follows the list):

  1. compose the computation graph from a model with $n$ nodes
  2. topologically sort the nodes of the computation graph into a computation table
  3. for i in [1, 2, 3, ..., n]:
    • execute the following tasks asynchronously:
      • compute the output of node i
      • load the weights of node i+1 to the GPU
      • allocate GPU memory (output tensor and caches) for node i+1
      • deallocate GPU memory (output tensors, weights and caches) for node i-1
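
A minimal sketch of this loop, using CUDA streams to overlap the weight transfer with computation. This is an illustration of the technique, not NNL's actual code; the Node struct, its fields, and launch_node_kernel are hypothetical names:

    // Pipelined execution: compute node i while prefetching node i+1.
    #include <cuda_runtime.h>
    #include <vector>

    struct Node {
        void*  weights_host = nullptr;  // pinned host copy of the weights
        void*  weights_dev  = nullptr;  // device copy, resident only around step i
        size_t weight_bytes = 0;
        void*  output_dev   = nullptr;  // output tensor and caches on device
        size_t output_bytes = 0;
    };

    void run(std::vector<Node>& nodes) {
        // node 0's weights and output are assumed resident before the loop
        cudaStream_t compute, copy;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&copy);

        for (size_t i = 0; i < nodes.size(); ++i) {
            // compute the output of node i (kernel launch elided)
            // launch_node_kernel(nodes[i], compute);   // hypothetical

            if (i + 1 < nodes.size()) {
                Node& next = nodes[i + 1];
                // allocate GPU memory for node i+1 (a pool allocator in NNL)
                cudaMallocAsync(&next.weights_dev, next.weight_bytes, copy);
                cudaMallocAsync(&next.output_dev,  next.output_bytes, copy);
                // load the weights of node i+1 over PCIe, overlapping with
                // the computation of node i
                cudaMemcpyAsync(next.weights_dev, next.weights_host,
                                next.weight_bytes, cudaMemcpyHostToDevice, copy);
            }
            if (i > 0) {
                // deallocate node i-1; enqueued on the compute stream after
                // node i's kernel, so its output is no longer needed
                cudaFreeAsync(nodes[i - 1].weights_dev, compute);
                cudaFreeAsync(nodes[i - 1].output_dev,  compute);
            }
            // both the kernel and the prefetch must finish before step i+1
            cudaStreamSynchronize(compute);
            cudaStreamSynchronize(copy);
        }
        cudaStreamDestroy(compute);
        cudaStreamDestroy(copy);
    }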

With a GPU memory pool and memory defragmentation, NNL makes it possible to run inference with a large model on a low-end GPU platform.
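
The pool and the defragmenter might look roughly like the following toy sketch; this is an assumption about the technique, not NNL's implementation. Note that compaction moves data, so a real pool would hand out stable handles rather than raw pointers:

    // Toy compacting device-memory pool: one big arena, bump allocation,
    // and a defragmentation pass that slides live blocks to the front.
    #include <cuda_runtime.h>
    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct Block { size_t offset, size; };

    class Pool {
        uint8_t* base_ = nullptr;        // the arena, one cudaMalloc up front
        size_t capacity_ = 0, top_ = 0;  // bump pointer
        std::vector<Block> live_;        // live blocks, ordered by offset
    public:
        explicit Pool(size_t capacity) : capacity_(capacity) {
            cudaMalloc(reinterpret_cast<void**>(&base_), capacity_);
        }
        ~Pool() { cudaFree(base_); }

        void* alloc(size_t size) {
            if (top_ + size > capacity_) defragment();   // reclaim the holes
            if (top_ + size > capacity_) return nullptr; // truly out of memory
            live_.push_back({top_, size});
            void* p = base_ + top_;
            top_ += size;
            return p;
        }

        void free(void* p) {             // leaves a hole until defragmentation
            size_t off = static_cast<uint8_t*>(p) - base_;
            for (auto it = live_.begin(); it != live_.end(); ++it)
                if (it->offset == off) { live_.erase(it); return; }
        }

        // Slide every live block toward offset 0 so free space is contiguous.
        void defragment() {
            size_t cursor = 0;
            for (Block& b : live_) {
                size_t gap = b.offset - cursor;
                // copy in chunks no larger than the gap, front to back, so
                // source and destination of each cudaMemcpy never overlap
                for (size_t done = 0; gap > 0 && done < b.size; done += gap) {
                    size_t n = std::min(gap, b.size - done);
                    cudaMemcpy(base_ + cursor + done, base_ + b.offset + done,
                               n, cudaMemcpyDeviceToDevice);
                }
                b.offset = cursor;
                cursor += b.size;
            }
            top_ = cursor;
        }
    };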

Build the library

This is just a hobby project written up in a few weeks; currently only the CUDA backend is supported.

Tested with

  • gcc 13.2.1
  • CUDA 12.2
  • cuDNN 8.9.2.26

Build the static libraries

make libnnl_cuda.a && make libnnl_cuda_kernels.a

This command builds two static libraries: lib/libnnl_cuda.a and lib/libnnl_cuda_kernels.a. The former is the core library with the CUDA backend, written in C++; the latter contains the CUDA kernels.
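
A client program can then link against both archives together with the CUDA runtime and cuDNN. A hypothetical link line follows; the include directory, library order, and dependency flags are assumptions, so check the Makefile for the exact command:

g++ -std=c++17 -Iinclude demo.cpp -Llib -lnnl_cuda -lnnl_cuda_kernels -lcudart -lcudnn -o demo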

GPT2-XL Example

A demo program for GPT2-XL (1.6B parameters) is provided in the repository. It can be compiled with this command:

make gpt2_1558m

After downloading all the weights from the release page, we can run the following command on a low-end GPU platform such as a GTX 1050 (2 GB of memory):

./bin/gpt2_1558m --max_len 20  "Hi. My name is Feng and I am a machine learning engineer"

The output looks like this:

Disclaimer: this is just an example generated by GPT2-XL; I am not working at Google and I do not know Randi.

You can also observe the GPU memory access pattern during inference.

Roadmap

  • int8 support
  • more layers
  • more example applications
  • weight persistence for small models that fit entirely in GPU memory

License

PeaceOSL

Acknowledgements

Why NNL?