Byseqlib: A High Performance Inference Library for Sequence Processing and Generation

Byseqlib is a high performance inference library for sequence processing and generation implemented in CUDA. It enables highly efficient computation of modern NLP models such as BERT, GPT-2, and Transformer. It is therefore well suited for Machine Translation, Text Generation, Dialogue, Language Modeling, and other related tasks built on these models.

The library is built on top of the official CUDA libraries (cuBLAS, Thrust, CUB) and custom kernel functions that are specially fused and optimized for these widely used models. In addition to the model components, we also provide code to manage model weights trained with deep learning frameworks, as well as servers that act as custom backends for the TensorRT Inference Server (referred to as trtis in the following discussion). With Byseqlib, you can easily deploy efficient model services or develop your own model architectures with only a little code modification.

Features

  • Currently supports BERT, Transformer (with beam search) and the GPT-2 language model.
  • Out-of-the-box end-to-end model server based on trtis.
  • In addition to FP32, FP16 inference is also supported with no loss of accuracy, even when the model weights are stored in FP32.
  • High inference performance compared with TensorFlow (8x+ speedup on Transformer with beam search, 4x+ speedup on the GPT-2 language model).
  • One GPU stream per model, enabling efficient multi-model serving on a single GPU.

Performance

We report an experiment on a French-to-English (fr2en) translation model, a Transformer-big with a beam size of 4 and a target vocabulary of approximately 30k tokens. The implementation from tensor2tensor was used as the TensorFlow baseline (tf-transformer). Because the TensorFlow beam search lacks an FP16 version, we only tested the FP32 version of tf-transformer for a fair comparison.

The following table compares Byseqlib and TensorFlow on Tesla P4 and Tesla T4 GPUs. To save space, we only show the results for batch_size = 8. More results are available here.

| batch_size | seq_len | tf-fp32-p4 (ms) | byseq-fp32-p4 (ms) | byseq-fp16-t4 (ms) | speedup (byseq-fp32-p4 vs tf-fp32-p4) | speedup (byseq-fp16-t4 vs byseq-fp32-p4) | speedup (byseq-fp16-t4 vs tf-fp32-p4) |
|---|---|---|---|---|---|---|---|
| 8 | 6 | 364 | 76 | 43 | 4.78 | 1.77 | 8.47 |
| 8 | 12 | 470 | 110 | 56 | 4.27 | 1.96 | 8.39 |
| 8 | 18 | 854 | 205 | 91 | 4.16 | 2.25 | 9.38 |
| 8 | 24 | 1381 | 318 | 139 | 4.34 | 2.29 | 9.94 |
| 8 | 36 | 1628 | 378 | 156 | 4.3 | 2.42 | 10.44 |
| 8 | 46 | 1989 | 459 | 193 | 4.33 | 2.38 | 10.31 |
| 8 | 58 | 2683 | 617 | 254 | 4.34 | 2.43 | 10.56 |
| 8 | 70 | 4251 | 949 | 382 | 4.47 | 2.48 | 11.13 |

Code Structure

├── BUILD # bazel build file
├── 3rdparty
│   └── cub-1.8.0 # CUB lib
├── kernels # cuda kernel function
│   ├── common.h  # common function
│   ├── gptKernels.cu.cc # kernel function needed by gpt
│   ├── gptKernels.h
│   ├── transformerKernels.cu.cc # kernel function needed by transformer
│   └── transformerKernels.h
├── model # model infer component
│   ├── decoder.cu.cc # transformer decoder
│   ├── decoder.h 
│   ├── encoder.cu.cc # transformer encoder
│   ├── encoder.h
│   ├── gpt_encoder.cu.cc # gpt
│   └── gpt_encoder.h
├── proto # proto for model weights
│   ├── gpt.proto
│   ├── gpt_weight.cu.cc # model weights loader
│   ├── gpt_weight.h
│   ├── transformer.proto
│   ├── transformer_weight.cu.cc # model weights loader
│   └── transformer_weight.h
├── example # local inference demo
│   ├── gptlm_example.cu.cc # gptlm demo
│   └── transformer_example.cu.cc # transformer demo
├── server # model inference server based on trtis
│   ├── generate_server.cu.cc # transformer generate server, multi-target for one source
│   ├── gptlm_server.cu.cc # gptlm server
│   └── transformer_server.cu.cc # transformer server, one target for one source
└── tools # development tools, e.g. runtime guard, debugging

Requirements

Quick Start

To avoid problems caused by inconsistent environments, you can use the pre-built trtis container from NVIDIA GPU Cloud (NGC). To start the container, you need to install nvidia-docker and make sure your GPU driver version is >= 410.48.

docker pull nvcr.io/nvidia/tensorrtserver:19.05-py3
docker run --gpus '"device=0"' -it --rm -p8000:8000 -p8001:8001 -p8002:8002 \
  -v /${current}/${path}:/quick_start nvcr.io/nvidia/tensorrtserver:19.05-py3 /bin/bash
# inside container
cd /quick_start

Use our pre-built libraries

To quickly deploy a model currently supported by Byseqlib, you can download the pre-built libraries from the GitHub release page of the version you are interested in. Each release includes binary executable examples and dynamic link libraries of the models, which serve as custom backends for trtis.

wget https://github.com/bytedance/byseqlib/releases/download/${VERSION}/${VERSION}_libs.tar.gz
tar -zxvf ${VERSION}_libs.tar.gz

Run local inference demo

To run the local inference demo, you need model weights saved in the custom proto format defined by Byseqlib, as well as input token ids. We provide a GPT-LM model and its corresponding input token ids:

wget https://github.com/bytedance/byseqlib/releases/download/v0.0.1/v0.0.1_gptlm.pkg.tar.gz
tar -zxvf v0.0.1_gptlm.pkg.tar.gz
# fp32 example
./{VERSION}_libs/gptlm_example.fp32 ./v0.0.1_gptlm.pkg/gpt.pb ./v0.0.1_gptlm.pkg/test_case
# fp16 example
./{VERSION}_libs/gptlm_example.fp16 ./v0.0.1_gptlm.pkg/gpt.pb ./v0.0.1_gptlm.pkg/test_case

Run inference server

To run the end-to-end model server based on trtis, you need to prepare a custom backend model repository like this:

models/
  <model-name>/
    config.pbtxt # configuration
    xxx # model weights
    1/
      libyyy.so # custom dynamic link library

With the pre-built libraries and example weights mentioned above, you can easily run a server:

mkdir -p ./model_zoo/gptlm/1
wget https://github.com/bytedance/byseqlib/releases/download/v0.0.1/v0.0.1_gptlm.config.pbtxt
mv v0.0.1_gptlm.config.pbtxt model_zoo/gptlm/config.pbtxt
cp ./v0.0.1_gptlm.pkg/gpt.pb model_zoo/gptlm/gpt.pb
cp ./{VERSION}_libs/libgptlm.so.fp32 model_zoo/gptlm/1/libgptlm.so
# or fp16 server
# cp ./{VERSION}_libs/libgptlm.so.fp16 model_zoo/gptlm/1/libgptlm.so
export MODEL_ZOO="/quick_start/model_zoo"
trtserver --model-store=${MODEL_ZOO}

After the server starts, invoking the trtis client will return the inference result.
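
As a reference, here is a minimal Python client sketch using the tensorrtserver client library shipped with the 19.05 container. The tensor names ("token_ids", "result"), dtype and shape used below are hypothetical placeholders; use the input/output definitions from the config.pbtxt of the model you actually deployed.

# client_example.py -- sketch of querying the gptlm model served by trtis.
# Tensor names, dtypes and shapes are hypothetical; read the real ones from config.pbtxt.
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

url = "localhost:8000"                      # HTTP port exposed by `docker run -p8000:8000`
protocol = ProtocolType.from_str("http")
ctx = InferContext(url, protocol, "gptlm")  # model name matches the model_zoo/gptlm directory

# A single request carrying one sequence of token ids.
token_ids = np.array([40, 367, 2885, 50256], dtype=np.int32)
result = ctx.run(
    {"token_ids": (token_ids,)},                # inputs: name -> per-batch tensors
    {"result": InferContext.ResultFormat.RAW},  # outputs: name -> result format
    batch_size=1,
)
print(result["result"])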

Serve your own model

To serve your own model, you need to export the model trained in a deep learning framework (e.g., TensorFlow, PyTorch) to the custom model proto defined by Byseqlib. Furthermore, you may need to build from source if you want to modify the model architectures or serve a new model that Byseqlib does not currently support.
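
The proto definitions live in proto/gpt.proto and proto/transformer.proto. As a rough illustration, the export can be done with Python bindings generated by protoc; the message and field names used below (Gpt, token_embedding) are hypothetical placeholders, so follow the actual definitions in gpt.proto.

# export_gpt_weights.py -- sketch of dumping trained weights into Byseqlib's proto format.
# First generate Python bindings from the proto definition, e.g.:
#   protoc --python_out=. proto/gpt.proto
import numpy as np
import gpt_pb2  # generated from proto/gpt.proto

gpt = gpt_pb2.Gpt()  # hypothetical top-level message name
# Copy each tensor exported from your framework (TensorFlow checkpoint, PyTorch
# state_dict, ...) into the proto, flattened to a list of floats.
token_embedding = np.load("token_embedding.npy")  # placeholder weight source
gpt.token_embedding.extend(token_embedding.flatten().tolist())
# ... repeat for the remaining fields defined in gpt.proto ...

with open("gpt.pb", "wb") as f:
    f.write(gpt.SerializeToString())

The resulting gpt.pb can then be used with the local demo or placed in the trtis model repository as shown above.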

Limitations and Future Plans

Byseqlib does not support CPU inference for now, and its compilation relies heavily on trtis; we will try to solve these problems in the future. In addition, the following will be the focus of our future work:

  • Support more model architectures and decoding search algorithms.
  • Int8 inference.
  • Device deployment.

Contact

For any questions or suggestions, please feel free to contact us at wangxiaohui.neo@bytedance.com.