Byseqlib: A High Performance Inference Library for Sequence Processing and Generation

Byseqlib is a high performance inference library for sequence processing and generation implemented in CUDA. It enables highly efficient computation of modern NLP models such as BERT, GPT-2, and Transformer. It is therefore well suited for Machine Translation, Text Generation, Dialogue, Language Modeling, and other related tasks built on these models.

The library is built on top of the official CUDA libraries (cuBLAS, Thrust, CUB) and custom kernel functions that are specially fused and optimized for these widely used models. In addition to the model components, we also provide code to manage model weights trained with deep learning frameworks, as well as servers that act as custom backends for the TensorRT Inference Server (referred to as trtis in the following discussion). With Byseqlib, you can easily deploy efficient model services or develop your own model architectures with only a little code modification.

Features

  • Currently supports BERT, Transformer (with beam search) and the GPT-2 language model.
  • Out-of-the-box end-to-end model server based on trtis.
  • In addition to FP32, FP16 inference is also supported with no loss of accuracy, even when the model weights are stored in FP32.
  • High inference performance compared with TensorFlow (8x+ speedup on Transformer with beam search, 4x+ speedup on the GPT-2 language model).
  • One GPU stream per model, enabling efficient multi-model serving on a single GPU.

Performance

We report an experiment on a French-to-English (fr2en) translation model, a Transformer-big with a beam size of 4 and a target vocabulary of approximately 30k tokens. The implementation from tensor2tensor was used as the TensorFlow baseline (tf-transformer). Because the TensorFlow beam search lacks an FP16 version, we only tested the FP32 version of tf-transformer for a fair comparison.

The following table compares Byseqlib and TensorFlow on Tesla P4 and Tesla T4 GPUs. To save space, we only show the results for batch_size = 8. More results are available here.

| batch_size | seq_len | tf-fp32-p4 (ms) | byseq-fp32-p4 (ms) | byseq-fp16-t4 (ms) | speedup (byseq-fp32-p4 vs tf-fp32-p4) | speedup (byseq-fp16-t4 vs byseq-fp32-p4) | speedup (byseq-fp16-t4 vs tf-fp32-p4) |
|---|---|---|---|---|---|---|---|
| 8 | 6 | 364 | 76 | 43 | 4.78 | 1.77 | 8.47 |
| 8 | 12 | 470 | 110 | 56 | 4.27 | 1.96 | 8.39 |
| 8 | 18 | 854 | 205 | 91 | 4.16 | 2.25 | 9.38 |
| 8 | 24 | 1381 | 318 | 139 | 4.34 | 2.29 | 9.94 |
| 8 | 36 | 1628 | 378 | 156 | 4.3 | 2.42 | 10.44 |
| 8 | 46 | 1989 | 459 | 193 | 4.33 | 2.38 | 10.31 |
| 8 | 58 | 2683 | 617 | 254 | 4.34 | 2.43 | 10.56 |
| 8 | 70 | 4251 | 949 | 382 | 4.47 | 2.48 | 11.13 |

Code Structure

├── BUILD # bazel build file
├── 3rdparty
│   └── cub-1.8.0 # CUB lib
├── kernels # cuda kernel function
│   ├── common.h  # common function
│   ├── gptKernels.cu.cc # kernel function needed by gpt
│   ├── gptKernels.h
│   ├── transformerKernels.cu.cc # kernel function needed by transformer
│   └── transformerKernels.h
├── model # model infer component
│   ├── decoder.cu.cc # transformer decoder
│   ├── decoder.h 
│   ├── encoder.cu.cc # transformer encoder
│   ├── encoder.h
│   ├── gpt_encoder.cu.cc # gpt
│   └── gpt_encoder.h
├── proto # proto for model weights
│   ├── gpt.proto
│   ├── gpt_weight.cu.cc # model weights loader
│   ├── gpt_weight.h
│   ├── transformer.proto
│   ├── transformer_weight.cu.cc # model weights loader
│   └── transformer_weight.h
├── example # local inference demo
│   ├── gptlm_example.cu.cc # gptlm demo
│   └── transformer_example.cu.cc # transformer demo
├── server # model inference server based on trtis
│   ├── generate_server.cu.cc # transformer generate server, multi-target for one source
│   ├── gptlm_server.cu.cc # gptlm server
│   └── transformer_server.cu.cc # transformer server, one target for one source
└── tools # development tools, e.g. runtime guard, debugging

Requirements

Quick Start

To avoid problems caused by inconsistent environments, you can use the pre-built trtis container from NVIDIA GPU Cloud (NGC). To start the container, you need to install nvidia-docker and make sure your GPU driver version is >= 410.48.

docker pull nvcr.io/nvidia/tensorrtserver:19.05-py3
docker run --gpus '"device=0"' -it --rm -p8000:8000 -p8001:8001 -p8002:8002 \
  -v /${current}/${path}:/quick_start nvcr.io/nvidia/tensorrtserver:19.05-py3 /bin/bash
# inside container
cd /quick_start

Use our pre-built libraries

To quickly deploy a model currently supported by Byseqlib, you can download the pre-built libraries from the GitHub release page of the version you are interested in. Each release includes binary executable examples and dynamic link libraries of the models, which serve as custom backends for trtis.

wget https://github.com/bytedance/byseqlib/releases/download/${VERSION}/${VERSION}_libs.tar.gz
tar -zxvf ${VERSION}_libs.tar.gz

Run local inference demo

To run the local inference demo, you need model weights saved in the custom proto format defined by Byseqlib, as well as input token ids. We provide a GPT-LM model and its corresponding input token ids:

wget https://github.com/bytedance/byseqlib/releases/download/v0.0.1/v0.0.1_gptlm.pkg.tar.gz
tar -zxvf v0.0.1_gptlm.pkg.tar.gz
# fp32 example
./{VERSION}_libs/gptlm_example.fp32 ./v0.0.1_gptlm.pkg/gpt.pb ./v0.0.1_gptlm.pkg/test_case
# fp16 example
./{VERSION}_libs/gptlm_example.fp16 ./v0.0.1_gptlm.pkg/gpt.pb ./v0.0.1_gptlm.pkg/test_case

Run inference server

To run the end-to-end model server based on trtis, you need to prepare a custom backend model repository like this:

models/
  <model-name>/
    config.pbtxt # configuration
    xxx # model weights
    1/
      libyyy.so # custom dynamic link library

With the pre-built libraries and example weights mentioned above, you can easily run a server:

mkdir -p ./model_zoo/gptlm/1
wget https://github.com/bytedance/byseqlib/releases/download/v0.0.1/v0.0.1_gptlm.config.pbtxt
mv v0.0.1_gptlm.config.pbtxt model_zoo/gptlm/config.pbtxt
cp ./v0.0.1_gptlm.pkg/gpt.pb model_zoo/gptlm/gpt.pb
cp ./{VERSION}_libs/libgptlm.so.fp32 model_zoo/gptlm/1/libgptlm.so
# or fp16 server
# cp ./{VERSION}_libs/libgptlm.so.fp16 model_zoo/gptlm/1/libgptlm.so
export MODEL_ZOO="/quick_start/model_zoo"
trtserver --model-store=${MODEL_ZOO}

After the server starts, invoking the trtis client will return the inference result.
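
As a reference, here is a minimal Python client sketch using the tensorrtserver client library shipped with the 19.05 container. The tensor names ("token_ids", "result"), dtype and shape used below are hypothetical placeholders; use the input/output definitions from the config.pbtxt of the model you actually deployed.

# client_example.py -- sketch of querying the gptlm model served by trtis.
# Tensor names, dtypes and shapes are hypothetical; read the real ones from config.pbtxt.
import numpy as np
from tensorrtserver.api import InferContext, ProtocolType

url = "localhost:8000"                      # HTTP port exposed by `docker run -p8000:8000`
protocol = ProtocolType.from_str("http")
ctx = InferContext(url, protocol, "gptlm")  # model name matches the model_zoo/gptlm directory

# A single request carrying one sequence of token ids.
token_ids = np.array([40, 367, 2885, 50256], dtype=np.int32)
result = ctx.run(
    {"token_ids": (token_ids,)},                # inputs: name -> per-batch tensors
    {"result": InferContext.ResultFormat.RAW},  # outputs: name -> result format
    batch_size=1,
)
print(result["result"])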

Serve your own model

To serve your own model, you need to export the model trained in a deep learning framework (e.g., TensorFlow, PyTorch) to the custom model proto defined by Byseqlib. Furthermore, you may need to build from source if you want to modify the model architectures or serve a new model that Byseqlib does not currently support.
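
The proto definitions live in proto/gpt.proto and proto/transformer.proto. As a rough illustration, the export can be done with Python bindings generated by protoc; the message and field names used below (Gpt, token_embedding) are hypothetical placeholders, so follow the actual definitions in gpt.proto.

# export_gpt_weights.py -- sketch of dumping trained weights into Byseqlib's proto format.
# First generate Python bindings from the proto definition, e.g.:
#   protoc --python_out=. proto/gpt.proto
import numpy as np
import gpt_pb2  # generated from proto/gpt.proto

gpt = gpt_pb2.Gpt()  # hypothetical top-level message name
# Copy each tensor exported from your framework (TensorFlow checkpoint, PyTorch
# state_dict, ...) into the proto, flattened to a list of floats.
token_embedding = np.load("token_embedding.npy")  # placeholder weight source
gpt.token_embedding.extend(token_embedding.flatten().tolist())
# ... repeat for the remaining fields defined in gpt.proto ...

with open("gpt.pb", "wb") as f:
    f.write(gpt.SerializeToString())

The resulting gpt.pb can then be used with the local demo or placed in the trtis model repository as shown above.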

Limitations and Future Plans

Byseqlib does not support CPU inference for now, and its compilation relies heavily on trtis; we will try to solve these problems in the future. In addition, the following will be the focus of our future work:

  • Support more model architectures and decoding search algorithms.
  • Int8 inference.
  • Device deployment.

Contact

For any questions or suggestions, please feel free to contact us at wangxiaohui.neo@bytedance.com.