This repository provides the Triton backend for FasterTransformer, together with a script and recipe for running the highly optimized transformer-based encoder and decoder components. It is tested and maintained by NVIDIA. FasterTransformer v4.0 supports multi-GPU inference on GPT-3 models, and this backend integrates FasterTransformer into Triton so that giant GPT-3 models can be served by Triton. The example below shows how to use the FasterTransformer backend in Triton to run inference on a GPT-3 model with 345M parameters trained by Megatron-LM.
Note that this is a research and prototyping tool, not a formal product or maintained framework. Users can learn more about Triton backends in the backend repo. Ask questions or report problems on the issues page of this FasterTransformer_backend repo.
- Prepare Machine
We provide a Dockerfile, based on the Triton image nvcr.io/nvidia/tritonserver:21.07-py3, to set up the environment.
mkdir -p workspace && cd workspace
git clone https://github.com/novatig/fastertransformer_backend.git
nvidia-docker build --tag ft_backend --file fastertransformer_backend/Dockerfile .
nvidia-docker run --gpus=all -it --rm --volume $PWD:/workspace -w /workspace --name ft-work ft_backend
export WORKSPACE=$(pwd)
- Install libraries for Megatron (optional)
pip3 install regex fire
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
- Build FT backend
cd $WORKSPACE
git clone https://github.com/triton-inference-server/server.git
export PATH=/usr/local/mpi/bin:$PATH
source fastertransformer_backend/build.env
mkdir -p fastertransformer_backend/build && cd $WORKSPACE/fastertransformer_backend/build
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=1 .. && make -j32
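Before moving on, it can help to confirm that the build produced the two shared libraries that the serving step later copies into the Triton backend directory. The following is a minimal sketch, not part of the repository: the function name check_build_artifacts is hypothetical, and the file names are taken from the copy command in the serving step.

```shell
# Hypothetical helper: report which expected build outputs are missing.
# Usage: check_build_artifacts "$WORKSPACE/fastertransformer_backend/build"
check_build_artifacts() {
    local build_dir="$1"
    local missing=0
    local f
    for f in libtriton_fastertransformer.so lib/libtransformer-shared.so; do
        if [ ! -f "$build_dir/$f" ]; then
            echo "missing: $build_dir/$f" >&2
            missing=1
        fi
    done
    # Returns 0 only when both libraries are present.
    return "$missing"
}
```

A non-zero return code means at least one library is absent and the build should be re-run before continuing.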
- Prepare model
git clone https://github.com/NVIDIA/Megatron-LM.git
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P models
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P models
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
mkdir -p models/megatron-models/345m
unzip megatron_lm_345m_v0.0.zip -d models/megatron-models/345m
python _deps/repo-ft-src/sample/pytorch/utils/megatron_ckpt_convert.py -i ./models/megatron-models/345m/release/ -o ./models/megatron-models/c-model/345m/ -t_g 1 -i_g 4 -h_n 16
cp ./models/megatron-models/c-model/345m/4-gpu $WORKSPACE/fastertransformer_backend/all_models/fastertransformer/1/ -r
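The copy above assumes that the converter wrote its output to a directory named after the GPU count passed with -i_g (4-gpu for -i_g 4). A small guard can catch a mismatch before an empty or missing directory is copied into the model repository. This is a sketch under that naming assumption; converted_model_dir is a hypothetical name.

```shell
# Hypothetical helper: locate the converted checkpoint for a given GPU count.
# Assumes the converter writes to <out_dir>/<N>-gpu, as the cp command above
# suggests for N=4.
converted_model_dir() {
    local out_dir="${1%/}"   # drop a trailing slash, if any
    local n_gpu="$2"
    local dir="$out_dir/${n_gpu}-gpu"
    # The directory must exist and contain at least one entry.
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]; then
        echo "$dir"
    else
        echo "no converted weights in $dir" >&2
        return 1
    fi
}
```

For example, src=$(converted_model_dir ./models/megatron-models/c-model/345m 4) only succeeds when the 4-gpu directory is populated, so the cp can be chained after it with &&.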
- Prepare the ft-triton-backend docker
Push an ft-triton-backend Docker image so that it can be initialized on multiple nodes.
# Press Ctrl+P then Ctrl+Q to detach from the container
docker ps -a #get the container name
docker commit container_name github_or_gitlab/repo_name/image_name:latest
docker push github_or_gitlab/repo_name/image_name:latest
- Run serving directly
cp $WORKSPACE/fastertransformer_backend/build/libtriton_fastertransformer.so $WORKSPACE/fastertransformer_backend/build/lib/libtransformer-shared.so /opt/tritonserver/backends/fastertransformer
cd $WORKSPACE && ln -s server/qa/common .
# It is recommended to increase SERVER_TIMEOUT in common/util.sh
cd $WORKSPACE/fastertransformer_backend/build/
# bash $WORKSPACE/fastertransformer_backend/tools/run_server.sh # This method fails since we add MPI features
mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver --model-repository=$WORKSPACE/fastertransformer_backend/all_models/ &
bash $WORKSPACE/fastertransformer_backend/tools/run_client.sh
python _deps/repo-ft-src/sample/pytorch/utils/convert_gpt_token.py --out_file=triton_out # Used for checking result
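Because tritonserver is started in the background, the client can be launched before the model has finished loading. One way to avoid that race is to poll Triton's HTTP readiness endpoint first. The sketch below assumes that the server listens on the default HTTP port 8000 and that curl is available in the container; wait_for_triton is a hypothetical helper name.

```shell
# Hypothetical helper: poll Triton's readiness endpoint until it answers
# or the attempt budget is exhausted.
# Usage: wait_for_triton [host:port] [max_attempts] [sleep_seconds]
wait_for_triton() {
    local addr="${1:-localhost:8000}"
    local attempts="${2:-60}"
    local interval="${3:-5}"
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f makes curl return non-zero on HTTP errors, which is what the
        # endpoint reports while the server is not ready yet.
        if curl -sf -o /dev/null "http://$addr/v2/health/ready"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "Triton at $addr not ready after $attempts attempts" >&2
    return 1
}
```

The client script can then be chained after it, for example wait_for_triton localhost:8000 && bash $WORKSPACE/fastertransformer_backend/tools/run_client.sh.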
- Modify the model configuration
The model configuration for the Triton server is stored in all_models/fastertransformer/config.pbtxt. Users can modify the following hyper-parameters:
- candidate_num: k value of top k
- probability_threshold: p value of top p
- tensor_para_size: size of tensor parallelism
- layer_para_size: size of layer parallelism
- layer_para_batch_size: unused in the Triton backend because this backend only supports a single node, where users are recommended to use tensor parallelism
- max_seq_len: max supported sequence length
- is_half: whether to use half precision (FP16)
- head_num: head number of attention
- size_per_head: size per head of attention
- vocab_size: size of vocabulary
- decoder_layers: number of transformer layers
- batch_size: max supported batch size
- is_fuse_QKV: fusing QKV in one matrix multiplication or not. It also depends on the weights of QKV.
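To check which values are currently set before editing them, the file can be inspected from the shell. The sketch below assumes that each hyper-parameter is stored as a parameters block with a key and a string_value, which is the usual text layout for custom parameters in a Triton model configuration; verify this against your own config.pbtxt. The function name get_pbtxt_param is hypothetical.

```shell
# Hypothetical helper: print the string_value stored for one parameter key.
# Assumed layout (check your config.pbtxt):
#   parameters {
#     key: "tensor_para_size"
#     value: { string_value: "8" }
#   }
get_pbtxt_param() {
    local file="$1"
    local key="$2"
    awk -v key="$key" '
        # Remember when the requested key has been seen.
        $0 ~ ("key:[ \t]*\"" key "\"") { found = 1 }
        # The first string_value after that key is taken as its value.
        found && /string_value:/ {
            if (match($0, /string_value:[ \t]*"[^"]*"/)) {
                val = substr($0, RSTART, RLENGTH)
                sub(/^string_value:[ \t]*"/, "", val)
                sub(/"$/, "", val)
                print val
                ok = 1
                exit
            }
        }
        END { exit (ok ? 0 : 1) }
    ' "$file"
}
```

For example, get_pbtxt_param all_models/fastertransformer/config.pbtxt tensor_para_size prints the configured value, and the function returns non-zero when the key is not present.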
- Benchmark on single node
Run this script with different batch sizes, input lengths, output lengths and numbers of runs on a single node with 8 GPUs. It starts the server, runs the client to measure the latency, and stops the server at the end.
# run with batch_size = 8, input_len = 512, output_len = 16, and run 10 times to get the average latency
bash $WORKSPACE/fastertransformer_backend/tools/benchmark_single_node.sh -b 8 -i 512 -o 16 -n 10
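To compare several batch sizes in one go, the benchmark script can be wrapped in a loop. The wrapper below is a sketch: sweep_batch_sizes and the DRY_RUN switch are hypothetical, while the -b, -i, -o and -n flags are the ones used in the command above.

```shell
# Hypothetical wrapper: run the benchmark script once per batch size.
# Usage: sweep_batch_sizes <input_len> <output_len> <runs> <batch_size>...
# Set DRY_RUN=1 to print the commands instead of executing them.
sweep_batch_sizes() {
    local input_len="$1"
    local output_len="$2"
    local runs="$3"
    shift 3
    local script="$WORKSPACE/fastertransformer_backend/tools/benchmark_single_node.sh"
    local b
    for b in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "bash $script -b $b -i $input_len -o $output_len -n $runs"
        else
            # Stop the sweep as soon as one benchmark run fails.
            bash "$script" -b "$b" -i "$input_len" -o "$output_len" -n "$runs" || return 1
        fi
    done
}
```

For example, sweep_batch_sizes 512 16 10 1 2 4 8 benchmarks batch sizes 1, 2, 4 and 8 with the same input length, output length and number of runs.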
Wrap up everything in a Docker image, as described in the Prepare the ft-triton-backend docker step.
First allocate two nodes:
salloc -A account_name -t 10:00:00 -N 2
Then run the script shown below to start the servers on the two nodes. Press Ctrl+Z and then run bg to keep it running in the background.
-N and -n should be equal to the number of nodes because we start one process per node. If you need to run on three nodes, then -N 3 and -n 3.
Remember to change tensor_para_size and layer_para_size if you run on multiple nodes (total number of GPUs = num_gpus_per_node x num_nodes = tensor_para_size x layer_para_size). We suggest tensor_para_size = number of GPUs in one node (e.g. 8 for DGX A100) and layer_para_size = number of nodes (2 for two nodes). The other model configuration in config.pbtxt should be modified as usual.
WORKSPACE="/workspace" # the dir you build the docker
IMAGE="github_or_gitlab/fastertransformer/multi-node-ft-triton-backend:latest"
CMD="cp $WORKSPACE/fastertransformer_backend/build/libtriton_fastertransformer.so $WORKSPACE/fastertransformer_backend/build/lib/libtransformer-shared.so /opt/tritonserver/backends/fastertransformer;/opt/tritonserver/bin/tritonserver --model-repository=$WORKSPACE/fastertransformer_backend/all_models"
srun -N 2 -n 2 --mpi=pmix -o inference_server.log --container-mounts /home/account/your_network_shared_space/triton:/workspace --container-name multi-node-ft-triton --container-image $IMAGE bash -c "$CMD"
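The relation total number of GPUs = num_gpus_per_node x num_nodes = tensor_para_size x layer_para_size is easy to get wrong when the node count changes. It can be checked with a few lines of shell arithmetic before the servers are launched. This is an illustrative sketch; check_parallel_layout is a hypothetical name and the values are passed in by hand rather than read from config.pbtxt.

```shell
# Hypothetical check: tensor_para_size x layer_para_size must equal the
# total number of GPUs in the allocation.
# Usage: check_parallel_layout <gpus_per_node> <num_nodes> <tensor_para_size> <layer_para_size>
check_parallel_layout() {
    local gpus_per_node="$1"
    local num_nodes="$2"
    local tensor_para_size="$3"
    local layer_para_size="$4"
    local total=$((gpus_per_node * num_nodes))
    local configured=$((tensor_para_size * layer_para_size))
    if [ "$total" -ne "$configured" ]; then
        echo "mismatch: $total GPUs available, config uses $configured" >&2
        return 1
    fi
    # Print the total GPU count when the layout is consistent.
    echo "$total"
}
```

For two DGX A100 nodes with the suggested settings, check_parallel_layout 8 2 8 2 succeeds, while check_parallel_layout 8 2 8 1 reports a mismatch.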
Next, once the inference log shows that the servers have started, enter the master Triton node (the node where MPI_Rank = 0, normally the allocated node with the smallest id):
srun -w master-node-name --overlap --container-name multi-node-ft-triton --container-mounts /home/account/your_network_shared_space/triton:/workspace --pty bash # --overlap may not be needed in your slurm environment
Finally, run the client in the master triton node:
export WORKSPACE="/workspace"
bash $WORKSPACE/fastertransformer_backend/tools/run_client.sh
You can refer to inference_server.log on the login node for the inference server log.
When you enter the master Triton node and send a request through the client, you get client.log, error.log and triton_out in the current directory.
You can modify $WORKSPACE/fastertransformer_backend/tools/identity_test.py to use a different batch size, input length and output length in requests.
To run on multiple nodes, make sure that the nodes can reach each other over ssh without issues. The process is almost the same as on Enroot/Pyxis clusters: run the servers on two nodes with MPIRUN or PMIX, then go to the master node and send requests to the servers through the client. The script may differ according to your cluster and environment, but in every case the two nodes must have ssh access to each other and MPIRUN must be callable on both.
export IMAGE="github_or_gitlab/fastertransformer/multi-node-ft-triton-backend:latest" # the image you update in the previous step
export WORKSPACE="/home/name/workspace" # your workspace
srun -N2 -n2 -t 600 --pty bash # Assume the two nodes are luna-01, luna-02
srun -N2 -n2 docker pull $IMAGE
srun -N2 -n2 nvidia-docker run -itd --rm --privileged --network=host --pid=host --cap-add=IPC_LOCK --device=/dev/infiniband -v /$CONT_VOL:$HOST_VOL -v $WORKSPACE:$WORKSPACE -w $WORKSPACE --name ft-backend-test $IMAGE /bin/bash
#set up ssh
srun -N2 -n2 nvidia-docker exec -i --env SLURM_NTASKS --env SLURM_NODEID --env SLURM_PROCID --env SLURM_STEP_NODELIST --env SLURMD_NODENAME --privileged ft-backend-test bash -c "mkdir /root/.ssh && cp $WORKSPACE/ssh/* /root/.ssh && chmod 700 /root/.ssh && chmod 640 /root/.ssh/authorized_keys && chmod 400 /root/.ssh/id_rsa && apt-get update && apt-get install ssh -y && mkdir /run/sshd/ && /usr/sbin/sshd -p 11068 && nvidia-smi -lgc 1530"
# luna-01, luna-02
nvidia-docker exec -ti ft-backend-test bash
cd fastertransformer_backend/build
mpirun --allow-run-as-root -np 2 -H luna-01:1,luna-02:1 -mca plm_rsh_args "-p 11068" cp $WORKSPACE/fastertransformer_backend/build/libtriton_fastertransformer.so $WORKSPACE/fastertransformer_backend/build/lib/libtransformer-shared.so /opt/tritonserver/backends/fastertransformer
mpirun --allow-run-as-root -np 2 -H luna-01:1,luna-02:1 -mca plm_rsh_args "-p 11068" /opt/tritonserver/bin/tritonserver --model-repository=$WORKSPACE/fastertransformer_backend/all_models &
bash $WORKSPACE/fastertransformer_backend/tools/run_client.sh
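The host list passed to mpirun -H above is written out by hand for luna-01 and luna-02. For other node names or node counts it can be generated instead, with one slot per node to match the one-process-per-node setup. A minimal sketch; make_hostlist is a hypothetical helper.

```shell
# Hypothetical helper: build the value for mpirun -H from node names,
# assigning one slot to each node.
# Usage: make_hostlist <node>...
make_hostlist() {
    local slots=1
    local list=""
    local host
    for host in "$@"; do
        # Append with a comma separator except before the first entry.
        list="${list:+$list,}$host:$slots"
    done
    echo "$list"
}
```

Called as make_hostlist luna-01 luna-02, it prints luna-01:1,luna-02:1. Inside a SLURM allocation the node names could be taken from scontrol show hostnames "$SLURM_JOB_NODELIST", assuming that variable is set in your environment.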