trt gives random output values, diffs with onnxruntime
tpoisonooo opened this issue · 20 comments
Description
I am using the TensorRT Python API to convert an onnx model; the script does not finish even after 2 hours.
But trtexec --onnx=model.onnx --fp16 stops normally and gives me model.engine.
Environment
TensorRT Version: 8.6.1.6 GA
NVIDIA GPU: GTX1660
NVIDIA Driver Version: 515.86.01
CUDA Version: 11.7
CUDNN Version: 8.4.1
Operating System: ubuntu20.04
Python Version (if applicable): 3.9
Tensorflow Version (if applicable): -
PyTorch Version (if applicable): 2.0
Baremetal or Container (if so, version):
Relevant Files
fp16 onnx model, download here: https://huggingface.co/tpoisonooo/alpaca.onnx/blob/fp16/decoder-merge-0.onnx
single-file conversion script, download here: https://github.com/tpoisonooo/llama.onnx/blob/add-trt-backend/tools/onnx-to-trt.py
Steps To Reproduce
- Download the onnx and save it to onnx_model_dir
- Install the TensorRT Python package and run the script:
$ python3 onnx-to-trt.py onnx_model_dir output_engine_dir
This script does not finish.
- But trtexec works:
$ trtexec --onnx=/path/to/onnx_models/decoder-merge-0.onnx --fp16
$ ls
.. decoder.engine
Notes
This onnx is part of the LLaMA HuggingFace-format export.
Since LLaMA needs a KV cache and there is an If operator here, I have to build an empty_tensor to hack around it.
So past_key_in.min_shape is [1,32,0,128]; this works on onnxruntime.
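To make the first decoding step work without a cache, the empty-tensor hack presumably looks like this (a sketch, not the author's exact code):

import numpy as np

# First decoding step: no KV cache yet, so the sequence dim of the cache is 0.
empty_past_key = np.zeros((1, 32, 0, 128), dtype=np.float16)
empty_past_value = np.zeros((1, 32, 0, 128), dtype=np.float16)
# After t generated tokens the cache grows along dim 2: (1, 32, t, 128).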
As noted above, trtexec --onnx=model.onnx --fp16 stops normally and gives me model.engine, but when I run inference with this model.engine I get wrong outputs.
Does the current TensorRT 8.6.1.6 GA support LLaMA?
trtexec --onnx=/path/to/onnx_models/decoder-merge-0.onnx --fp16
Does your model have dynamic shapes? If yes, then you need to set the input shapes.
If trtexec can build the engine normally, then the issue is in your Python script. And I think we support LLaMA, since you can already build the engine with trtexec.
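For reference, setting dynamic shapes with the TensorRT Python API means attaching an optimization profile to the builder config. A minimal sketch, with tensor names and shape ranges taken from the trtexec commands in this thread (this is not the author's actual onnx-to-trt.py):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("decoder-merge-0.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Without a profile, dynamic-shape inputs give the builder nothing to optimize for.
profile = builder.create_optimization_profile()
profile.set_shape("hidden_in", min=(1, 1, 4096), opt=(1, 1, 4096), max=(1, 64, 4096))
profile.set_shape("attn_mask", min=(1, 1, 1, 1), opt=(1, 1, 1, 2), max=(1, 1, 64, 192))
profile.set_shape("position_ids", min=(1, 1), opt=(1, 1), max=(1, 64))
profile.set_shape("past_key_in", min=(1, 32, 0, 128), opt=(1, 32, 1, 128), max=(1, 32, 192, 128))
profile.set_shape("past_value_in", min=(1, 32, 0, 128), opt=(1, 32, 1, 128), max=(1, 32, 192, 128))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("decoder.engine", "wb") as f:
    f.write(engine_bytes)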
Thanks, I have converted it to .engine with:
trtexec --onnx=decoder-merge-0.onnx \
  --minShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1,position_ids:1x1,past_key_in:1x32x0x128,past_value_in:1x32x0x128 \
  --optShapes=hidden_in:1x1x4096,attn_mask:1x1x1x2,position_ids:1x1,past_key_in:1x32x1x128,past_value_in:1x32x1x128 \
  --maxShapes=hidden_in:1x64x4096,attn_mask:1x1x64x192,position_ids:1x64,past_key_in:1x32x192x128,past_value_in:1x32x192x128 \
  --fp16 --saveEngine=decoder.engine
Let me try inference value later.
@zerollzeng I got wrong output values from TRT, which differ from onnxruntime. Here is the reproduction:
- Download the onnx; take decoder-merge-0.onnx as an example: https://huggingface.co/tpoisonooo/alpaca.onnx/blob/fp16/decoder-merge-0.onnx
- Generate the .engine with the trtexec command mentioned before:
trtexec --onnx=decoder-merge-0.onnx \
  --minShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1,position_ids:1x1,past_key_in:1x32x0x128,past_value_in:1x32x0x128 \
  --optShapes=hidden_in:1x1x4096,attn_mask:1x1x1x2,position_ids:1x1,past_key_in:1x32x1x128,past_value_in:1x32x1x128 \
  --maxShapes=hidden_in:1x64x4096,attn_mask:1x1x64x192,position_ids:1x64,past_key_in:1x32x192x128,past_value_in:1x32x192x128 \
  --fp16 --saveEngine=decoder-merge-0.engine
- Download the test data: https://github.com/tpoisonooo/llama.onnx/tree/add-trt-backend/data ; there are 3 numpy arrays:
$ llama.onnx git:(add-trt-backend) cd data && ls *
attn_mask.npy hidden_in.npy position_ids.npy
- Open this single Python script and set the onnx/engine file paths: https://github.com/tpoisonooo/llama.onnx/blob/add-trt-backend/llama/trt_wrapper.py#L157
# inference with trt
trt_wrapper = TrtWrapper('path/to/decoder-merge-0.engine')
trt_outputs = trt_wrapper.forward(_inputs)
# with ort
ort_wrapper = OrtWrapper('path/to/decoder-merge-0.onnx')
ort_outputs = ort_wrapper.forward(_inputs)
- Run it; it prints np.allclose results and diff.max() (see the sketch after the log below):
(base) ➜ llama git:(add-trt-backend) python3 trt_wrapper.py
2023-05-09 16:20:05.846 | DEBUG | __main__:__init__:128 - /home/khj/下载/7b-onnx/alpaca-onnx-7B-fp16/models/decoder-merge-0.onnx loaded
False
False
False
7.645
# again
(base) ➜ llama git:(add-trt-backend) python3 trt_wrapper.py
2023-05-09 16:22:33.620 | DEBUG | __main__:__init__:128 - /home/khj/下载/7b-onnx/alpaca-onnx-7B-fp16/models/decoder-merge-0.onnx loaded
False
False
False
4.492
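Presumably the three False lines and the final number come from a check along these lines (a sketch; the real comparison lives in trt_wrapper.py, and which tensor feeds diff.max() is my guess):

import numpy as np

def report_diff(trt_outputs, ort_outputs, atol=1e-2):
    # One allclose per output tensor (past_key, past_value, hidden_out).
    for trt_out, ort_out in zip(trt_outputs, ort_outputs):
        print(np.allclose(trt_out, ort_out, atol=atol))
    # Max absolute difference; guessed to be taken on the last output here.
    diff = np.abs(trt_outputs[-1].astype(np.float32) -
                  ort_outputs[-1].astype(np.float32))
    print(round(float(diff.max()), 3))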
TRT gives me random output values; note that diff.max() differs between the two runs.
cc @lingffff
Could you try with Polygraphy? see https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy/examples/cli/run/01_comparing_frameworks
a quick check would be like polygraphy run decoder-merge-0.onnx --trt --fp16 --onnxrt --trt-min-shapes xxx --trt-opt-shapes xxx --trt-max-shapes xxx --input-shapes xxx --data-loader-script data_loader.py
refer to polygraphy run -h
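If the CLI feels heavy, roughly the same check can be written with Polygraphy's Python API. A sketch, assuming the shape ranges from the trtexec command above (not an official recipe for this model):

from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import (CreateConfig, EngineFromNetwork,
                                    NetworkFromOnnxPath, Profile, TrtRunner)
from polygraphy.comparator import Comparator, CompareFunc

model = "decoder-merge-0.onnx"
profile = Profile()
profile.add("hidden_in", min=(1, 1, 4096), opt=(1, 1, 4096), max=(1, 64, 4096))
profile.add("attn_mask", min=(1, 1, 1, 1), opt=(1, 1, 1, 2), max=(1, 1, 64, 192))
profile.add("position_ids", min=(1, 1), opt=(1, 1), max=(1, 64))
profile.add("past_key_in", min=(1, 32, 0, 128), opt=(1, 32, 1, 128), max=(1, 32, 192, 128))
profile.add("past_value_in", min=(1, 32, 0, 128), opt=(1, 32, 1, 128), max=(1, 32, 192, 128))

build_engine = EngineFromNetwork(NetworkFromOnnxPath(model),
                                 config=CreateConfig(fp16=True, profiles=[profile]))

# Run both backends on the same generated inputs and compare elementwise.
results = Comparator.run([TrtRunner(build_engine), OnnxrtRunner(SessionFromOnnx(model))])
passed = bool(Comparator.compare_accuracy(
    results, compare_func=CompareFunc.simple(atol=1e-2, rtol=1e-3)))
print("PASSED" if passed else "FAILED")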
Ah.. why do we have to learn so many tools QaQ .. I will try it later.
@tpoisonooo Hi, I tested the precision using versions 8.6.0.12 and 8.6.1.6 before, and it seems good to me.
First, I converted it using trtexec:
./trtexec --onnx=/home/oldpan/code/models/GPT/LLAMA/alpaca.onnx/decoder-merge-5.onnx \
  --minShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1,position_ids:1x1,past_key_in:1x32x1x128,past_value_in:1x32x1x128 \
  --optShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1029,position_ids:1x1,past_key_in:1x32x1028x128,past_value_in:1x32x1028x128 \
  --maxShapes=hidden_in:1x1x4096,attn_mask:1x1x1x2049,position_ids:1x1,past_key_in:1x32x2048x128,past_value_in:1x32x2048x128 \
  --shapes=hidden_in:1x1x4096,attn_mask:1x1x1x50,position_ids:1x1,past_key_in:1x32x49x128,past_value_in:1x32x49x128 \
  --fp16 --saveEngine=decoder-merge-5.trt
And then used Polygraphy to test accuracy:
polygraphy run --onnxrt /home/oldpan/code/models/GPT/LLAMA/alpaca.onnx/decoder-merge-4.onnx --save-results=/home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json --data-loader-script /home/oldpan/code/convert/tools/data_loader.py
polygraphy run /home/oldpan/code/models/normal/tensorrt_engine/llama-trt/decoder-merge-4.trt --model-type engine --trt --data-loader-script /home/oldpan/code/convert/tools/data_loader.py --load-outputs /home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json --atol 1e-2 --rtol 1e-3
The output is
[I] RUNNING | Command: /home/oldpan/miniconda3/envs/develop/bin/polygraphy run /home/oldpan/code/models/normal/tensorrt_engine/llama-trt/decoder-merge-4.trt --model-type engine --trt --data-loader-script /home/oldpan/code/project/llama.ddeploy/tools/data_loader.py --load-outputs /home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json --atol 1e-2 --rtol 1e-3
[I] Saving custom input data to custom_inputs.json
[I] trt-runner-N0-05/11/23-10:21:57 | Activating and starting inference
[I] Loading bytes from /home/oldpan/code/models/normal/tensorrt_engine/llama-trt/decoder-merge-4.trt
[W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[W] Input tensor: position_ids | Buffer dtype (int64) does not match expected input dtype (int32), attempting to cast.
[I] trt-runner-N0-05/11/23-10:21:57
---- Inference Input(s) ----
{hidden_in [dtype=float16, shape=(1, 1, 4096)],
attn_mask [dtype=float16, shape=(1, 1, 1, 50)],
position_ids [dtype=int32, shape=(1, 1)],
past_key_in [dtype=float16, shape=(1, 32, 49, 128)],
past_value_in [dtype=float16, shape=(1, 32, 49, 128)]}
[I] trt-runner-N0-05/11/23-10:21:57
---- Inference Output(s) ----
{past_key [dtype=float16, shape=(1, 32, 50, 128)],
past_value [dtype=float16, shape=(1, 32, 50, 128)],
hidden_out [dtype=float16, shape=(1, 1, 4096)]}
[I] trt-runner-N0-05/11/23-10:21:57 | Completed 1 iteration(s) in 1.745 ms | Average inference time: 1.745 ms.
[I] Loading inference results from /home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json
[I] Accuracy Comparison | trt-runner-N0-05/11/23-10:21:57 vs. onnxrt-runner-N0-04/22/23-22:27:42
[I] Comparing Output: 'past_key' (dtype=float16, shape=(1, 32, 50, 128)) with 'past_key' (dtype=float16, shape=(1, 32, 50, 128))
[I] Tolerance: [abs=0.01, rel=0.001] | Checking elemwise error
[I] trt-runner-N0-05/11/23-10:21:57: past_key | Stats: mean=0.97991, std-dev=0.15628, var=0.024423, median=1, min=-1.999 at (0, 10, 49, 114), max=2.8555 at (0, 4, 49, 49), avg-magnitude=0.98727
[I] onnxrt-runner-N0-04/22/23-22:27:42: past_key | Stats: mean=0.97991, std-dev=0.15628, var=0.024423, median=1, min=-2 at (0, 10, 49, 114), max=2.8555 at (0, 4, 49, 49), avg-magnitude=0.98727
[I] Error Metrics: past_key
[I] Minimum Required Tolerance: elemwise error | [abs=0.0019531] OR [rel=2.0484] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=1.3869e-06, std-dev=2.5162e-05, var=6.3313e-10, median=0, min=0 at (0, 0, 0, 0), max=0.0019531 at (0, 10, 49, 43), avg-magnitude=1.3869e-06
[I] Relative Difference | Stats: mean=2.0463e-05, std-dev=0.0048362, var=2.3389e-05, median=0, min=0 at (0, 0, 0, 0), max=2.0484 at (0, 30, 49, 10), avg-magnitude=2.0463e-05
[I] PASSED | Output: 'past_key' | Difference is within tolerance (rel=0.001, abs=0.01)
[I] Comparing Output: 'past_value' (dtype=float16, shape=(1, 32, 50, 128)) with 'past_value' (dtype=float16, shape=(1, 32, 50, 128))
[I] Tolerance: [abs=0.01, rel=0.001] | Checking elemwise error
[I] trt-runner-N0-05/11/23-10:21:57: past_value | Stats: mean=0.97994, std-dev=0.14484, var=0.020979, median=1, min=-1.041 at (0, 31, 49, 55), max=1.0186 at (0, 30, 49, 121), avg-magnitude=0.98395
[I] onnxrt-runner-N0-04/22/23-22:27:42: past_value | Stats: mean=0.97994, std-dev=0.14484, var=0.020979, median=1, min=-1.041 at (0, 31, 49, 55), max=1.0186 at (0, 30, 49, 121), avg-magnitude=0.98395
[I] Error Metrics: past_value
[I] Minimum Required Tolerance: elemwise error | [abs=0.00024414] OR [rel=0.00092421] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=2.099e-09, std-dev=6.0771e-07, var=3.6931e-13, median=0, min=0 at (0, 0, 0, 0), max=0.00024414 at (0, 0, 49, 39), avg-magnitude=2.099e-09
[I] Relative Difference | Stats: mean=5.1702e-08, std-dev=5.9317e-06, var=3.5185e-11, median=0, min=0 at (0, 0, 0, 0), max=0.00092421 at (0, 17, 49, 117), avg-magnitude=5.1702e-08
[I] PASSED | Output: 'past_value' | Difference is within tolerance (rel=0.001, abs=0.01)
[I] Comparing Output: 'hidden_out' (dtype=float16, shape=(1, 1, 4096)) with 'hidden_out' (dtype=float16, shape=(1, 1, 4096))
[I] Tolerance: [abs=0.01, rel=0.001] | Checking elemwise error
[I] trt-runner-N0-05/11/23-10:21:57: hidden_out | Stats: mean=1.0164, std-dev=1.0491, var=1.1007, median=0.98901, min=-3.5781 at (0, 0, 1181), max=9.3516 at (0, 0, 3840), avg-magnitude=1.1887
[I] onnxrt-runner-N0-04/22/23-22:27:42: hidden_out | Stats: mean=1.0164, std-dev=1.0492, var=1.1007, median=0.9895, min=-3.5781 at (0, 0, 1181), max=9.3594 at (0, 0, 3840), avg-magnitude=1.1887
[I] Error Metrics: hidden_out
[I] Minimum Required Tolerance: elemwise error | [abs=0.0078125] OR [rel=0.4306] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=0.0005626, std-dev=0.00064283, var=4.1323e-07, median=0.00048828, min=0 at (0, 0, 0), max=0.0078125 at (0, 0, 3840), avg-magnitude=0.0005626
[I] Relative Difference | Stats: mean=0.0017539, std-dev=0.011641, var=0.0001355, median=0.00062073, min=0 at (0, 0, 0), max=0.4306 at (0, 0, 1582), avg-magnitude=0.0017539
[I] PASSED | Output: 'hidden_out' | Difference is within tolerance (rel=0.001, abs=0.01)
[I] PASSED | All outputs matched | Outputs: ['past_key', 'past_value', 'hidden_out']
The data_loader.py is:
import numpy as np
from polygraphy.json import save_json

INPUT_SHAPE_1 = (1, 1, 4096)
INPUT_SHAPE_2 = (1, 1, 1, 50)
INPUT_SHAPE_3 = (1, 1)
INPUT_SHAPE_4 = (1, 32, 49, 128)
INPUT_SHAPE_5 = (1, 32, 49, 128)

# --shapes=hidden_in:1x1x4096,attn_mask:1x1x1x50,position_ids:1x1,past_key_in:1x32x49x128,past_value_in:1x32x49x128
def load_data():
    for _ in range(1):
        yield {"hidden_in": np.ones(shape=INPUT_SHAPE_1, dtype=np.float16),
               "attn_mask": np.ones(shape=INPUT_SHAPE_2, dtype=np.float16),
               "position_ids": np.ones(shape=INPUT_SHAPE_3, dtype=np.int64),
               "past_key_in": np.ones(shape=INPUT_SHAPE_4, dtype=np.float16),
               "past_value_in": np.ones(shape=INPUT_SHAPE_5, dtype=np.float16)}  # Still totally real data
Hi @Oldpan, I notice that you set --minShapes=past_key_in:1x32x1x128 but @tpoisonooo set --minShapes=past_key_in:1x32x0x128. Maybe this zero-sized-tensor feature causes the problem?
There is a KV cache in LLaMA, so --minShapes=past_key_in:1x32x0x128; otherwise I would have to export two kinds of decoder.onnx (with and without the cache input). cc @lingffff @Oldpan
Please check past_key_value in modeling_llama.py:
https://github.com/huggingface/transformers/blob/273f5ba0266b223c1d611bd00d4a4b2d58771a33/src/transformers/models/llama/modeling_llama.py#L213
if past_key_value is not None:
kv_seq_len += past_key_value[0].shape[-2]
..
if past_key_value is not None:
# reuse k, v, self_attention
key_states = torch.cat([past_key_value[0], key_states], dim=2)
value_states = torch.cat([past_key_value[1], value_states], dim=2)
past_key_value = (key_states, value_states) if use_cache else None
@tpoisonooo hi, did you test LLaMA's performance on TensorRT? How fast are tokens generated?
On my GTX1060 Ti, I got 3.5~5 ms per backbone decoder block with the LLaMA 7B fp16 model; with 32 blocks that is 1000 / (5*32) ≈ 6 tokens/second.
I wonder when this problem will be solved, thank you
I wonder if there is any progress on this issue
Changing np.ones to np.random.rand in the data_loader.py above makes this case FAIL.
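For reference, the failing variant would look something like this (a sketch of the same loader with random inputs; keeping position_ids as integer ones is my assumption, since random floats make no sense for indices):

import numpy as np

def load_data():
    for _ in range(1):
        yield {"hidden_in": np.random.rand(1, 1, 4096).astype(np.float16),
               "attn_mask": np.random.rand(1, 1, 1, 50).astype(np.float16),
               "position_ids": np.ones((1, 1), dtype=np.int64),  # left as ones (assumption)
               "past_key_in": np.random.rand(1, 32, 49, 128).astype(np.float16),
               "past_value_in": np.random.rand(1, 32, 49, 128).astype(np.float16)}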
is this fixed?
Not yet. If you want to run LLaMA inference on CUDA, try https://github.com/InternLM/lmdeploy
@tpoisonooo Is this approach any different from LLaMA on Optimum? I already converted to TensorRT with Optimum:
huggingface/optimum#975
Does the performance test above have a runnable demo for TensorRT?
It is different.
How did you convert to TensorRT with Optimum? I cannot find any reference about this.
Hi all, we successfully converted it to TensorRT!
Each LlamaDecoderLayer was split into three segments: pre, mid, and post. The KV cache between the pre and mid segments reverts to PyTorch for computation. The dimensions inside pre and post do not undergo a transpose operation, enabling batch processing. The mid segment has no parameters, eliminating the need for 32 copies; instead, a single mid engine is placed on each card. See https://github.com/torchpipe/LLM.TensorRT.Serve
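A conceptual sketch of one decoder step under the pre/mid/post split described above (hypothetical names, not the actual LLM.TensorRT.Serve code; trt_pre/trt_mid/trt_post stand for the three TensorRT engines):

import torch

def decoder_layer_step(hidden_in, attn_mask, past_key, past_value,
                       trt_pre, trt_mid, trt_post):
    # pre (TensorRT, per-layer weights): q/k/v projections, no transpose -> batchable
    q, k_new, v_new = trt_pre(hidden_in)
    # PyTorch: append this step's k/v to the cache along the sequence dim
    key = torch.cat([past_key, k_new], dim=2)
    value = torch.cat([past_value, v_new], dim=2)
    # mid (TensorRT, parameter-free): attention core, one shared engine per card
    attn = trt_mid(q, key, value, attn_mask)
    # post (TensorRT, per-layer weights): output projection + MLP
    hidden_out = trt_post(attn)
    return hidden_out, key, value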