trt gives random output values, diffs with onnxruntime
tpoisonooo opened this issue · 20 comments
Description
I am using the TensorRT Python API to convert an onnx model; the script does not finish even after 2 hours.
But trtexec --onnx=model.onnx --fp16 stops normally and gives me model.engine.
Environment
TensorRT Version: 8.6.1.6 GA
NVIDIA GPU: GTX1660
NVIDIA Driver Version: 515.86.01
CUDA Version: 11.7
CUDNN Version: 8.4.1
Operating System: ubuntu20.04
Python Version (if applicable): 3.9
Tensorflow Version (if applicable): -
PyTorch Version (if applicable): 2.0
Baremetal or Container (if so, version):
Relevant Files
fp16 onnx model, download here: https://huggingface.co/tpoisonooo/alpaca.onnx/blob/fp16/decoder-merge-0.onnx
single-file conversion script, download here: https://github.com/tpoisonooo/llama.onnx/blob/add-trt-backend/tools/onnx-to-trt.py
Steps To Reproduce
- Download the onnx and save it to onnx_model_dir
- Install the TensorRT Python package and run the script:
$ python3 onnx-to-trt.py onnx_model_dir output_engine_dir
This script does not finish.
- But trtexec works:
$ trtexec --onnx=/path/to/onnx_models/decoder-merge-0.onnx --fp16
$ ls
.. decoder.engine
Notes
This onnx is part of the LLaMA HuggingFace-format export.
Since LLaMA needs a KV cache and there is an If operator here, I have to build an empty_tensor to hack around it.
So past_key_in.min_shape is [1,32,0,128]; this works on onnxruntime.
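To make the first decoding step work without a cache, the empty-tensor hack presumably looks like this (a sketch, not the author's exact code):

import numpy as np

# First decoding step: no KV cache yet, so the sequence dim of the cache is 0.
empty_past_key = np.zeros((1, 32, 0, 128), dtype=np.float16)
empty_past_value = np.zeros((1, 32, 0, 128), dtype=np.float16)
# After t generated tokens the cache grows along dim 2: (1, 32, t, 128).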
As noted above, trtexec --onnx=model.onnx --fp16 stops normally and gives me model.engine, but when I run inference with this model.engine I get wrong outputs.
Does the current TensorRT 8.6.1.6 GA support LLaMA?
trtexec --onnx=/path/to/onnx_models/decoder-merge-0.onnx --fp16
Does your model have dynamic shapes? If yes, then you need to set the input shapes.
If trtexec can build the engine normally, then the issue is in your Python script. And I think we support LLaMA, since you can already build the engine with trtexec.
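For reference, setting dynamic shapes with the TensorRT Python API means attaching an optimization profile to the builder config. A minimal sketch, with tensor names and shape ranges taken from the trtexec commands in this thread (this is not the author's actual onnx-to-trt.py):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("decoder-merge-0.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

# Without a profile, dynamic-shape inputs give the builder nothing to optimize for.
profile = builder.create_optimization_profile()
profile.set_shape("hidden_in", min=(1, 1, 4096), opt=(1, 1, 4096), max=(1, 64, 4096))
profile.set_shape("attn_mask", min=(1, 1, 1, 1), opt=(1, 1, 1, 2), max=(1, 1, 64, 192))
profile.set_shape("position_ids", min=(1, 1), opt=(1, 1), max=(1, 64))
profile.set_shape("past_key_in", min=(1, 32, 0, 128), opt=(1, 32, 1, 128), max=(1, 32, 192, 128))
profile.set_shape("past_value_in", min=(1, 32, 0, 128), opt=(1, 32, 1, 128), max=(1, 32, 192, 128))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("decoder.engine", "wb") as f:
    f.write(engine_bytes)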
Thanks, I have converted it to .engine with:
trtexec --onnx=decoder-merge-0.onnx \
  --minShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1,position_ids:1x1,past_key_in:1x32x0x128,past_value_in:1x32x0x128 \
  --optShapes=hidden_in:1x1x4096,attn_mask:1x1x1x2,position_ids:1x1,past_key_in:1x32x1x128,past_value_in:1x32x1x128 \
  --maxShapes=hidden_in:1x64x4096,attn_mask:1x1x64x192,position_ids:1x64,past_key_in:1x32x192x128,past_value_in:1x32x192x128 \
  --fp16 --saveEngine=decoder.engine
Let me try inference value later.
@zerollzeng I got wrong output values from TRT, which differ from onnxruntime. Here is the reproduction:
- Download the onnx; take decoder-merge-0.onnx as an example: https://huggingface.co/tpoisonooo/alpaca.onnx/blob/fp16/decoder-merge-0.onnx
- Generate the .engine with the trtexec command mentioned before:
trtexec --onnx=decoder-merge-0.onnx \
  --minShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1,position_ids:1x1,past_key_in:1x32x0x128,past_value_in:1x32x0x128 \
  --optShapes=hidden_in:1x1x4096,attn_mask:1x1x1x2,position_ids:1x1,past_key_in:1x32x1x128,past_value_in:1x32x1x128 \
  --maxShapes=hidden_in:1x64x4096,attn_mask:1x1x64x192,position_ids:1x64,past_key_in:1x32x192x128,past_value_in:1x32x192x128 \
  --fp16 --saveEngine=decoder-merge-0.engine
- Download the test data: https://github.com/tpoisonooo/llama.onnx/tree/add-trt-backend/data ; there are 3 numpy arrays:
$ llama.onnx git:(add-trt-backend) cd data && ls *
attn_mask.npy hidden_in.npy position_ids.npy
- Open this single Python script and set the onnx/engine file paths: https://github.com/tpoisonooo/llama.onnx/blob/add-trt-backend/llama/trt_wrapper.py#L157
# inference with trt
trt_wrapper = TrtWrapper('path/to/decoder-merge-0.engine')
trt_outputs = trt_wrapper.forward(_inputs)
# with ort
ort_wrapper = OrtWrapper('path/to/decoder-merge-0.onnx')
ort_outputs = ort_wrapper.forward(_inputs)
- Run it; it prints np.allclose results and diff.max() (see the sketch after the log below):
(base) ➜ llama git:(add-trt-backend) python3 trt_wrapper.py
2023-05-09 16:20:05.846 | DEBUG | __main__:__init__:128 - /home/khj/下载/7b-onnx/alpaca-onnx-7B-fp16/models/decoder-merge-0.onnx loaded
False
False
False
7.645
# again
(base) ➜ llama git:(add-trt-backend) python3 trt_wrapper.py
2023-05-09 16:22:33.620 | DEBUG | __main__:__init__:128 - /home/khj/下载/7b-onnx/alpaca-onnx-7B-fp16/models/decoder-merge-0.onnx loaded
False
False
False
4.492
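Presumably the three False lines and the final number come from a check along these lines (a sketch; the real comparison lives in trt_wrapper.py, and which tensor feeds diff.max() is my guess):

import numpy as np

def report_diff(trt_outputs, ort_outputs, atol=1e-2):
    # One allclose per output tensor (past_key, past_value, hidden_out).
    for trt_out, ort_out in zip(trt_outputs, ort_outputs):
        print(np.allclose(trt_out, ort_out, atol=atol))
    # Max absolute difference; guessed to be taken on the last output here.
    diff = np.abs(trt_outputs[-1].astype(np.float32) -
                  ort_outputs[-1].astype(np.float32))
    print(round(float(diff.max()), 3))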
TRT gives me random output values; note that diff.max() differs between the two runs.
cc @lingffff
Could you try with Polygraphy? see https://github.com/NVIDIA/TensorRT/tree/main/tools/Polygraphy/examples/cli/run/01_comparing_frameworks
a quick check would be like polygraphy run decoder-merge-0.onnx --trt --fp16 --onnxrt --trt-min-shapes xxx --trt-opt-shapes xxx --trt-max-shapes xxx --input-shapes xxx --data-loader-script data_loader.py
refer to polygraphy run -h
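If the CLI feels heavy, roughly the same check can be written with Polygraphy's Python API. A sketch, assuming the shape ranges from the trtexec command above (not an official recipe for this model):

from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import (CreateConfig, EngineFromNetwork,
                                    NetworkFromOnnxPath, Profile, TrtRunner)
from polygraphy.comparator import Comparator, CompareFunc

model = "decoder-merge-0.onnx"
profile = Profile()
profile.add("hidden_in", min=(1, 1, 4096), opt=(1, 1, 4096), max=(1, 64, 4096))
profile.add("attn_mask", min=(1, 1, 1, 1), opt=(1, 1, 1, 2), max=(1, 1, 64, 192))
profile.add("position_ids", min=(1, 1), opt=(1, 1), max=(1, 64))
profile.add("past_key_in", min=(1, 32, 0, 128), opt=(1, 32, 1, 128), max=(1, 32, 192, 128))
profile.add("past_value_in", min=(1, 32, 0, 128), opt=(1, 32, 1, 128), max=(1, 32, 192, 128))

build_engine = EngineFromNetwork(NetworkFromOnnxPath(model),
                                 config=CreateConfig(fp16=True, profiles=[profile]))

# Run both backends on the same generated inputs and compare elementwise.
results = Comparator.run([TrtRunner(build_engine), OnnxrtRunner(SessionFromOnnx(model))])
passed = bool(Comparator.compare_accuracy(
    results, compare_func=CompareFunc.simple(atol=1e-2, rtol=1e-3)))
print("PASSED" if passed else "FAILED")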
Ah.. why do we have to learn so many tools QaQ .. I will try it later.
@tpoisonooo Hi, I tested the precision using versions 8.6.0.12 and 8.6.1.6 before, and it seems good to me.
First, I converted it using trtexec:
./trtexec --onnx=/home/oldpan/code/models/GPT/LLAMA/alpaca.onnx/decoder-merge-5.onnx \
  --minShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1,position_ids:1x1,past_key_in:1x32x1x128,past_value_in:1x32x1x128 \
  --optShapes=hidden_in:1x1x4096,attn_mask:1x1x1x1029,position_ids:1x1,past_key_in:1x32x1028x128,past_value_in:1x32x1028x128 \
  --maxShapes=hidden_in:1x1x4096,attn_mask:1x1x1x2049,position_ids:1x1,past_key_in:1x32x2048x128,past_value_in:1x32x2048x128 \
  --shapes=hidden_in:1x1x4096,attn_mask:1x1x1x50,position_ids:1x1,past_key_in:1x32x49x128,past_value_in:1x32x49x128 \
  --fp16 --saveEngine=decoder-merge-5.trt
And then used Polygraphy to test accuracy:
polygraphy run --onnxrt /home/oldpan/code/models/GPT/LLAMA/alpaca.onnx/decoder-merge-4.onnx --save-results=/home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json --data-loader-script /home/oldpan/code/convert/tools/data_loader.py
polygraphy run /home/oldpan/code/models/normal/tensorrt_engine/llama-trt/decoder-merge-4.trt --model-type engine --trt --data-loader-script /home/oldpan/code/convert/tools/data_loader.py --load-outputs /home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json --atol 1e-2 --rtol 1e-3
The output is
[I] RUNNING | Command: /home/oldpan/miniconda3/envs/develop/bin/polygraphy run /home/oldpan/code/models/normal/tensorrt_engine/llama-trt/decoder-merge-4.trt --model-type engine --trt --data-loader-script /home/oldpan/code/project/llama.ddeploy/tools/data_loader.py --load-outputs /home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json --atol 1e-2 --rtol 1e-3
[I] Saving custom input data to custom_inputs.json
[I] trt-runner-N0-05/11/23-10:21:57 | Activating and starting inference
[I] Loading bytes from /home/oldpan/code/models/normal/tensorrt_engine/llama-trt/decoder-merge-4.trt
[W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[W] Input tensor: position_ids | Buffer dtype (int64) does not match expected input dtype (int32), attempting to cast.
[I] trt-runner-N0-05/11/23-10:21:57
---- Inference Input(s) ----
{hidden_in [dtype=float16, shape=(1, 1, 4096)],
attn_mask [dtype=float16, shape=(1, 1, 1, 50)],
position_ids [dtype=int32, shape=(1, 1)],
past_key_in [dtype=float16, shape=(1, 32, 49, 128)],
past_value_in [dtype=float16, shape=(1, 32, 49, 128)]}
[I] trt-runner-N0-05/11/23-10:21:57
---- Inference Output(s) ----
{past_key [dtype=float16, shape=(1, 32, 50, 128)],
past_value [dtype=float16, shape=(1, 32, 50, 128)],
hidden_out [dtype=float16, shape=(1, 1, 4096)]}
[I] trt-runner-N0-05/11/23-10:21:57 | Completed 1 iteration(s) in 1.745 ms | Average inference time: 1.745 ms.
[I] Loading inference results from /home/oldpan/code/data/debug_data/output-debug/decoder-merge-4-onnx-output.json
[I] Accuracy Comparison | trt-runner-N0-05/11/23-10:21:57 vs. onnxrt-runner-N0-04/22/23-22:27:42
[I] Comparing Output: 'past_key' (dtype=float16, shape=(1, 32, 50, 128)) with 'past_key' (dtype=float16, shape=(1, 32, 50, 128))
[I] Tolerance: [abs=0.01, rel=0.001] | Checking elemwise error
[I] trt-runner-N0-05/11/23-10:21:57: past_key | Stats: mean=0.97991, std-dev=0.15628, var=0.024423, median=1, min=-1.999 at (0, 10, 49, 114), max=2.8555 at (0, 4, 49, 49), avg-magnitude=0.98727
[I] onnxrt-runner-N0-04/22/23-22:27:42: past_key | Stats: mean=0.97991, std-dev=0.15628, var=0.024423, median=1, min=-2 at (0, 10, 49, 114), max=2.8555 at (0, 4, 49, 49), avg-magnitude=0.98727
[I] Error Metrics: past_key
[I] Minimum Required Tolerance: elemwise error | [abs=0.0019531] OR [rel=2.0484] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=1.3869e-06, std-dev=2.5162e-05, var=6.3313e-10, median=0, min=0 at (0, 0, 0, 0), max=0.0019531 at (0, 10, 49, 43), avg-magnitude=1.3869e-06
[I] Relative Difference | Stats: mean=2.0463e-05, std-dev=0.0048362, var=2.3389e-05, median=0, min=0 at (0, 0, 0, 0), max=2.0484 at (0, 30, 49, 10), avg-magnitude=2.0463e-05
[I] PASSED | Output: 'past_key' | Difference is within tolerance (rel=0.001, abs=0.01)
[I] Comparing Output: 'past_value' (dtype=float16, shape=(1, 32, 50, 128)) with 'past_value' (dtype=float16, shape=(1, 32, 50, 128))
[I] Tolerance: [abs=0.01, rel=0.001] | Checking elemwise error
[I] trt-runner-N0-05/11/23-10:21:57: past_value | Stats: mean=0.97994, std-dev=0.14484, var=0.020979, median=1, min=-1.041 at (0, 31, 49, 55), max=1.0186 at (0, 30, 49, 121), avg-magnitude=0.98395
[I] onnxrt-runner-N0-04/22/23-22:27:42: past_value | Stats: mean=0.97994, std-dev=0.14484, var=0.020979, median=1, min=-1.041 at (0, 31, 49, 55), max=1.0186 at (0, 30, 49, 121), avg-magnitude=0.98395
[I] Error Metrics: past_value
[I] Minimum Required Tolerance: elemwise error | [abs=0.00024414] OR [rel=0.00092421] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=2.099e-09, std-dev=6.0771e-07, var=3.6931e-13, median=0, min=0 at (0, 0, 0, 0), max=0.00024414 at (0, 0, 49, 39), avg-magnitude=2.099e-09
[I] Relative Difference | Stats: mean=5.1702e-08, std-dev=5.9317e-06, var=3.5185e-11, median=0, min=0 at (0, 0, 0, 0), max=0.00092421 at (0, 17, 49, 117), avg-magnitude=5.1702e-08
[I] PASSED | Output: 'past_value' | Difference is within tolerance (rel=0.001, abs=0.01)
[I] Comparing Output: 'hidden_out' (dtype=float16, shape=(1, 1, 4096)) with 'hidden_out' (dtype=float16, shape=(1, 1, 4096))
[I] Tolerance: [abs=0.01, rel=0.001] | Checking elemwise error
[I] trt-runner-N0-05/11/23-10:21:57: hidden_out | Stats: mean=1.0164, std-dev=1.0491, var=1.1007, median=0.98901, min=-3.5781 at (0, 0, 1181), max=9.3516 at (0, 0, 3840), avg-magnitude=1.1887
[I] onnxrt-runner-N0-04/22/23-22:27:42: hidden_out | Stats: mean=1.0164, std-dev=1.0492, var=1.1007, median=0.9895, min=-3.5781 at (0, 0, 1181), max=9.3594 at (0, 0, 3840), avg-magnitude=1.1887
[I] Error Metrics: hidden_out
[I] Minimum Required Tolerance: elemwise error | [abs=0.0078125] OR [rel=0.4306] (requirements may be lower if both abs/rel tolerances are set)
[I] Absolute Difference | Stats: mean=0.0005626, std-dev=0.00064283, var=4.1323e-07, median=0.00048828, min=0 at (0, 0, 0), max=0.0078125 at (0, 0, 3840), avg-magnitude=0.0005626
[I] Relative Difference | Stats: mean=0.0017539, std-dev=0.011641, var=0.0001355, median=0.00062073, min=0 at (0, 0, 0), max=0.4306 at (0, 0, 1582), avg-magnitude=0.0017539
[I] PASSED | Output: 'hidden_out' | Difference is within tolerance (rel=0.001, abs=0.01)
[I] PASSED | All outputs matched | Outputs: ['past_key', 'past_value', 'hidden_out']
The data_loader.py is:
import numpy as np
from polygraphy.json import save_json

INPUT_SHAPE_1 = (1, 1, 4096)
INPUT_SHAPE_2 = (1, 1, 1, 50)
INPUT_SHAPE_3 = (1, 1)
INPUT_SHAPE_4 = (1, 32, 49, 128)
INPUT_SHAPE_5 = (1, 32, 49, 128)

# --shapes=hidden_in:1x1x4096,attn_mask:1x1x1x50,position_ids:1x1,past_key_in:1x32x49x128,past_value_in:1x32x49x128
def load_data():
    for _ in range(1):
        yield {"hidden_in": np.ones(shape=INPUT_SHAPE_1, dtype=np.float16),
               "attn_mask": np.ones(shape=INPUT_SHAPE_2, dtype=np.float16),
               "position_ids": np.ones(shape=INPUT_SHAPE_3, dtype=np.int64),
               "past_key_in": np.ones(shape=INPUT_SHAPE_4, dtype=np.float16),
               "past_value_in": np.ones(shape=INPUT_SHAPE_5, dtype=np.float16)}  # Still totally real data
Hi @Oldpan, I notice that you set --minShapes=past_key_in:1x32x1x128 but @tpoisonooo set --minShapes=past_key_in:1x32x0x128. Maybe this zero-sized-tensor feature causes the problem?
There is a KV cache in LLaMA, so --minShapes=past_key_in:1x32x0x128; otherwise I would have to export two kinds of decoder.onnx (with and without the cache input). cc @lingffff @Oldpan
Please check past_key_value in modeling_llama.py:
https://github.com/huggingface/transformers/blob/273f5ba0266b223c1d611bd00d4a4b2d58771a33/src/transformers/models/llama/modeling_llama.py#L213
if past_key_value is not None:
kv_seq_len += past_key_value[0].shape[-2]
..
if past_key_value is not None:
# reuse k, v, self_attention
key_states = torch.cat([past_key_value[0], key_states], dim=2)
value_states = torch.cat([past_key_value[1], value_states], dim=2)
past_key_value = (key_states, value_states) if use_cache else None
@tpoisonooo hi, did you test LLaMA's performance on TensorRT? How fast are tokens generated?
On my GTX1060 Ti, I got 3.5~5 ms per backbone decoder block with the LLaMA 7B fp16 model; with 32 blocks that is 1000 / (5*32) ≈ 6 tokens/second.
I wonder when this problem will be solved, thank you
I wonder if there is any progress on this issue
Changing np.ones to np.random.rand in the data_loader.py above makes this case FAIL.
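For reference, the failing variant would look something like this (a sketch of the same loader with random inputs; keeping position_ids as integer ones is my assumption, since random floats make no sense for indices):

import numpy as np

def load_data():
    for _ in range(1):
        yield {"hidden_in": np.random.rand(1, 1, 4096).astype(np.float16),
               "attn_mask": np.random.rand(1, 1, 1, 50).astype(np.float16),
               "position_ids": np.ones((1, 1), dtype=np.int64),  # left as ones (assumption)
               "past_key_in": np.random.rand(1, 32, 49, 128).astype(np.float16),
               "past_value_in": np.random.rand(1, 32, 49, 128).astype(np.float16)}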
is this fixed?
Not yet. If you want to run LLaMA inference on CUDA, try https://github.com/InternLM/lmdeploy
@tpoisonooo Is this approach any different from LLaMA on Optimum? I already converted to TensorRT with Optimum:
huggingface/optimum#975
Does the performance test above have a runnable demo for TensorRT?
It is different.
How did you convert to TensorRT with Optimum? I cannot find any reference about this.
Hi all, we successfully converted it to TensorRT!
Each LlamaDecoderLayer was split into three segments: pre, mid, and post. The KV cache between the pre and mid segments reverts to PyTorch for computation. The dimensions inside pre and post do not undergo a transpose operation, enabling batch processing. The mid segment has no parameters, eliminating the need for 32 copies; instead, a single mid engine is placed on each card. See https://github.com/torchpipe/LLM.TensorRT.Serve
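A conceptual sketch of one decoder step under the pre/mid/post split described above (hypothetical names, not the actual LLM.TensorRT.Serve code; trt_pre/trt_mid/trt_post stand for the three TensorRT engines):

import torch

def decoder_layer_step(hidden_in, attn_mask, past_key, past_value,
                       trt_pre, trt_mid, trt_post):
    # pre (TensorRT, per-layer weights): q/k/v projections, no transpose -> batchable
    q, k_new, v_new = trt_pre(hidden_in)
    # PyTorch: append this step's k/v to the cache along the sequence dim
    key = torch.cat([past_key, k_new], dim=2)
    value = torch.cat([past_value, v_new], dim=2)
    # mid (TensorRT, parameter-free): attention core, one shared engine per card
    attn = trt_mid(q, key, value, attn_mask)
    # post (TensorRT, per-layer weights): output projection + MLP
    hidden_out = trt_post(attn)
    return hidden_out, key, value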