
timesfromer_trtfp16_error

TimeSformer is Facebook's transformer-based video model: https://github.com/facebookresearch/TimeSformer

This repository builds a TensorRT engine for TimeSformer and shows that converting to FP16 produces large discrepancies against the PyTorch reference.

The network structure is fairly straightforward (a minimal sketch of one block follows the list):

  • for a video of size 8x224x224 (8 frames at 224x224), generate 8x14x14 patch embeddings
  • run 12 divided space-time (T-S) self-attention blocks
  • use the final classification token as the feature to compute the classification result
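
For reference, here is a minimal PyTorch-style sketch of one divided space-time (T-S) block: temporal attention, then spatial attention, then an MLP, each with pre-LayerNorm and a residual connection. The module names, defaults, and token ordering are illustrative assumptions and the classification-token handling of the real model is omitted; see the official TimeSformer code for the actual implementation.

import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    def __init__(self, dim=768, heads=12, frames=8, patches=14 * 14):
        super().__init__()
        self.frames, self.patches = frames, patches
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        b, _, d = x.shape              # x: (batch, frames * patches, dim)
        t, p = self.frames, self.patches

        # temporal attention: attend across the 8 frames at each spatial location
        xt = x.reshape(b, t, p, d).permute(0, 2, 1, 3).reshape(b * p, t, d)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3).reshape(b, t * p, d)

        # spatial attention: attend across the 14x14 patches within each frame
        xs = x.reshape(b * t, p, d)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        x = xs.reshape(b, t * p, d)

        # MLP with residual
        return x + self.mlp(self.norm_mlp(x))

# example usage
block = DividedSpaceTimeBlock()
out = block(torch.randn(1, 8 * 14 * 14, 768))   # -> (1, 1568, 768)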

Hence the following results are validated against known input/output pairs (a sketch of the comparison follows the list):

  • the final probabilities
  • the patch features after each of the 12 T-S blocks
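
The exact metric used by conversions.py is not reproduced here; the sketch below only illustrates the kind of comparison being made, assuming a max-absolute-difference metric (report_diff is a hypothetical helper, not a function from this repository):

import numpy as np

def report_diff(name, trt_out, ref_out):
    # maximum absolute element-wise difference between the TensorRT output
    # and the PyTorch reference output
    a = np.asarray(trt_out, dtype=np.float32)
    b = np.asarray(ref_out, dtype=np.float32)
    print(f"{name}: {np.abs(a - b).max()}")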

So far we have the following findings:

  • The FP32 engine produces correct results
  • Without any modification, the FP16 engine produces incorrect results: the error on the patch features jumps from roughly 0.01 to 0.4 at the 6th block
  • If we expose the difference tensors from the layer norms as additional outputs, the FP16 engine produces correct results (see the sketch below)
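
For context, one way to expose extra intermediate tensors such as the layer-norm differences is to mark them as additional network outputs before building the engine, which forces TensorRT to keep them (and typically prevents fusing them away at reduced precision). The sketch below uses the standard TensorRT Python API, but the layer-name keyword is a hypothetical example and is not necessarily how --expose_ln_entries is implemented in conversions.py:

import tensorrt as trt

def expose_ln_diff_outputs(network, keyword="ln_diff"):
    # Mark the first output tensor of every layer whose name contains the
    # keyword as an additional network output. The keyword is illustrative;
    # the actual layer names depend on how the network was exported.
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if keyword in layer.name:
            network.mark_output(layer.get_output(0))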

Environment

The environment is built from the TensorRT docker image on NGC:
docker pull nvcr.io/nvidia/tensorrt:21.06-py3

TensorRT: 7.2.3.4
CUDA: 11.3

To Run

# clone repo
git clone https://github.com/zmy1116/timesfromer_trtfp16_error

cd timesfromer_trtfp16_error

# download pretrained weights from TimeSformer official repo
wget https://www.dropbox.com/s/g5t24we9gl5yk88/TimeSformer_divST_8x32_224_K400.pyth

# Build and test the engine in FP16, with and without exposing the layernorm diffs as outputs
python conversions.py TimeSformer_divST_8x32_224_K400.pyth test_data.p fp16engine_expose_ln result_fp16_expose_ln.p --expose_ln_entries --fp16
python conversions.py TimeSformer_divST_8x32_224_K400.pyth test_data.p fp16engine result_fp16 --fp16


# Build and test the engine in FP32, with and without exposing the layernorm diffs as outputs
python conversions.py TimeSformer_divST_8x32_224_K400.pyth test_data.p fp32engine_expose_ln result_fp32_expose_ln.p --expose_ln_entries
python conversions.py TimeSformer_divST_8x32_224_K400.pyth test_data.p fp32engine result_fp32
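
For context, the --fp16 switch corresponds to enabling the FP16 builder flag when the engine is built; without it TensorRT builds a pure FP32 engine, which is the case that matches the reference. Below is a minimal sketch of such a build, assuming an ONNX-parsed network (this is not necessarily the exact structure of conversions.py):

import tensorrt as trt

def build_engine(onnx_path, fp16=False):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # explicit-batch network definition, parsed from an ONNX export of the model
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError("ONNX parse failed")
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30   # 1 GiB workspace
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)   # allow FP16 kernels
    return builder.build_engine(network, config)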

Printed traces/outputs for the 4 cases are included in the traces folder.

Results for FP16 without exposing the layernorm diffs look like the following; notice the sudden error jump at the 6th block:

Run evaluation on test data
Final probabilities difference: 1.0182867
Patch embedding diff after 0 S-T blocks: 0.004299811087548733
Patch embedding diff after 1 S-T blocks: 0.00536126596853137
Patch embedding diff after 2 S-T blocks: 0.00613724160939455
Patch embedding diff after 3 S-T blocks: 0.00694211944937706
Patch embedding diff after 4 S-T blocks: 0.009320137090981007
Patch embedding diff after 5 S-T blocks: 0.013058185577392578
Patch embedding diff after 6 S-T blocks: 0.4091333746910095
Patch embedding diff after 7 S-T blocks: 0.430844247341156
Patch embedding diff after 8 S-T blocks: 0.44067439436912537
Patch embedding diff after 9 S-T blocks: 0.46074366569519043
Patch embedding diff after 10 S-T blocks: 0.45822423696517944
Patch embedding diff after 11 S-T blocks: 0.5575547814369202

Results for FP16 with the layernorm diffs exposed look like the following:

Final probabilities difference: 0.0033378536
Patch embedding diff after 0 S-T blocks: 0.0042906636372208595
Patch embedding diff after 1 S-T blocks: 0.005335847847163677
Patch embedding diff after 2 S-T blocks: 0.0060969325713813305
Patch embedding diff after 3 S-T blocks: 0.006786249577999115
Patch embedding diff after 4 S-T blocks: 0.00906979851424694
Patch embedding diff after 5 S-T blocks: 0.014810287393629551
Patch embedding diff after 6 S-T blocks: 0.015742896124720573
Patch embedding diff after 7 S-T blocks: 0.015805872157216072
Patch embedding diff after 8 S-T blocks: 0.015810402110219002
Patch embedding diff after 9 S-T blocks: 0.01605822518467903
Patch embedding diff after 10 S-T blocks: 0.016058534383773804
Patch embedding diff after 11 S-T blocks: 0.01618446595966816