google-research/deeplab2

Improving Inference Speed (custom merge op slows down processing)

SkepticRaven opened this issue · 5 comments

I'm working on optimizing inference speed on a custom dataset, using models based on both the Panoptic-DeepLab (ResNet-50 backbone) and kMaX-DeepLab (ConvNeXt backbone) approaches. The custom dataset is very similar to Cityscapes (input size, object size), just with fewer classes and fewer expected objects.

Across these tests, I'm toggling config.evaluator_options.merge_semantic_and_instance_with_tf_op (with/without the op) and compiling the op against either the CPU or the GPU.
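For reference, this is the setting being toggled; a minimal textproto sketch of the relevant evaluator options fragment (field name per deeplab2's evaluator.proto, other fields elided):

```
evaluator_options {
  # Set to false to fall back to the pure-TF merging path.
  merge_semantic_and_instance_with_tf_op: true
}
```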

The Panoptic-DeepLab model:

  1. 4.5 FPS without the op
  2. 2.0 FPS with the op on the GPU
  3. 4.7 FPS with the op on the CPU

The kMaX-DeepLab model:

  1. 2.0 FPS without the op
  2. 2.0 FPS with the op on the GPU
  3. 2.1 FPS with the op on the CPU

I would have thought that putting the dedicated op on the GPU would have at least helped a bit.
I suspect that this performance drop is related to a GPU-CPU communication bottleneck where the pre-op data just happens to be smaller than the post-op data (at least for panoptic-deeplab).
Are there any methods or suggestions to trim down the data being transferred, to increase the inference speed?
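For what it's worth, a rough back-of-envelope of the tensor sizes on each side of the merge (assuming Cityscapes-like shapes of 1025x2049 with 19 classes; the exact tensors the op consumes may differ) can be sketched as:

```python
# Back-of-envelope: bytes crossing the device boundary before vs. after
# the merge op, under assumed Cityscapes-like shapes (hypothetical numbers).
H, W, NUM_CLASSES = 1025, 2049, 19  # assumed evaluation resolution / classes

def mib(num_bytes):
    """Convert a byte count to MiB."""
    return num_bytes / 2**20

# Pre-merge inputs: semantic probabilities (float32, H x W x C) plus an
# instance map (int32, H x W); center/offset heads are comparatively small.
semantic_probs = H * W * NUM_CLASSES * 4  # float32
instance_map = H * W * 4                  # int32
pre_merge = semantic_probs + instance_map

# Post-merge output: a single panoptic label map (int32, H x W).
post_merge = H * W * 4

print(f"pre-merge : {mib(pre_merge):.1f} MiB")
print(f"post-merge: {mib(post_merge):.1f} MiB")
```

Under these assumptions the pre-merge tensors dwarf the merged panoptic map, so keeping the merge on-device should in principle shrink the transfer; if the measured slowdown persists, the op itself is presumably landing on the CPU anyway.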

Thanks!

I saw this comment, but it's unclear how one should remove semantic logits.

deeplab2/model/deeplab.py

Lines 163 to 165 in 916b7c8

# Change the semantic logits to probabilities with softmax. Note
# one should remove semantic logits for faster inference. We still
# keep them since they will be used to compute evaluation loss.

It also appears that the model uses a lot of dicts to pass values between the backbone and the post-processor, which might be the root of the performance drop, since as far as I know base Python dict operations are not GPU-supported (and are therefore CPU-bound).
Changing this line to return only result_dict['panoptic_pred'] improved performance, but there's probably more room for improvement.

return result_dict

Note: This modification probably breaks a whole bunch of things, but might be in the right direction...
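As a sketch of the kind of trimming meant here (key names mirror deeplab2's output dict but are assumptions; the surrounding model code is elided), restricting the output dict to the panoptic prediction before returning might look like:

```python
# Hypothetical sketch: keep only the keys needed at deployment time, so
# unused heads (e.g. semantic probabilities) are never fetched from the GPU.
# The key names below are assumptions, not confirmed deeplab2 identifiers.
_DEPLOY_KEYS = ('panoptic_pred',)

def trim_result_dict(result_dict, keys=_DEPLOY_KEYS):
    """Return a copy of result_dict restricted to the requested keys."""
    return {k: result_dict[k] for k in keys if k in result_dict}

outputs = {'panoptic_pred': 'tensor_a', 'semantic_probs': 'tensor_b'}
print(trim_result_dict(outputs))  # only the panoptic entry survives
```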

Hi @SkepticRaven,

Thanks for the issue.

It seems to us that you have not successfully compiled the merging operation for GPU, and thus the merging operation is run on CPU, causing extra GPU-CPU communication.
You could run the provided unit test to make sure the merging operation is runnable on GPU.

Finally, the open-source code is not designed for fast inference; it is meant as a tutorial. There is a lot of redundant code (e.g., saving visualization results and so on), which you may need to optimize yourself if you are aiming for better inference speed.

Cheers,

Thanks for the response! Unfortunate, but at least it's direct news that optimizing the network for inference speed is a task left to me.

I'm not sure how to interpret the unit test, as it seems to succeed even when the op is compiled for CPU only. Interestingly, the CPU-compiled tests run faster than the GPU-compiled tests (consistent with my inference numbers above for the Panoptic-DeepLab model).
Additionally, since I suspect extra CPU-GPU communication overhead, this closed issue may be relevant, so I also tested disabling soft device placement for the custom op.

The two compiled ops are clearly different, since the GPU one is significantly larger:

find . -name '*.so' | xargs ls -l | awk -F' ' '{ print $5, $9 }'
80600 ./cpu/deeplab2/tensorflow_ops/kernels/merge_semantic_and_instance_maps_op.so
582672 ./gpu/deeplab2/tensorflow_ops/kernels/merge_semantic_and_instance_maps_op.so

nvcc does give 3 warnings when compiling the op for the GPU, but I don't think they are related to the issue:

/usr/local/lib/python3.8/dist-packages/tensorflow/include/tensorflow/core/platform/file_system.h(579): warning #611-D: overloaded virtual function "tensorflow::FileSystem::FilesExist" is only partially overridden in class "tensorflow::WrappedFileSystem"
/usr/local/lib/python3.8/dist-packages/tensorflow/include/tensorflow/core/platform/file_system.h(579): warning #611-D: overloaded virtual function "tensorflow::FileSystem::CreateDir" is only partially overridden in class "tensorflow::WrappedFileSystem"
/usr/local/lib/python3.8/dist-packages/tensorflow/include/tensorflow/core/platform/env.h(498): warning #611-D: overloaded virtual function "tensorflow::Env::RegisterFileSystem" is only partially overridden in class "tensorflow::EnvWrapper"

CPU compiled op test:

python deeplab2/tensorflow_ops/python/kernel_tests/merge_semantic_and_instance_maps_op_test.py
Running tests under Python 3.8.10: /usr/bin/python
[ RUN      ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMaps
2022-09-01 10:03:19.565365: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-01 10:03:20.042899: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10736 MB memory:  -> device: 0, name: NVIDIA TITAN X (Pascal), pci bus id: 0000:04:00.0, compute capability: 6.1
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMaps): 0.62s
I0901 10:03:20.188648 139773341120320 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMaps): 0.62s
[       OK ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMaps
[ RUN      ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithRandomInputs
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithRandomInputs): 5.21s
I0901 10:03:25.398272 139773341120320 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithRandomInputs): 5.21s
[       OK ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithRandomInputs
[ RUN      ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit): 0.0s
I0901 10:03:25.399150 139773341120320 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit): 0.0s
[       OK ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit
[ RUN      ] MergeSemanticAndInstanceMapsOpGpuTest.test_session
[  SKIPPED ] MergeSemanticAndInstanceMapsOpGpuTest.test_session
[ RUN      ] MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMaps
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMaps): 0.0s
I0901 10:03:25.399942 139773341120320 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMaps): 0.0s
[       OK ] MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMaps
[ RUN      ] MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit): 0.0s
I0901 10:03:25.400452 139773341120320 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit): 0.0s
[       OK ] MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit
[ RUN      ] MergeSemanticAndInstanceMapsOpTest.test_session
[  SKIPPED ] MergeSemanticAndInstanceMapsOpTest.test_session
----------------------------------------------------------------------
Ran 7 tests in 5.836s

OK (skipped=2)

GPU compiled op test:

python deeplab2/tensorflow_ops/python/kernel_tests/merge_semantic_and_instance_maps_op_test.py
Running tests under Python 3.8.10: /usr/bin/python
[ RUN      ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMaps
2022-09-01 10:03:39.248114: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-01 10:03:39.712140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10748 MB memory:  -> device: 0, name: NVIDIA TITAN X (Pascal), pci bus id: 0000:04:00.0, compute capability: 6.1
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMaps): 0.61s
I0901 10:03:39.858530 140563836565312 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMaps): 0.61s
[       OK ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMaps
[ RUN      ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithRandomInputs
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithRandomInputs): 7.34s
I0901 10:03:47.201435 140563836565312 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithRandomInputs): 7.34s
[       OK ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithRandomInputs
[ RUN      ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit): 0.0s
I0901 10:03:47.202436 140563836565312 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit): 0.0s
[       OK ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit
[ RUN      ] MergeSemanticAndInstanceMapsOpGpuTest.test_session
[  SKIPPED ] MergeSemanticAndInstanceMapsOpGpuTest.test_session
[ RUN      ] MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMaps
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMaps): 0.0s
I0901 10:03:47.203351 140563836565312 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMaps): 0.0s
[       OK ] MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMaps
[ RUN      ] MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit): 0.0s
I0901 10:03:47.203943 140563836565312 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit): 0.0s
[       OK ] MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit
[ RUN      ] MergeSemanticAndInstanceMapsOpTest.test_session
[  SKIPPED ] MergeSemanticAndInstanceMapsOpTest.test_session
----------------------------------------------------------------------
Ran 7 tests in 7.957s

OK (skipped=2)

GPU op test when I disable soft placement for the parsing_maps_gpu = merge_semantic_and_instance_maps_op.merge_semantic_and_instance_maps(...) call in the test (per the other issue):

python deeplab2/tensorflow_ops/python/kernel_tests/merge_semantic_and_instance_maps_op_test.py
Running tests under Python 3.8.10: /usr/bin/python
[ RUN      ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMaps
2022-09-01 10:20:23.257649: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-01 10:20:23.655486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10799 MB memory:  -> device: 0, name: NVIDIA TITAN X (Pascal), pci bus id: 0000:04:00.0, compute capability: 6.1
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMaps): 0.54s
I0901 10:20:23.801713 139899995129664 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMaps): 0.54s
[       OK ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMaps
[ RUN      ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithRandomInputs
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithRandomInputs): 1.19s
I0901 10:20:24.988076 139899995129664 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithRandomInputs): 1.19s
[  FAILED  ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithRandomInputs
[ RUN      ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit): 0.0s
I0901 10:20:24.989388 139899995129664 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit): 0.0s
[       OK ] MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit
[ RUN      ] MergeSemanticAndInstanceMapsOpGpuTest.test_session
[  SKIPPED ] MergeSemanticAndInstanceMapsOpGpuTest.test_session
[ RUN      ] MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMaps
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMaps): 0.0s
I0901 10:20:24.990205 139899995129664 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMaps): 0.0s
[       OK ] MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMaps
[ RUN      ] MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit
INFO:tensorflow:time(__main__.MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit): 0.0s
I0901 10:20:24.990737 139899995129664 test_util.py:2457] time(__main__.MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit): 0.0s
[       OK ] MergeSemanticAndInstanceMapsOpTest.testMergeSemanticAndInstanceMapsWithStuffAreaLimit
[ RUN      ] MergeSemanticAndInstanceMapsOpTest.test_session
[  SKIPPED ] MergeSemanticAndInstanceMapsOpTest.test_session
======================================================================
ERROR: testMergeSemanticAndInstanceMapsWithRandomInputs (__main__.MergeSemanticAndInstanceMapsOpGpuTest)
MergeSemanticAndInstanceMapsOpGpuTest.testMergeSemanticAndInstanceMapsWithRandomInputs
----------------------------------------------------------------------
Traceback (most recent call last):
  File "deeplab2/tensorflow_ops/python/kernel_tests/merge_semantic_and_instance_maps_op_test.py", line 198, in testMergeSemanticAndInstanceMapsWithRandomInputs
    merge_semantic_and_instance_maps_op.merge_semantic_and_instance_maps(
  File "<string>", line 92, in merge_semantic_and_instance_maps
  File "<string>", line 179, in merge_semantic_and_instance_maps_eager_fallback
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Could not satisfy device specification '/job:localhost/replica:0/task:0/device:GPU:0'. enable_soft_placement=0. Supported device types [CPU]. All available devices [/job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:CPU:0]. [Op:MergeSemanticAndInstanceMaps]

----------------------------------------------------------------------
Ran 7 tests in 1.734s

FAILED (errors=1, skipped=2)


Follow-up on this idea after some poking around, which may be useful for other people:
I tried removing other uses of dictionaries and didn't see any further significant performance improvements, so the dicts themselves weren't causing a performance drop. The original comment about removing semantic logits can be realized by simply removing that entry from the dict before returning (TensorFlow then skips computing that part of the graph, even though it knows how it's computed). I ended up adding a new option in evaluator.proto so I could switch to this mode with a simple if statement, without breaking things like the training/evaluation routines.
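The gating described above can be sketched as follows (the flag and key names are hypothetical; the real flag would be a new field added to evaluator.proto and plumbed through):

```python
# Hypothetical sketch of gating output pruning on an evaluator option.
# 'prune_outputs_for_inference' is an invented flag name; the key names
# are assumptions rather than confirmed deeplab2 identifiers.
def postprocess_outputs(result_dict, prune_outputs_for_inference=False):
    if prune_outputs_for_inference:
        # Dropping a key means that branch of the graph is never fetched,
        # so e.g. the semantic softmax is skipped entirely at inference.
        result_dict = dict(result_dict)  # avoid mutating the caller's dict
        result_dict.pop('semantic_logits', None)
        result_dict.pop('semantic_probs', None)
    return result_dict
```

With the flag off, the full dict passes through untouched, so training and evaluation routines keep working.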

For my use case, I only need the panoptic predictions at deployment, so returning only that part of the dict yields a pretty nice 20-25% speed boost.

I'm still interested in seeing whether there's more room for improvement, in case I'm doing something wrong with the custom op.

Closing, because the response above is provided as example code. My solution ended up being so specific to my task (only one object type) that providing a generalized pull request is outside the scope of my current work.