Why does the tf.where operation run slowly on GPU?
Our goal is to use the object detection API on the NVIDIA Jetson platform.
Here are some experiments:
The tf.where op takes a long time on GPU
1. Test on a TITAN Z PC
2. Model: ssd_mobilenet_v1_coco_2017_11_17
3. Profiled with the method mentioned here (a sketch of aggregating the resulting trace follows this list):
https://towardsdatascience.com/howto-profile-tensorflow-1a49fb18073d
4. GPU model: 15ms for CNN, 50ms for post-processing
5. CPU model: 55ms for CNN, 20ms for post-processing
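For reference, a rough sketch of how the CNN vs. post-processing split can be read out of such a trace is below. The 'FeatureExtractor' name prefix is an assumption about the node naming in this graph (check the real node names first), and summing durations across all trace rows only gives an approximate split.

import json

# Rough sketch: split total op time in the Chrome trace (see the trace-writing
# code later in this issue) into "CNN" vs. "everything else" by node-name prefix.
# The 'FeatureExtractor' prefix is an assumed naming convention; summing 'dur'
# across every trace row mixes CPU-scheduling and GPU-stream rows, so treat the
# numbers as a rough split, not exact latencies.
with open('Experiment_1.json') as f:
    trace = json.load(f)

cnn_us, other_us = 0, 0
for event in trace.get('traceEvents', []):
    if event.get('ph') != 'X':          # 'X' = complete event carrying a duration
        continue
    node_name = event.get('args', {}).get('name', event.get('name', ''))
    if node_name.startswith('FeatureExtractor'):
        cnn_us += event.get('dur', 0)   # durations are in microseconds
    else:
        other_us += event.get('dur', 0)

print('CNN ops: %.1f ms, other ops: %.1f ms' % (cnn_us / 1e3, other_us / 1e3))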
GPU + CPU gives better performance
- We hacked the .pb file to put the CNN network on GPU and the map-related network on CPU
- Performance is better:
| CNN | MAP | Time |
| --- | --- | --- |
| CPU | CPU | 0.16 s - 0.19 s |
| GPU | GPU | 0.20 s - 0.21 s |
| GPU | CPU | 0.07 s |
Change input resolution
Following issue 3243, we tested a 600x600 image but still got an execution time similar to that of the default image, which is 1024x636.
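For completeness, a minimal sketch of that resolution test, assuming the session (sess), the input/output tensors, TEST_IMAGE_PATHS, and load_image_into_numpy_array from the object detection demo are already set up:

import time
import numpy as np
from PIL import Image

# Resize the test image to 600x600 before feeding it, then time a single run.
image = Image.open(TEST_IMAGE_PATHS[1]).resize((600, 600))
image_np_expanded = np.expand_dims(load_image_into_numpy_array(image), axis=0)

start_time = time.time()
sess.run([detection_boxes, detection_scores, detection_classes, num_detections],
         feed_dict={image_tensor: image_np_expanded})
print('600x600 input: %.3f sec' % (time.time() - start_time))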
Our questions are:
1. What is the device placement of the object detection API for the ssd_mobilenet_v1_coco_2017_11_17 model mentioned here:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
2. Why does tf.where run slower on GPU?
3. How can we achieve results similar to the reported 30 ms with the ssd_mobilenet_v1_coco_2017_11_17 model?
Our environment:
Ubuntu 16.04 + TF 1.4 + CUDA 8 + TITAN Z / TX2
Thanks.
System information
- What is the top-level directory of the model you are using: models/research/object_detection/
- Have I written custom code: No custom code for reproducing the bug; I have written custom code for diagnosing.
- OS Platform and Distribution: Linux Ubuntu 16.04
- TensorFlow installed from (source or binary): Anaconda conda-forge channel
- TensorFlow version: b'unknown' 1.4.1 (output from python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)")
- CUDA/cuDNN version: CUDA 8.0 / cuDNN 6.0
- GPU model and memory: 1 TITAN X (Pascal), 12189 MiB
- Exact command to reproduce: Run the provided object detection demo (ssd_mobilenet_v1_coco_2017_11_17 model) with a small modification in the last cell to record the inference speed:
i = 0
for _ in range(10):
  image_path = TEST_IMAGE_PATHS[1]
  i += 1
  image = Image.open(image_path)
  # The array-based representation of the image will be used later in order to prepare the
  # result image with boxes and labels on it.
  image_np = load_image_into_numpy_array(image)
  # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
  image_np_expanded = np.expand_dims(image_np, axis=0)
  # Actual detection.
  options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
  run_metadata = tf.RunMetadata()
  start_time = time.time()
  (boxes, scores, classes, num) = sess.run(
      [detection_boxes, detection_scores, detection_classes, num_detections],
      feed_dict={image_tensor: image_np_expanded})
  print('Iteration %d: %.3f sec' % (i, time.time() - start_time))
The results show that the inference speed is much slower than the reported inference speed of 30 ms on the model zoo page:
Iteration 1: 2.212 sec
Iteration 2: 0.069 sec
Iteration 3: 0.076 sec
Iteration 4: 0.068 sec
Iteration 5: 0.072 sec
Iteration 6: 0.072 sec
Iteration 7: 0.071 sec
Iteration 8: 0.079 sec
Iteration 9: 0.085 sec
Iteration 10: 0.071 sec
Describe the problem
Summary:
By directly running the provided object detection demo, the observed inference speed of the object detection models in the model zoo is much slower than the reported inference speed. With some hacking, an inference speed higher than the reported speed can be achieved. After some diagnostics, it is highly likely that the slow inference speed is caused by one of the following:
- tf.where and other post-processing operations run anomalously slowly on GPU; or
- the frozen inference graph lacks the ability to optimize the GPU/CPU assignment.
Proof of the hypothesis: tf.where and other post-processing operations run anomalously slowly on GPU
By outputting a trace file, we can diagnose the running time of each node in detail.
To output the trace file, modify the last cell of the object detection demo as follows:
from tensorflow.python.client import timeline

with detection_graph.as_default():
  with tf.Session(graph=detection_graph) as sess:
    # Define input and output Tensors for detection_graph
    image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
    # Each box represents a part of the image where a particular object was detected.
    detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
    # Each score represents the level of confidence for each of the objects.
    # Score is shown on the result image, together with the class label.
    detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
    detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
    num_detections = detection_graph.get_tensor_by_name('num_detections:0')
    i = 0
    for _ in range(10):
      image_path = TEST_IMAGE_PATHS[1]
      i += 1
      image = Image.open(image_path)
      # The array-based representation of the image will be used later in order to prepare the
      # result image with boxes and labels on it.
      image_np = load_image_into_numpy_array(image)
      # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
      image_np_expanded = np.expand_dims(image_np, axis=0)
      # Actual detection, with full tracing enabled.
      options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
      run_metadata = tf.RunMetadata()
      start_time = time.time()
      (boxes, scores, classes, num) = sess.run(
          [detection_boxes, detection_scores, detection_classes, num_detections],
          feed_dict={image_tensor: image_np_expanded},
          options=options, run_metadata=run_metadata)
      print('Iteration %d: %.3f sec' % (i, time.time() - start_time))
      # Visualization of the results of a detection.
      vis_util.visualize_boxes_and_labels_on_image_array(
          image_np,
          np.squeeze(boxes),
          np.squeeze(classes).astype(np.int32),
          np.squeeze(scores),
          category_index,
          use_normalized_coordinates=True,
          line_thickness=8)
      plt.figure(figsize=IMAGE_SIZE)
      plt.imshow(image_np)
    # Write the trace of the last iteration in Chrome trace format.
    fetched_timeline = timeline.Timeline(run_metadata.step_stats)
    chrome_trace = fetched_timeline.generate_chrome_trace_format()
    with open('Experiment_1.json', 'w') as f:
      f.write(chrome_trace)
The output json file has been included in the .zip file in the source code section below.
Visualizing the json file in chrome://tracing/ gives:
The CNN-related operations end at ~13 ms. We have noticed that enabling tracing further slows down the inference, but the trace clearly shows that the post-processing operations (after the CNN) run very slowly on GPU.
Proof of the hypothesis: the frozen inference graph lacks the ability to optimize the GPU/CPU assignment
We added a hack to see whether we could achieve a higher inference speed: manually assigning the CNN-related nodes to GPU and the remaining nodes to CPU. The idea is to use the GPU to accelerate only the CNN operations and leave the post-processing operations on the CPU; a minimal sketch of this device pinning is shown below.
The source code has been included in the .zip file in the source code section below.
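For reference, a minimal sketch of this kind of device pinning. The helper name, the .pb path, and the 'FeatureExtractor' prefix are illustrative assumptions; the actual prefixes should be checked against the node names in the graph, and our actual hack is in the attached source code.

import tensorflow as tf

def load_graph_with_manual_placement(pb_path, cnn_prefix='FeatureExtractor'):
  # Load the frozen GraphDef and overwrite each node's device field before
  # importing it: CNN nodes go to GPU, everything else goes to CPU.
  # `cnn_prefix` is an assumed naming convention, not taken from the model.
  graph_def = tf.GraphDef()
  with tf.gfile.GFile(pb_path, 'rb') as f:
    graph_def.ParseFromString(f.read())
  for node in graph_def.node:
    if node.name.startswith(cnn_prefix):
      node.device = '/device:GPU:0'
    else:
      node.device = '/device:CPU:0'
  graph = tf.Graph()
  with graph.as_default():
    tf.import_graph_def(graph_def, name='')
  return graph

# Hypothetical usage; allow_soft_placement lets TF fall back to another device
# when a pinned op has no kernel for the requested one.
detection_graph = load_graph_with_manual_placement(
    'ssd_mobilenet_v1_coco_2017_11_17/frozen_inference_graph.pb')
sess = tf.Session(graph=detection_graph,
                  config=tf.ConfigProto(allow_soft_placement=True))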
With this hack, we are able to observe a higher inference speed than the reported speed.
Iteration 1: 1.021 sec
Iteration 2: 0.027 sec
Iteration 3: 0.026 sec
Iteration 4: 0.027 sec
Iteration 5: 0.026 sec
Iteration 6: 0.026 sec
Iteration 7: 0.026 sec
Iteration 8: 0.031 sec
Iteration 9: 0.031 sec
Iteration 10: 0.026 sec
To verify these hypotheses, here are some questions we would like the TensorFlow team to answer:
- Are the inference speed numbers reported on the detection model zoo page measured on the frozen inference graphs or on the original graphs?
- Are the slow tf.where and other post-processing operations supposed to run on GPU or CPU? Is the slow running speed on GPU normal?
- Is there a device-assignment function that optimizes GPU/CPU use in the original TensorFlow graphs? Is that function missing in the frozen inference graphs?