galeone/tfgo

Slow inference

Closed this issue · 11 comments

First, thanks so much for working on this library! I'm using it for serving a couple of models with good success.

I ran into something that surprised me today; it's not necessarily an issue with this library, but wanted to get your thoughts. I'm comparing execution of an SSD detector via a python script and a go executable. Just comparing the elapsed wall clock time with unix time command, the go executable is quite a bit faster -- around 4 seconds vs around 13 seconds for the python script. However, I wanted finer-grained timing data, so I measured just the model.Exec call in go vs the model.predict call in python, and I found that the go version is roughly twice as slow (3 sec vs 1.7 sec). My guess is that for a single run, parsing all the python code accounts for all of the extra time.

Both of these are using the same model, although the python version is defining the model in code and loading weights from an hdf5 file, while the go version is loading model + weights from the SavedModel file (.pb format) -- not sure if that would make any difference.

Do you have any ideas about why the graph execution would be slower under go, or how I could speed it up?

Thanks!

Hi! I'm happy that someone else is using tfgo, I really appreciate it (also, if you have any feedback or you wanna contribute you're welcome)!

Could you please show your Python and Go code? In this way, we can compare both and maybe together find out from where the slow down comes from.

The first thing that comes to my mind is that the slowdown could come from the session instantiation, but it's only a feeling. Post your codes and let's dig through them

Awesome, thanks for digging into this!

Here's the python version:

from models import SSD300
from models.utils import BBoxUtility
import models.weights

import cv2
import time
import tensorflow as tf
import numpy as np
from keras.applications.imagenet_utils import preprocess_input

weights, classes = models.weights.ssd300()
bbox_util = BBoxUtility(len(classes))

model = SSD300((300, 300, 3), num_classes=len(classes))
model.load_weights(weights, by_name=True)
model._make_predict_function()
graph = tf.get_default_graph()

im = cv2.imread("image.jpg")
if len(im.shape) == 2:
    im = cv2.cvtColor(im, cv2.COLOR_GRAY2BGR)

start = time.time()
I = cv2.resize(im, (300, 300))
after_resize = time.time()
with graph.as_default():
    after_default_graph = time.time()
    inputs = preprocess_input(np.array([I.astype('float')]))
    after_preprocess = time.time()
    preds = model.predict(inputs, batch_size=1, verbose=0)
    after_pred = time.time()
    results = bbox_util.detection_out(preds)
    after_decode = time.time()
    print("resize: %f sec" % (after_resize - start))
    print("set graph: %f sec" % (after_default_graph - after_resize))
    print("preprocess: %f sec" % (after_preprocess - after_default_graph))
    print("inference: %f sec" % (after_pred - after_preprocess))
    print("box decode: %f sec" % (after_decode - after_pred))

The go version is in two parts -- I build a Detector which loads the model, and then I have a method on that object to do inference:

// build detector
func NewDetector(classes []string) *Detector {
	model := tg.LoadModel("saved_model_ssd300", []string{"serve"}, nil)
	d := &Detector{model: model, classes: classes}
	return d
}

// process an image
func (detector *Detector) Detect(img image.Image, verbose bool) Predictions {
	tick := time.Now()

	bounds := img.Bounds().Size()
	resized := resizeImage(img)

	if verbose {
		log.Printf("resize: %s", time.Since(tick))
		tick = time.Now()
	}

	input, _ := tf.NewTensor(imageToArray(resized))

	if verbose {
		log.Printf("preprocess: %s", time.Since(tick))
		tick = time.Now()
	}

	results := detector.model.Exec([]tf.Output{
		detector.model.Op("predictions/concat", 0),
	}, map[tf.Output]*tf.Tensor{
		detector.model.Op("input_1", 0): input,
	})

	if verbose {
		log.Printf("inference: %s", time.Since(tick))
		tick = time.Now()
	}

	raw := results[0].Value().([][][]float32)

	preds := map[string]PredList{}
	for idx, cls := range detector.classes {
		if idx > 0 { // skip background
			preds[cls] = PredList{}
		}
	}

	for _, pred := range decodeBoxes(raw[0], scale(bounds)) {
		cls := detector.classes[pred.Label]
		preds[cls] = append(preds[cls], pred)
	}

	for idx, cls := range detector.classes {
		if idx > 0 { // skip background
			sort.Sort(preds[cls])
		}
	}

	if verbose {
		log.Printf("box decode: %s", time.Since(tick))
		tick = time.Now()
	}

	return preds
}

There are several timing differences that I understand (image preprocessing is much faster in Python since it is vectorized; my go box decoder "cheats" by using heuristics instead of true NMS to collapse box predictions; etc), but I am ignoring all of those. If I just look at inference time, the python version averages 1.7 sec over 10 consecutive runs, while the go version averages 2.5 sec. (running on my CPU) Since that part of the code is executed by tensorflow in both cases, I would expect the timing to be similar. Hopefully I'm just doing something wrong!

The inference time should be similar, you're right.

The only thing that I see is that you're measuring the inference time actually before executing the real inference.

You're invoking *Model.Op to extract tensors by name, here:

	results := detector.model.Exec([]tf.Output{
		detector.model.Op("predictions/concat", 0),
	}, map[tf.Output]*tf.Tensor{
		detector.model.Op("input_1", 0): input,
	})

Try moving out the 2 .Op invocations and then measure the performance, just like that:

	input, _ := tf.NewTensor(imageToArray(resized))
        modelOutput := detector.model.Op("predictions/concat", 0)
        modelInput := detector.model.Op("input_1", 0)

	if verbose {
		log.Printf("preprocess: %s", time.Since(tick))
		tick = time.Now()
	}

	results := detector.model.Exec([]tf.Output{
		modelOutput,
	}, map[tf.Output]*tf.Tensor{
		modelInput: input,
	})

	if verbose {
		log.Printf("inference: %s", time.Since(tick))
		tick = time.Now()
	}

Let me know if this will make the inference time similar

I changed it to:

	modelOutput := detector.model.Op("predictions/concat", 0)
	modelInput := detector.model.Op("input_1", 0)

	if verbose {
		log.Printf("load ops: %s", time.Since(tick))
		tick = time.Now()
	}

	results := detector.model.Exec([]tf.Output{
		modelOutput,
	}, map[tf.Output]*tf.Tensor{
		modelInput: input,
	})

	if verbose {
		log.Printf("inference: %s", time.Since(tick))
		tick = time.Now()
	}

The inference time looks similar, and over 10 runs, my "load ops" times were all between 20 and 35 µs.

Nice! We do have already found out the problem? :D

Sorry, that was poor wording. The inference time did not change much from what it was before -- between 2.4 and 2.9 sec over 10 runs.

Ah! I guessed that the go inference time looked similar to the python inference time. My bad.

We have to dig more into the code. Can you please share the code of the predict method used in preds = model.predict(inputs, batch_size=1, verbose=0) ?

We have to spot the differences between that code and the code of model.Exec

Uhm, I'm trying to figure out what the problem is but I can't see it.
The python code is just a standard python code, nothing special, the Go code is just this model.saved.Session.Run(feedDict, tensors, nil) that only uses operations from the github.com/tensorflow/tensorflow/tensorflow/go package.

Maybe we should open an issue on https://github.com/tensorflow/tensorflow/issues referencing this one? I guess the inference time difference is due to something they do and not related to tfgo.

Ok, no worries -- thank you for being another set of eyes! I'll poke around a bit more and open an issue there if I can't find anything.

Keep me up to date. Thank you!