https://github.com/hiveml/simple-ml-serving
This post goes over a quick and dirty way to deploy a trained machine learning model to production.
Read this if: You've successfully trained an ML model using a framework such as Tensorflow or Caffe that you would like to put up as a demo, preferably sooner rather than later, and you prefer lighter solutions rather than spinning up an entire tech stack.
Reading time: 10-15 mins
TL;DR: Read and understand the files in test.
- Check your tensorflow installation
- Run online classification from stdin
- Run online classification on localhost
- Put classifiers behind a hardcoded proxy
- Put classifiers behind a proxy with service discovery
- Call classifiers using a pseudo-DNS
When we first entered the machine learning space here at Hive, we had been doing manual image moderation for half a year, giving us millions of ground truth labeled images. This allowed us to train a state-of-the-art deep convolutional image classification model from scratch (i.e. randomized weights) in under a week, specialized for our use case. The more typical ML use case, though, is usually on the order of hundreds of images, for which I would recommend fine-tuning an existing model. For instance, https://www.tensorflow.org/tutorials/image_retraining has a great tutorial on how to fine-tune an Imagenet model (trained on 1.2M images, 1000 classes) to classify a sample dataset of flowers (3647 images, 5 classes).
For a quick tl;dr of the linked Tensorflow tutorial, after installing bazel and tensorflow, you would need to run the following code, which takes around 30 mins to build and 5 minutes to train:
(
cd "$HOME" && \
curl -O http://download.tensorflow.org/example_images/flower_photos.tgz && \
tar xzf flower_photos.tgz ;
) && \
bazel build tensorflow/examples/image_retraining:retrain \
tensorflow/examples/image_retraining:label_image \
&& \
bazel-bin/tensorflow/examples/image_retraining/retrain \
--image_dir "$HOME"/flower_photos \
--how_many_training_steps=200 \
&& \
bazel-bin/tensorflow/examples/image_retraining/label_image \
--graph=/tmp/output_graph.pb \
--labels=/tmp/output_labels.txt \
--output_layer=final_result:0 \
--image=$HOME/flower_photos/daisy/21652746_cc379e0eea_m.jpg
Alternatively, if you have Docker installed, you can use this prebuilt Docker image like so:
sudo docker run -it --net=host liubowei/simple-ml-serving:latest /bin/bash
>>> cat test.sh && bash test.sh
which puts you into an interactive shell inside the container and runs the above command; you can also follow along with the rest of this post inside the container if you wish.
Now, Tensorflow has saved the model information into /tmp/output_graph.pb and /tmp/output_labels.txt, which are passed above as command-line parameters to the label_image.py script. Google's image_recognition tutorial also links to another inference script, but we will be sticking with label_image.py for now.
If we just want to accept file names from standard input, one per line, we can do "online" inference quite easily:
while read line ; do
bazel-bin/tensorflow/examples/image_retraining/label_image \
--graph=/tmp/output_graph.pb --labels=/tmp/output_labels.txt \
--output_layer=final_result:0 \
--image="$line" ;
done
From a performance standpoint, though, this is terrible - we are reloading the neural net, the weights, the entire Tensorflow framework, and python itself, for every input example!
We can do better. Let's start by editing the label_image.py script -- for me, this is located in bazel-bin/tensorflow/examples/image_retraining/label_image.runfiles/org_tensorflow/tensorflow/examples/image_retraining/label_image.py.
Let's change the lines
141: run_graph(image_data, labels, FLAGS.input_layer, FLAGS.output_layer,
142:           FLAGS.num_top_predictions)
to
141: for line in sys.stdin:
142:   run_graph(load_image(line.strip()), labels, FLAGS.input_layer, FLAGS.output_layer,
143:             FLAGS.num_top_predictions)
This is indeed a lot faster, but this is still not the best we can do!
The reason is the with tf.Session() as sess construction on line 100. Tensorflow is essentially loading all the computation into memory every time run_graph is called. This becomes apparent once you start trying to do inference on the GPU -- you can see the GPU memory go up and down as Tensorflow loads and unloads the model parameters to and from the GPU. As far as I know, this construction is not present in other ML frameworks like Caffe or Pytorch.
The solution is then to pull the with statement out, and pass in a sess variable to run_graph:
def run_graph(image_data, labels, input_layer_name, output_layer_name,
              num_top_predictions, sess):
  # Feed the image_data as input to the graph.
  #   predictions will contain a two-dimensional array, where one
  #   dimension represents the input image count, and the other has
  #   predictions per class
  softmax_tensor = sess.graph.get_tensor_by_name(output_layer_name)
  predictions, = sess.run(softmax_tensor, {input_layer_name: image_data})
  # Sort to show labels in order of confidence
  top_k = predictions.argsort()[-num_top_predictions:][::-1]
  for node_id in top_k:
    human_string = labels[node_id]
    score = predictions[node_id]
    print('%s (score = %.5f)' % (human_string, score))
  # numpy floats are not JSON serializable, so convert them with .item()
  return [(labels[node_id], predictions[node_id].item()) for node_id in top_k]
...
with tf.Session() as sess:
  for line in sys.stdin:
    # strip the trailing newline so the file name resolves
    run_graph(load_image(line.strip()), labels, FLAGS.input_layer, FLAGS.output_layer,
              FLAGS.num_top_predictions, sess)
(see code at https://github.com/hiveml/simple-ml-serving/blob/master/label_image.py)
If you run this, you should find that it takes around 0.1 sec per image, quite fast enough for online use.
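To check the per-image figure on your own hardware, a small timing helper along these lines may be handy. This is only a sketch: time_per_call and fake_classify are hypothetical names that are not part of the repository, and in practice you would pass in a function that wraps run_graph with an open session.

```python
import time


def time_per_call(fn, inputs):
    """Return the mean wall-clock seconds per call of fn over inputs."""
    inputs = list(inputs)
    if not inputs:
        raise ValueError("need at least one input")
    start = time.perf_counter()
    for item in inputs:
        fn(item)
    return (time.perf_counter() - start) / len(inputs)


def fake_classify(path):
    # Hypothetical stand-in for a real classifier call (assumption).
    return [("daisy", 0.9)]


mean_seconds = time_per_call(fake_classify, ["a.jpg", "b.jpg", "c.jpg"])
print("%.6f sec per image" % mean_seconds)
```

Timing a whole batch and dividing by its size smooths out per-call noise, which matters when a single call takes only a fraction of a second.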
Caffe uses its net.forward code, which is very easy to put into a callable framework: see http://nbviewer.jupyter.org/github/BVLC/caffe/blob/master/examples/00-classification.ipynb
Mxnet also stands out: it actually has ready-to-go inference server code publicly available: https://github.com/awslabs/mxnet-model-server.
Further details coming soon!
The plan is to wrap this code in a Flask app and turn it into an HTTP microservice. If you haven't heard of it, Flask is a very lightweight Python web framework which allows you to spin up an HTTP API server with minimal work.
As a quick reference, here's a flask app that receives POST requests with multipart form data:
#!/usr/bin/env python
# usage: python echo.py to launch the server ; and then in another session, do
# curl -v -XPOST 127.0.0.1:12480 -F "data=@./image.jpg"
from flask import Flask, request
app = Flask(__name__)
@app.route('/', methods=['POST'])
def classify():
    try:
        data = request.files.get('data').read()
        print(repr(data)[:1000])  # print() form works on both Python 2 and 3
        return data, 200
    except Exception as e:
        return repr(e), 500
app.run(host='127.0.0.1',port=12480)
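If you would rather call the server from Python than from curl, something like the following should work. It is a minimal sketch using only the standard library; encode_multipart and classify_file are hypothetical names chosen for illustration, and the default URL assumes the server above is listening on port 12480.

```python
import uuid
import urllib.request


def encode_multipart(field_name, filename, payload):
    """Build a multipart/form-data body holding a single file field.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        "--%s\r\n"
        'Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ) % (boundary, field_name, filename)
    tail = "\r\n--%s--\r\n" % boundary
    body = head.encode("utf-8") + payload + tail.encode("utf-8")
    return body, "multipart/form-data; boundary=%s" % boundary


def classify_file(path, url="http://127.0.0.1:12480/"):
    """POST the file at path as the 'data' field and return the response text."""
    with open(path, "rb") as f:
        body, content_type = encode_multipart("data", "image.jpg", f.read())
    req = urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": content_type})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

With the server running, classify_file("./image.jpg") should return the same body that the curl command prints.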
And here is the corresponding flask app hooked up to run_graph above:
#!/usr/bin/env python
# usage: bash tf_classify_server.sh
from flask import Flask, request
import tensorflow as tf
import label_image as tf_classify
import json
app = Flask(__name__)
FLAGS, unparsed = tf_classify.parser.parse_known_args()
labels = tf_classify.load_labels(FLAGS.labels)
tf_classify.load_graph(FLAGS.graph)
sess = tf.Session()
@app.route('/', methods=['POST'])
def classify():
    try:
        data = request.files.get('data').read()
        result = tf_classify.run_graph(data, labels, FLAGS.input_layer, FLAGS.output_layer, FLAGS.num_top_predictions, sess)
        return json.dumps(result), 200
    except Exception as e:
        return repr(e), 500
app.run(host='127.0.0.1',port=12480)
This looks quite good, except for the fact that Flask and Tensorflow are both fully synchronous: Flask processes one request at a time in the order they are received, and Tensorflow fully occupies the thread when doing the image classification.
As it's written, the speed bottleneck is probably still in the actual computation work, so there's not much point upgrading the Flask wrapper code. And maybe this code is sufficient to handle your load, for now.
There are 2 obvious ways to scale up request throughput: scale up horizontally by increasing the number of workers, which is covered in the next section, or scale up vertically by utilizing a GPU and batching logic. Implementing the latter requires a webserver that is able to handle multiple pending requests at once, and decide whether to keep waiting for a larger batch or send it off to the Tensorflow graph thread to be classified, for which this Flask app is horrendously unsuited. Two possibilities are using Twisted + Klein for keeping code in Python, or Node.js + ZeroMQ if you prefer first class event loop support and the ability to hook into non-Python ML frameworks such as Torch.
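The "keep waiting or send it off" decision is the heart of the batching logic. Here is one way it could look, as a sketch rather than a production implementation: collect_batch, max_batch and max_wait are hypothetical names, and the example assumes pending requests arrive on a standard queue.Queue that some web-server thread fills.

```python
import queue
import time


def collect_batch(pending, max_batch=8, max_wait=0.01):
    """Block for the first item, then gather more until the batch is full
    or max_wait seconds have passed since the first item arrived."""
    batch = [pending.get()]  # wait indefinitely for the first request
    deadline = time.monotonic() + max_wait
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough, send what we have
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch
```

A classifier thread would call collect_batch in a loop and feed each returned batch to the graph in one go. A larger max_wait gives fuller batches at the cost of extra latency for the first request in each batch.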
OK, so now we have a single server serving our model, but maybe it's too slow or our load is getting too high. We'd like to spin up more of these servers - how can we distribute requests across each of them?
The ordinary method is to add a proxy layer, perhaps haproxy or nginx, which balances the load between the backend servers while presenting a single uniform interface to the client. For use later in this section, here is some sample code that runs a rudimentary Node.js load balancer http proxy:
// Usage : node basic_proxy.js WORKER_PORT_0,WORKER_PORT_1,...
const worker_ports = (process.argv[2] || '').split(',').filter(Boolean) // drop empty entries so a missing argument is detected
if (worker_ports.length === 0) { console.error('missing worker ports') ; process.exit(1) }
const proxy = require('http-proxy').createProxyServer({})
proxy.on('error', () => console.log('proxy error'))
let i = 0
require('http').createServer((req, res) => {
  proxy.web(req,res, {target: 'http://localhost:' + worker_ports[ (i++) % worker_ports.length ]})
}).listen(12480)
console.log(`Proxying localhost:${12480} to [${worker_ports.toString()}]`)
// spin up the ML workers
const { exec } = require('child_process')
worker_ports.map(port => exec(`/bin/bash ./tf_classify_server.sh ${port}`))
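The only balancing logic in the proxy above is the round-robin index (i++) % worker_ports.length. For readers more at home in Python, the same selection rule can be sketched as follows; round_robin and the port numbers are made up for illustration.

```python
import itertools


def round_robin(worker_ports):
    """Return an iterator that cycles through worker_ports in order, forever."""
    if not worker_ports:
        raise ValueError("missing worker ports")
    return itertools.cycle(worker_ports)


# Hypothetical worker ports (assumption, not taken from the post).
ports = round_robin([12482, 12483, 12484])
first_five = [next(ports) for _ in range(5)]
print(first_five)  # [12482, 12483, 12484, 12482, 12483]
```

Round robin ignores how busy each worker is, which is usually fine when every request costs about the same, as with fixed-size image classification.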
To automatically detect how many backend servers are up and where they are located, people generally use a "service discovery" tool, which may be bundled with the load balancer or be separate. Some well-known ones are Consul and Zookeeper. Setting up and learning how to use one is beyond the scope of this article, so I've included a very rudimentary proxy using the node.js service discovery package seaport.
Proxy code:
// Usage : node seaport_proxy.js
const seaportServer = require('seaport').createServer()
seaportServer.listen(12481)
const proxy = require('http-proxy').createProxyServer({})
proxy.on('error', () => console.log('proxy error'))
let i = 0
require('http').createServer((req, res) => {
  seaportServer.get('tf_classify_server', worker_ports => {
    const this_port = worker_ports[ (i++) % worker_ports.length ].port
    proxy.web(req,res, {target: 'http://localhost:' + this_port })
  })
}).listen(12480)
console.log(`Seaport proxy listening on ${12480} to '${'tf_classify_server'}' servers registered to ${12481}`)
Worker code:
// Usage : node tf_classify_server.js
const port = require('seaport').connect(12481).register('tf_classify_server')
console.log(`Launching tf classify worker on ${port}`)
require('child_process').exec(`/bin/bash ./tf_classify_server.sh ${port}`)
However, as applied to ML, this setup runs into a bandwidth problem.
At anywhere from tens to hundreds of images a second, the system becomes bottlenecked on network bandwidth. In the current setup, all the data has to go through our single seaport master, which is the single endpoint presented to the client.
To solve this, we need our clients to not hit the single endpoint at http://127.0.0.1:12480, but instead to automatically rotate between backend servers to hit. If you know some networking, this sounds precisely like a job for DNS!
However, setting up a custom DNS server is again beyond the scope of this article. Instead, by changing the clients to follow a 2-step "manual DNS" protocol, we can reuse our rudimentary seaport proxy to implement a "peer-to-peer" protocol in which clients connect directly to their servers:
Proxy code:
// Usage : node p2p_proxy.js
const seaportServer = require('seaport').createServer()
seaportServer.listen(12481)
let i = 0
require('http').createServer((req, res) => {
  seaportServer.get('tf_classify_server', worker_ports => {
    const this_port = worker_ports[ (i++) % worker_ports.length ].port
    res.end(`${this_port}\n`)
  })
}).listen(12480)
console.log(`P2P seaport proxy listening on ${12480} to '${'tf_classify_server'}' servers registered to ${12481}`)
(The worker code is the same as above.)
Client code:
curl -v -XPOST localhost:`curl localhost:12480` -F"data=@$HOME/flower_photos/daisy/21652746_cc379e0eea_m.jpg"
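The first step of the two-step protocol (ask the proxy for a port, then talk to that port directly) could look like this from a Python client. It is a sketch under the assumption that the lookup endpoint answers with a bare port number and a newline, as the p2p proxy above does; parse_port and lookup_worker_url are hypothetical helper names.

```python
import urllib.request


def parse_port(text):
    """Parse the lookup response body (a port number plus newline) into an int."""
    port = int(text.strip())
    if not 0 < port < 65536:
        raise ValueError("port out of range: %d" % port)
    return port


def lookup_worker_url(lookup_url="http://localhost:12480/"):
    """Step 1: ask the lookup endpoint which worker port to use."""
    with urllib.request.urlopen(lookup_url) as resp:
        port = parse_port(resp.read().decode("ascii"))
    # Step 2 is up to the caller: POST the image directly to this URL.
    return "http://localhost:%d/" % port
```

Validating the port before using it guards against a malformed reply, for example if the proxy responds before any worker has registered.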
It's possible to replace the Flask interface above with a ZeroMQ interface, turning this code into an RPC microservice. Further details and code snippets coming soon!
At this point, you should have something working in production, but it's certainly not futureproof. There are several important topics that were not covered in this guide:
- Automatically deploying and setting up on new hardware
  - Notable tools include Openstack/VMware if you're on your own hardware, Chef/Puppet for installing Docker and handling networking routes, and Docker for installing Tensorflow, Python, and everything else
  - Kubernetes or Marathon/Mesos are also great if you're on the cloud
- Model version management
  - Not too hard to handle this manually at first
  - Tensorflow Serving is a great tool that handles this, as well as batching and overall deployment, very thoroughly. The downsides are that it's a bit hard to set up and to write client code for, and in addition it doesn't support Caffe/PyTorch
- How to migrate your ML code off Matlab
  - Don't try to use Matlab in production. Just don't
- GPU drivers, CUDA, cuDNN
  - Use nvidia-docker and try to find some Dockerfiles online
  - There's also some work that goes into managing GPU resources, if you have more than one per box. Marathon/Mesos does this well, but at Hive we use a homebrewed tool that supports fractional GPUs
- Postprocessing
  - Generally you'll want a frontend to present the ML results, but it's also a good idea to have an intermediate postprocessing layer so that you can make slight tweaks to the model results or confidences without having to redeploy a second classifier.
  - Once you get a few different ML models in production, it's also common to mix and match them for different use cases -- run model A only if models B and C are both inconclusive, run model D in Caffe and pass the results to model E in Tensorflow, etc.
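The "run model A only if models B and C are both inconclusive" rule from the last bullet can be sketched as a small postprocessing function. Everything here is illustrative: is_conclusive, cascade and the 0.8 threshold are assumptions, and each model is taken to be a callable returning a list of (label, score) pairs like run_graph does.

```python
def is_conclusive(predictions, threshold=0.8):
    """True if the top score in a list of (label, score) pairs clears threshold."""
    return bool(predictions) and max(score for _, score in predictions) >= threshold


def cascade(image, model_a, model_b, model_c, threshold=0.8):
    """Try B, then C; fall back to A only if both are inconclusive."""
    for model in (model_b, model_c):
        result = model(image)
        if is_conclusive(result, threshold):
            return result
    return model_a(image)
```

Keeping this logic in its own layer means the threshold or the model order can change without retraining or redeploying any classifier.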