4 Simple Steps for running Distributed TensorFlow on Kubernetes using ACS-Engine

1. Create a Kubernetes cluster

Follow the instructions here to create a Kubernetes cluster using acs-engine.
Note that you can also use Azure Container Service (ACS) to create a Kubernetes cluster, but ACS does not yet support heterogeneous agent pools. So if you need a cluster with VMs of different sizes, e.g. some with CPUs only and some with GPUs, you'll need to use acs-engine for now.
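
The relevant part of the acs-engine cluster definition is the agentPoolProfiles section, where you can declare pools with different VM sizes. Below is a minimal, hedged sketch of just that section; the pool names, counts, and VM sizes are placeholders to adapt to your workload, and the rest of the cluster definition (masterProfile, linuxProfile, servicePrincipalProfile, and so on) is omitted here:

    "agentPoolProfiles": [
      {
        "name": "cpupool",
        "count": 2,
        "vmSize": "Standard_D2_v2"
      },
      {
        "name": "gpupool",
        "count": 2,
        "vmSize": "Standard_NC6"
      }
    ]

Here "cpupool" would hold the CPU-only parameter servers and "gpupool" the GPU workers; pick the VM sizes that match the GPUs you need.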

2. Set up GPU drivers on the agent host VMs with GPUs

Execute the scripts here on each agent host VM with GPUs.
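
The exact driver installation steps vary with the VM image and CUDA version, so follow the linked scripts for your setup. As a quick sanity check afterwards, you can SSH into each GPU agent VM and confirm the driver is loaded and the GPUs are visible (the username and host below are placeholders for your cluster):

    # SSH to a GPU agent host VM (placeholder username and address)
    ssh azureuser@<gpu-agent-vm-ip>

    # Confirm the NVIDIA driver is installed and the GPUs are visible
    nvidia-smi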

3. Create a set of pods for distributed TensorFlow

For example, to create 2 parameter servers and 2 workers, use these sample yaml files to create the pods and services in your cluster. Note that you should mount Azure Files as a persistent volume (PV) to provide central storage for saving model checkpoints, etc. (For more information on using Azure Files with Kubernetes, see https://docs.microsoft.com/en-us/azure/aks/azure-files.)
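
As a rough illustration of what each of those yaml files contains, here is a hedged sketch of one worker pod and its service. The container image, the Azure Files share and secret names, and the GPU resource key are all assumptions to adapt to your cluster (older clusters expose GPUs as alpha.kubernetes.io/nvidia-gpu, newer ones as nvidia.com/gpu), and the sketch omits details such as mounting the NVIDIA driver libraries from the host:

    # Hypothetical worker pod; names and image are placeholders
    apiVersion: v1
    kind: Pod
    metadata:
      name: tf-worker0
      labels:
        app: tf-worker0
    spec:
      containers:
      - name: tf-worker0
        image: tensorflow/tensorflow:latest-gpu    # placeholder GPU-enabled image
        ports:
        - containerPort: 2222                      # TensorFlow gRPC port
        - containerPort: 6006                      # TensorBoard port
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 4      # adjust to the GPU resource name your cluster exposes
        volumeMounts:
        - name: azurefiles
          mountPath: /imagenetdata                 # matches --data_dir in the training commands below
      volumes:
      - name: azurefiles
        azureFile:
          secretName: azure-secret                 # secret holding the storage account name and key
          shareName: imagenetdata                  # placeholder Azure Files share
          readOnly: false
    ---
    # Service exposing the worker's TensorFlow and TensorBoard ports
    apiVersion: v1
    kind: Service
    metadata:
      name: tf-worker0
    spec:
      selector:
        app: tf-worker0
      ports:
      - name: grpc
        port: 2222
      - name: tensorboard
        port: 6006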

Once done, run kubectl get pods, and you should see 4 pods returned. Then run kubectl get svc to get the service IP address of each ps/worker service, as shown below.

    NAME         CLUSTER-IP     PORT(S)
    tf-ps0       10.0.211.196   6006/TCP,2222/TCP
    tf-ps1       10.0.81.168    6006/TCP,2222/TCP
    tf-worker0   10.0.221.23    6006/TCP,2222/TCP
    tf-worker1   10.0.118.248   6006/TCP,2222/TCP

4. Run a distributed TensorFlow training job

For example, to run a distributed TensorFlow training job using 1 parameter server and 2 workers, execute the following commands in the order listed below.
The sample TensorFlow script (tf_cnn_benchmarks.py) used below can be downloaded from here.
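
To run these commands, you can open an interactive shell in each pod with kubectl exec. The pod names below are assumptions; they may differ from the service names shown earlier, so check kubectl get pods first:

    # Open a shell in each pod (pod names assumed; verify with kubectl get pods)
    kubectl exec -it tf-ps0 -- /bin/bash
    kubectl exec -it tf-worker0 -- /bin/bash
    kubectl exec -it tf-worker1 -- /bin/bash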

a) On the parameter server pod, run the following command:

python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=4 \
--batch_size=128 --model=googlenet --variable_update=parameter_server --num_batches=200 --cross_replica_sync=False \
--data_name=imagenet --data_dir=/imagenetdata --job_name=ps --ps_hosts=10.0.211.196:2222 \
--worker_hosts=10.0.221.23:2222,10.0.118.248:2222 --task_index=0

b) On the worker0 pod, run the following command:

python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=4 \
--batch_size=128 --model=googlenet --variable_update=parameter_server --num_batches=200 --cross_replica_sync=False \
--data_name=imagenet --data_dir=/imagenetdata --job_name=worker --ps_hosts=10.0.211.196:2222 \
--worker_hosts=10.0.221.23:2222,10.0.118.248:2222 --task_index=0

c) On the worker1 pod, run the following command:

python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=4 \
--batch_size=128 --model=googlenet --variable_update=parameter_server --num_batches=200 --cross_replica_sync=False \
--data_name=imagenet --data_dir=/imagenetdata --job_name=worker --ps_hosts=10.0.211.196:2222 \
--worker_hosts=10.0.221.23:2222,10.0.118.248:2222 --task_index=1

If you need a solution that covers and automates the setup steps above, check out Deep Learning Workspace (DLWorkspace) from Microsoft Research, an open source toolkit that makes it easy to run deep learning workloads on Kubernetes.

Deep Learning Workspace's alpha release is available at https://github.com/microsoft/DLWorkspace/, with documentation at https://microsoft.github.io/DLWorkspace/