
Cloud TPUs

This repository is a collection of reference models and tools used with Cloud TPUs.

The fastest way to get started training a model on a Cloud TPU is by following the tutorial, which can be launched directly in Google Cloud Shell.

If you would like to make any fixes or improvements to the models, please submit a pull request!

ResNet-50 on TPU

Prerequisites

Set up a Google Cloud project

Follow the instructions in the Quickstart Guide to create a GCE VM with access to a Cloud TPU.

To run this model, you will need:

  • A GCE VM instance with an associated Cloud TPU resource
  • A GCS bucket to store your training checkpoints
  • (Optional) The ImageNet training and validation data, preprocessed into TFRecord format and stored in GCS.

Formatting the data

The data is expected to be in TFRecord format, as generated by this script.

If you do not have the ImageNet dataset prepared, you can use a randomly generated fake dataset to test the model. It is located at gs://cloud-tpu-test-datasets/fake_imagenet.

Training the model

Train the model by executing the following command (substituting the appropriate values):

python resnet_main.py \
  --tpu_name=$TPU_NAME \
  --data_dir=$DATA_DIR \
  --model_dir=$MODEL_DIR

If you are not running this script on a GCE VM in the same project and zone as your Cloud TPU, you will need to add the --project and --zone flags specifying the corresponding values for the Cloud TPU you'd like to use.

This will train a ResNet-50 model on ImageNet with a batch size of 1024 on a single Cloud TPU. With all flags at their defaults, the model should train to above 76% accuracy in around 17 hours (including evaluation time every --steps_per_eval steps).
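
The defaults above translate into step counts as follows. This is a back-of-the-envelope sketch assuming the standard ImageNet-1k training-set size of 1,281,167 images and the 90-epoch regime described later; the script's actual --train_steps default may be set slightly differently.

```python
# Rough step-count arithmetic for the default training run.
# Assumes the standard ImageNet-1k training set; check resnet_main.py
# for the exact defaults the code uses.

NUM_TRAIN_IMAGES = 1281167  # ImageNet-1k training set size
BATCH_SIZE = 1024           # default --train_batch_size
EPOCHS = 90                 # default training length

steps_per_epoch = NUM_TRAIN_IMAGES // BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS

print(steps_per_epoch)  # 1251
print(total_steps)      # 112590
```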

You can launch TensorBoard (e.g. tensorboard --logdir=$MODEL_DIR) to view loss curves and other metadata for your training run. (Note: if you launch TensorBoard on your VM, be sure to configure SSH port forwarding or the GCE firewall rules appropriately.)

Understanding the code

For more detailed information, read the documentation within each file.

  • imagenet_input.py: Constructs the tf.data.Dataset input pipeline which handles parsing, preprocessing, shuffling, and batching the data samples.
  • resnet_main.py: Main code which constructs the TPUEstimator and handles training and evaluating the model.
  • resnet_model.py: ResNet model code which constructs the network via modular residual blocks or bottleneck blocks.
  • resnet_preprocessing.py: Useful utilities for preprocessing and augmenting ImageNet data for ResNet training. Significantly improves final accuracy.

Additional notes

About the model and training regime

The model is based on the network architecture presented in Deep Residual Learning for Image Recognition by Kaiming He et al.

Specifically, the model uses post-activation residual units for ResNet-18 and 34, and post-activation bottleneck units for ResNet-50, 101, 152, and 200. There are a few differences in the model and training compared to the original paper:

  • The preprocessing and data augmentation are slightly different. In particular, normalization includes an additional step that rescales the inputs by the standard deviation of the dataset's RGB values.
  • We use a larger batch size of 1024 (by default) instead of 256 and linearly scale the learning rate. In addition, we adopt the learning rate schedule suggested by Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour and train for 90 epochs.
  • We use a slightly different weight initialization for batch normalization in the last batch norm per block, as inspired by the above paper.
  • Evaluation is performed on a single center crop of the validation set rather than a 10-crop from the original paper.
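
The learning-rate scheme from the second and third bullets can be sketched as follows. This is an illustrative reconstruction of the linear-scaling-plus-warmup schedule from the Goyal et al. paper, not a copy of the code; the function name and the exact warmup length and decay boundaries are assumptions, so check resnet_main.py for the real constants.

```python
# Hedged sketch of the learning-rate schedule described above:
# linearly scale the base rate with batch size, warm up gradually,
# then decay in steps. Constants here are the commonly used ones
# from Goyal et al. and may differ from resnet_main.py.

BASE_LR = 0.1        # reference rate for batch size 256 (He et al.)
BATCH_SIZE = 1024    # default --train_batch_size
WARMUP_EPOCHS = 5    # gradual warmup period (assumed)
DECAYS = [(30, 0.1), (60, 0.01), (80, 0.001)]  # (epoch, multiplier)

def learning_rate(epoch):
    """Learning rate at a (possibly fractional) epoch."""
    peak = BASE_LR * BATCH_SIZE / 256  # linear scaling rule
    if epoch < WARMUP_EPOCHS:
        return peak * epoch / WARMUP_EPOCHS  # ramp up from zero
    lr = peak
    for boundary, mult in DECAYS:
        if epoch >= boundary:
            lr = peak * mult  # step decay at each boundary
    return lr
```

For batch size 1024 this peaks at 0.4 (four times the batch-256 reference rate), matching the linear scaling rule.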

Training/evaluating/predicting on CPU/GPU

To run the same code on CPU/GPU, set the flag --use_tpu=False. This will use the default devices available to TensorFlow on your machine. The checkpoints created on CPU/GPU and TPU are identical, so it is possible to train on one type of device and then evaluate or predict with the trained model on a different one.

Serve the exported model on CPU/GPU

To serve the exported model on CPU, set the flag --data_format='channels_last', since inference on CPU supports only 'channels_last'. Inference on GPU supports both 'channels_first' and 'channels_last'.
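
To make the flag concrete, here is a small illustrative sketch (the function name is hypothetical) of the two tensor layouts the flag selects between: 'channels_first' is NCHW and 'channels_last' is NHWC, and a model exported for one layout expects inputs in that layout.

```python
# Illustrative sketch of what --data_format controls:
# 'channels_first' means NCHW layout, 'channels_last' means NHWC.

def nchw_to_nhwc_shape(shape):
    """Convert an (N, C, H, W) shape tuple to (N, H, W, C)."""
    n, c, h, w = shape
    return (n, h, w, c)

# A batch of 8 RGB ImageNet crops in each layout:
print(nchw_to_nhwc_shape((8, 3, 224, 224)))  # (8, 224, 224, 3)
```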

Using different ResNet configurations

The default ResNet-50 has been carefully tested with the default flags, but resnet_model.py includes a few other commonly used configurations, including ResNet-18, 34, 101, 152, and 200. The 18- and 34-layer configurations use residual blocks without bottlenecks, and the remaining configurations use bottleneck layers. The configuration is controlled via --resnet_size. Bigger models require more training time and more memory, so you may need to lower --train_batch_size to avoid running out of memory.
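
The configurations break down as in the sketch below. The block counts are the standard ones from He et al., and the dictionary and helper names here are illustrative, not the ones in resnet_model.py; consult that file for the table the code actually uses.

```python
# Standard ResNet configurations selectable via --resnet_size.
# Residual blocks contain 2 convolutions, bottleneck blocks 3;
# the initial 7x7 convolution and final fully connected layer
# account for the remaining 2 layers in the nominal depth.

CONFIGS = {
    18:  ('residual',   [2, 2, 2, 2]),
    34:  ('residual',   [3, 4, 6, 3]),
    50:  ('bottleneck', [3, 4, 6, 3]),
    101: ('bottleneck', [3, 4, 23, 3]),
    152: ('bottleneck', [3, 8, 36, 3]),
    200: ('bottleneck', [3, 24, 36, 3]),
}

def depth(resnet_size):
    """Recompute the nominal layer count from the block table."""
    block_type, counts = CONFIGS[resnet_size]
    convs_per_block = 2 if block_type == 'residual' else 3
    return convs_per_block * sum(counts) + 2

# Sanity check: each table entry reproduces its nominal depth.
assert all(depth(size) == size for size in CONFIGS)
```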

Using your own data

To use your own data with this model, you first need to write an input pipeline similar to imagenet_input.py. It is recommended to store your data on disk in TFRecord format (see the ImageNet dataset download script for details) and to build the actual pipeline with tf.data.Dataset. Then simply replace the current imagenet_input in resnet_main.py and adjust the dataset constants.
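
Conceptually, the stages your pipeline needs to mirror from imagenet_input.py are: parse each serialized record, preprocess it, shuffle, and batch. The plain-Python generator below is only a conceptual sketch of those stages (its name and signature are invented for illustration); the real pipeline implements them with tf.data.Dataset for performance, using a streaming shuffle buffer rather than an in-memory shuffle.

```python
# Conceptual sketch of the parse -> preprocess -> shuffle -> batch
# stages an input pipeline needs. Illustration only: the real code
# builds these stages as tf.data.Dataset transformations.

import random

def input_pipeline(records, parse, preprocess, batch_size, seed=0):
    """Yield shuffled batches of preprocessed examples."""
    examples = [preprocess(parse(r)) for r in records]
    # tf.data would use a bounded shuffle buffer instead of this.
    random.Random(seed).shuffle(examples)
    for i in range(0, len(examples) - batch_size + 1, batch_size):
        yield examples[i:i + batch_size]
```

For ImageNet, `parse` would decode a serialized tf.train.Example and `preprocess` would apply the augmentations in resnet_preprocessing.py.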

Benchmarking the training speed

Benchmarking code for DAWNBench can be found under the benchmark/ subdirectory. The benchmarking code imports the same models, inputs, and training regimes but includes some extra checkpointing and evaluation.