
High Accuracy Low Precision Training (HALP)

HALP is a PyTorch-based simulator for the HALP (High Accuracy Low Precision) training algorithm. HALP is a low-precision stochastic gradient descent variant that uses entirely low-precision computation in its inner loop while infrequently recentering this computation with higher-precision computation done in an outer loop. HALP rests on two key components: (1) a known variance reduction method, the stochastic variance-reduced gradient (SVRG); and (2) a novel bit centering technique that uses infrequent high-precision computation to reduce quantization noise. The simulator is built on the IEEE float16 tensors and arithmetic provided by PyTorch. This implementation can be used to replicate our experimental results on multiple models, including logistic regression, LeNet, LSTM, and ResNet.
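
For intuition, the sketch below shows the bit-centered SVRG structure that HALP is built around, written in plain PyTorch on a toy least-squares problem. This is a simplified illustration of ours, not the simulator's implementation: the variable names are ours, and the dynamic rescaling of the low-precision representation that full HALP performs is omitted.

import torch

# Toy bit-centered SVRG (HALP-style) sketch on a least-squares problem.
# Simplified illustration: real HALP also rescales the low-precision
# representation based on the full-gradient norm, which is omitted here.
torch.manual_seed(0)
n, d = 512, 10
X = torch.randn(n, d)
y = X @ torch.randn(d) + 0.01 * torch.randn(n)

def grad(w, idx):
    # gradient of 0.5 * ||X[idx] @ w - y[idx]||^2 averaged over the minibatch
    Xi, yi = X[idx], y[idx]
    return Xi.t() @ (Xi @ w - yi) / len(idx)

w = torch.zeros(d)                            # high-precision offset (outer loop)
alpha, T, n_outer = 0.05, 200, 10
for _ in range(n_outer):
    full_grad = grad(w, torch.arange(n))      # infrequent high-precision full gradient
    z = torch.zeros(d, dtype=torch.float16)   # low-precision delta; effective weights = w + z
    for _ in range(T):
        i = torch.randint(n, (32,))
        # SVRG variance-reduced gradient, centered around the offset w
        g = grad(w + z.float(), i) - grad(w, i) + full_grad
        z = (z.float() - alpha * g).half()    # inner-loop state stays in float16
    w = w + z.float()                         # recenter: fold the delta back into the offset
    print(f"loss: {0.5 * ((X @ w - y) ** 2).mean().item():.6f}")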

Content

  • Setup instructions
  • Command guidelines
  • Acknowledgements

Setup instructions

  • Create a conda Python 3.6 environment: conda create -n <name of the environment> python=3.6
  • Install PyTorch. Our implementation is tested with PyTorch 0.4.1 using CUDA 9.0 and torchvision 0.2.1.
pip install https://download.pytorch.org/whl/cu90/torch-0.4.1-cp36-cp36m-linux_x86_64.whl
pip install torchvision
  • Install NLTK 3.3 to support data processing for the LSTM experiment: conda install -c anaconda nltk
  • Clone the HALP repo
git clone https://github.com/HazyResearch/halp.git
  • Set up the HALP module for Python
pip install -e halp
export PYTHONPATH="$PYTHONPATH:<path to current directory>/halp"
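
After the install, a quick sanity check (our own snippet, not part of the repo) confirms that the GPU float16 path the simulator relies on is available:

import torch

# Environment sanity check: the simulator relies on GPU float16 tensors,
# so verify that CUDA and half-precision arithmetic work before running.
print("torch version:", torch.__version__)    # expect 0.4.1
assert torch.cuda.is_available(), "the float16 experiments require a CUDA GPU"
x = torch.randn(4, 4, device="cuda").half()   # IEEE float16 tensor on the GPU
y = x @ x                                     # float16 matmul
print("float16 matmul OK, dtype =", y.dtype)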

Command guidelines

  • Key arguments

    • Specify the model and dataset
    The dataset and the model are specified via the --dataset and --model arguments.
    Our simulator currently supports:
    
    * Logistic regression with MNIST dataset (--model=logreg --dataset=mnist)
    * LeNet with CIFAR10 dataset (--model=lenet --dataset=cifar10)
    * LSTM with CONLL2000 dataset (--model=lstm --dataset=conll2000)
    * ResNet with CIFAR10 dataset (--model=resnet --dataset=cifar10)
    
    • Specify training algorithm
    The HALP simulator currently supports:
    * IEEE float32 SGD:
      --solver=sgd --rounding=void
    * IEEE float16 SGD:
      --solver=lp-sgd --rounding=near
    * IEEE float32 SVRG:
      --solver=svrg --rounding=void -T=<# of steps between each full gradient compute>
    * IEEE float16 SVRG:
      --solver=lp-svrg --rounding=near -T=<# of steps between each full gradient compute>
    * IEEE float16 HALP:
      --solver=bc-svrg --rounding=near -T=<# of steps between each full gradient compute>.
      HALP can optionally use --on-site-compute. This mode avoids caching the bit-centering offset
      activation / gradient tensors for the whole dataset; instead, it computes the offsets only
      when they are needed. This saves host memory in our simulator for large models.
    
    • Miscellaneous arguments
    * --cuda: this must be specified, as IEEE float16 arithmetic is only supported on GPU in PyTorch
    * --alpha, --momentum: the learning rate and momentum values
    * --reg: strength of L2 regularization
    * --n-classes: # of classes for the classification problem
    * --seed: the random seed
    * --batch-size: the minibatch size
    * --n-epochs: the total number of epochs for training
    
  • Example runs

    We present the commands for several configurations as examples; a short sketch after the examples shows how the -T values used here relate to one epoch of minibatches.

    • Logistic regression MNIST experiment:
    (IEEE float16 HALP) cd ./exp_script && python run_models.py --n-epochs=100 --batch-size=100 --reg=1e-5 --alpha=0.05 --momentum=0.9 --seed=1  --n-classes=10  --solver=bc-svrg  --rounding=near  -T=600  --dataset=mnist  --model=logreg  --cuda
    
    (IEEE float32 SGD) cd ./exp_script && python run_models.py --n-epochs=100 --batch-size=100 --reg=1e-5 --alpha=0.01 --momentum=0.9 --seed=1  --n-classes=10  --solver=sgd  --rounding=void --dataset=mnist  --model=logreg  --cuda
    
    • LeNet CIFAR10 experiment:
    (IEEE float16 HALP) cd ./exp_script && python run_models.py --n-epochs=100 --batch-size=128 --reg=0.0005 --alpha=0.001 --momentum=0.9 --seed=1  --n-classes=10  --solver=bc-svrg  --rounding=near  -T=391  --dataset=cifar10  --model=lenet  --cuda
    
    (IEEE float32 SVRG) cd ./exp_script && python run_models.py --n-epochs=100 --batch-size=128 --reg=0.0005 --alpha=0.001 --momentum=0.9 --seed=1  --n-classes=10  --solver=svrg  --rounding=void  -T=391  --dataset=cifar10  --model=lenet  --cuda
    
    • LSTM CONLL2000 experiment:
    (Pre-process CONLL2000 tagging data) 
    mkdir datasets
    python ./utils/postag_data_utils.py
    
    (IEEE float16 HALP) cd ./exp_script && python run_models.py --n-epochs=100 --batch-size=16 --reg=0.0 --alpha=0.5 --momentum=0.0 --seed=3  --n-classes=12  --solver=bc-svrg  --rounding=near  -T=279  --dataset=conll2000  --model=lstm  --cuda  --on-site-compute
    
    (IEEE float16 SGD) cd ./exp_script && python run_models.py --n-epochs=100 --batch-size=16 --reg=0.0 --alpha=5.0 --momentum=0.0 --seed=1  --n-classes=12  --solver=lp-sgd  --rounding=near --dataset=conll2000  --model=lstm  --cuda  --on-site-compute
    
    • ResNet CIFAR10 fine-tuning experiment:
    (IEEE float16 SGD model checkpoint collection) cd ./exp_script && python run_models.py --n-epochs=350 --batch-size=128 --reg=0.0005 --alpha=0.1 --momentum=0.9 --seed=1  --n-classes=10  --solver=lp-sgd  --rounding=near  -T=391  --dataset=cifar10  --model=resnet  --cuda  --resnet-save-ckpt  --resnet-save-ckpt-path=<folder path to save checkpoint>
    
    (IEEE float16 HALP warm start tuning run) cd ./exp_script && python run_models.py --n-epochs=100 --batch-size=128 --reg=0.0005 --alpha=0.1 --momentum=0.0 --seed=1  --n-classes=10  --solver=bc-svrg  --rounding=near  -T=391  --dataset=cifar10  --model=resnet  --cuda  --on-site-compute --resnet-load-ckpt  --resnet-save-ckpt-path=<path to the saved model checkpoint> --resnet-load-ckpt-epoch-id=300
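
As a side note, the -T values in the examples above appear to equal the number of minibatches in one epoch for each dataset and batch size; this is our reading of the examples, not a documented rule. A quick check of the arithmetic:

import math

# Observation from the example commands: -T seems to be set to the number
# of minibatches per epoch for the given dataset and batch size.
datasets = {
    "mnist / logreg":  (60000, 100),   # 60k training examples, batch size 100
    "cifar10 / lenet": (50000, 128),   # 50k training examples, batch size 128
}
for name, (n_train, batch_size) in datasets.items():
    print(name, "-> T =", math.ceil(n_train / batch_size))
# mnist / logreg -> T = 600
# cifar10 / lenet -> T = 391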
    

Acknowledgements

We thank Nimit Sohoni, Paroma Varma, Albert Gu, Tri Dao, and Charles Kuang for helpful discussions.