- 1. Getting started
- 2. Experimental assumptions
- 3. Characterization
- 4. Handy Nvidia GPU commands
- 5. Notes
- Set up CUDA on cloud machines (see Command reference)
- [Optional] If on CloudLab, set up storage (see Command reference)
- Run `make` to compile the CUDA files
- Run `gen_data 25536` to generate matrices for SGEMM
- Standalone: run `./sgemm 25536 1 0`, or use one of the `run*.sh` scripts to orchestrate parallel runs
- For simultaneous runs, set up the noise workloads as outlined below and use the launch scripts (a sketch of a parallel launch follows this list)
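A minimal sketch of what such a parallel launch could look like (the repo's `run*.sh` scripts are authoritative; the GPU IDs and matrix dimension below are illustrative only):

```bash
#!/usr/bin/env bash
# Launch one sgemm instance per GPU in the background, then wait for all of them.
for gpu in 0 1 2 3; do
    ./sgemm 25536 1 "$gpu" > "sgemm_gpu${gpu}.log" 2>&1 &
done
wait
```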
- We designed the experiment space to account for two things: statistical variation (averaged out by taking the mean across runs) and power variation (the actual variation in system parameters that we are trying to observe)
- Run nvprof on all GPUs to ensure the same overheads across GPUs. Note that nvprof continuous collection does not capture kernel details for kernels running on other GPUs, but it still captures temperature, frequency, and voltage on those GPUs
- The V100 does not support disabling auto-boost. We observed power overshoot above 300 W, up to 325 W, in many cases.
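As a quick check (flag support varies by driver and GPU generation), one can attempt to disable auto-boost and sample power draw against the board limit to observe the overshoot:

```bash
# On a V100 this reports that the setting is not supported.
sudo nvidia-smi -i 0 --auto-boost-default=0

# Sample power draw vs. the board power limit once per second (Ctrl-C to stop).
nvidia-smi -i 0 --query-gpu=timestamp,power.draw,power.limit --format=csv -l 1
```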
A CUDA version of an SGEMM kernel is included in this repository
Compile sgemm and gen_data using:
make
Before you run the kernel you need to generate the data using the following:
gen_data <square matrix dimension>
The compiled binary can be run from the command line as follows:
sgemm <square matrix dimension> <number of repetitions> <target GPU Id>
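For example, assuming the compiled binaries are in the current directory, the following generates 25536x25536 matrices and runs the kernel once on GPU 0:

```bash
./gen_data 25536     # generate the input matrices
./sgemm 25536 1 0    # dimension 25536, 1 repetition, GPU 0
```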
Profiling:
- On V100, where nvprof is supported, to collect system metrics:
nvprof --print-gpu-trace --event-collection-mode continuous --system-profiling on --kernel-latency-timestamps on --csv --log-file sgemm_test.csv --device-buffer-size 128 --continuous-sampling-interval 1 -f ./sgemm 2 1 0
- To get metrics such as utilization for the sgemm kernel, do not use `--event-collection-mode`:
sudo -E env PATH=$PATH nvprof --print-gpu-trace --kernel-latency-timestamps on --device-buffer-size 128 --continuous-sampling-interval 1 --metrics sm_efficiency,achieved_occupancy,sysmem_utilization -f ./sgemm 2 1 0
- On Ampere/Turing, where nvprof is not directly supported, use the nvprof compatibility mode of Nsight Systems:
sudo -E env PATH=$PATH nsys nvprof --profile-from-start off --log-file test ./sgemm 2 1 0
Insights:
- Choosing 25536 results in maximum compute utilization on the V100. Use profiling to ensure that this is tuned for the specific GPU (a sweep sketch follows)
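A minimal sketch of such a sweep, reusing the metrics invocation shown above; the candidate dimensions here are illustrative only:

```bash
# Sweep candidate matrix dimensions and record SM efficiency / achieved occupancy
# for each, so the dimension with the best compute utilization can be picked per GPU.
for dim in 8192 16384 25536 32768; do
    ./gen_data "$dim"
    sudo -E env PATH=$PATH nvprof --metrics sm_efficiency,achieved_occupancy \
        --log-file "sweep_${dim}.log" -f ./sgemm "$dim" 1 0
done
```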
Reasoning behind the choice: vision models (compute-intensive) and language models (memory-intensive, DistributedDataParallel)
Implementation used: NVIDIA DL Examples ResNet-50 v1.5
- Tiny ImageNet dataset
wget https://image-net.org/data/tiny-imagenet-200.zip
- ImageNet dataset
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar
- Extract and pre-process with the commands in the repo. Use tmux and run docker without `--rm` to process the data, then copy it out of the container. The processed data can then be used when launching the container for training
- Launch docker with:
nvidia-docker run -it -v /store/DeepLearningExamples/PyTorch/Classification/imagenet:/imagenet --ipc=host nvidia_resnet50
- Run training
python ./main.py --arch resnet50 --data-backend pytorch --label-smoothing 0.1 /imagenet
- Run training with nvprof
nvprof --print-gpu-trace --event-collection-mode continuous --system-profiling on --kernel-latency-timestamps on --csv --log-file resnet.csv --device-buffer-size 128 --continuous-sampling-interval 1 -f python ./main.py --arch resnet50 --data-backend pytorch --batch-size 128 --epochs 1 --label-smoothing 0.1 /imagenet
- Modified version of the ResNet script that allows selecting which GPU to run on (an alternative using `CUDA_VISIBLE_DEVICES` is sketched after the command)
nvprof --print-gpu-trace --event-collection-mode continuous --system-profiling on --kernel-latency-timestamps on --csv --log-file resnet.csv --device-buffer-size 128 --continuous-sampling-interval 1 -f python ./main.py --arch resnet50 --data-backend pytorch --batch-size 128 --epochs 2 --gpu_core 2 --label-smoothing 0.1 /imagenet
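If the modified script is not available, a common alternative (standard CUDA behavior, not specific to this repo) is to restrict the visible devices before launching:

```bash
# Expose only GPU 2 to the process; inside the process it appears as device 0.
CUDA_VISIBLE_DEVICES=2 python ./main.py --arch resnet50 --data-backend pytorch \
    --batch-size 128 --epochs 2 --label-smoothing 0.1 /imagenet
```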
- Launch docker with
nvidia-docker run -it --ipc=host -v /store/DeepLearningExamples/PyTorch/Translation/Transformer/data/data/wmt14_en_de_joined_dict:/data/wmt14_en_de_joined_dict your.repository:transformer bash
- Run preprocessing
scripts/run_preprocessing.sh
- Run training
scripts/run_training.sh
TBD
- Kill processes on GPUs 0-3
sudo fuser -k /dev/nvidia0 /dev/nvidia1 /dev/nvidia2 /dev/nvidia3
- nvidia-smi in continuous monitoring mode
watch -n 1 nvidia-smi
- Querying stats from nvidia-smi
nvidia-smi --format=csv --query-gpu=power.min_limit
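The same interface can log several fields continuously; for example (all field names below are standard nvidia-smi query properties):

```bash
# Log key per-GPU stats once per second to a CSV file (Ctrl-C to stop).
nvidia-smi --query-gpu=timestamp,index,power.draw,temperature.gpu,clocks.sm,utilization.gpu \
    --format=csv -l 1 > gpu_stats.csv
```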
- Achieve fairness and predictable performance in large-cluster deployments for ML training, but not microsecond-level predictability; variations are natural in interactive systems.
- If GPU0 is most frequently used and is susceptible to wear, can I save money on large scale training by just asking for the 4th GPU always? Wear levelling for GPUs. Idle power example:
- Everyone wants 100% utilization and 300W of power, Rowhammer for GPUs, make surrounding GPUs slower?
- Second order effects: Temperature, cooling, placement in cluster
- Selecting a GPU based on utilization from the command line: https://stackoverflow.com/questions/39649102/how-do-i-select-which-gpu-to-run-a-job-on
- Gavel reference: ![1](images/2022-04-07-10-40-45.png)
- Exploiting scale in both training data and model size has been central to the success of deep learning
- A model for prediction of variability?
- Continuous mode
- SMI query reference for persistence and other settings
- Any settings below for clocks and power get reset between program runs unless you enable persistence mode (PM) for the driver.
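For reference, a hedged set of the relevant nvidia-smi commands (the clock and power values below are examples only, not recommendations):

```bash
# Enable persistence mode so clock/power settings survive between program runs.
sudo nvidia-smi -pm 1

# Example: set a 250 W power limit and lock the SM clock range on GPU 0.
sudo nvidia-smi -i 0 -pl 250
sudo nvidia-smi -i 0 -lgc 1380,1380

# Reset the locked clocks when done.
sudo nvidia-smi -i 0 -rgc
```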