
FastTFWorkflow

A tutorial on how to make your slow TensorFlow training faster.

Description

THIS CODE ONLY WORKS ON NVIDIA GPUS

Assuming the dataset is effectively unbounded, inline preprocessing on the CPU can become a bottleneck that reduces training throughput.
These code samples show an unoptimized and an optimized TensorFlow workflow.
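
For illustration only, the plain tf.data sketch below shows the kind of inline preprocessing that stalls the GPU and the usual parallel-map/prefetch fix; the file pattern, image size, and batch size are placeholders, and the repo's notebooks use NVIDIA DALI instead (see below).

    import tensorflow as tf

    def preprocess(path):
        # Decode and resize on the CPU for every sample.
        image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
        return tf.image.resize(image, (224, 224)) / 255.0

    files = tf.data.Dataset.list_files("data/*.jpg")  # placeholder file pattern

    # Unoptimized: preprocessing runs inline and sequentially,
    # so the GPU idles while the CPU prepares the next batch.
    slow_ds = files.map(preprocess).batch(64)

    # Better: parallelize the map and prefetch to overlap CPU work with training.
    fast_ds = (files
               .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
               .batch(64)
               .prefetch(tf.data.AUTOTUNE))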

Requirements

Hardware Requirements

  • x86-64 (AMD64) CPU
  • RAM >= 8GiB
  • NVIDIA GPUs with Compute Capability 7.0 or higher
    • GPU memory > 12 GiB for the default batch size

Test Environment

  • CPU : Intel(R) Xeon(R) Gold 5218R
  • GPU : 2x A100 80GB PCI-E
  • RAM : 255GiB

Optimizations used in this repo

  1. NVIDIA DALI - GPU-accelerated data loading (see the DALI sketch after this list)
  2. Mixed precision - better MMA (matrix multiply-accumulate) throughput than TF32 (see the Keras sketch after this list)
  3. XLA - JIT-compiles and fuses operators for more efficient scheduling on GPUs (also in the Keras sketch below)
  4. (Optional) Multi-GPU training - use more than one GPU for training (see "For advanced users" below)
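
A minimal DALI sketch, assuming an image-folder dataset and the nvidia-dali TensorFlow plugin; the directory, image size, and batch size are placeholders and may differ from the repo's notebooks.

    import tensorflow as tf
    from nvidia.dali import pipeline_def, fn, types
    import nvidia.dali.plugin.tf as dali_tf

    BATCH = 64  # placeholder batch size

    @pipeline_def(batch_size=BATCH, num_threads=4, device_id=0)
    def image_pipeline(file_root):
        # Read JPEGs from an image-folder layout and decode them on the GPU
        # ("mixed" = CPU bitstream parsing + GPU decoding).
        jpegs, labels = fn.readers.file(file_root=file_root, random_shuffle=True)
        images = fn.decoders.image(jpegs, device="mixed")
        images = fn.resize(images, resize_x=224, resize_y=224)
        images = fn.crop_mirror_normalize(images,
                                          dtype=types.FLOAT,
                                          output_layout="HWC",
                                          mean=[0.0, 0.0, 0.0],
                                          std=[255.0, 255.0, 255.0])
        return images, labels.gpu()

    # Expose the DALI pipeline to Keras as a tf.data.Dataset of GPU-resident batches.
    with tf.device("/gpu:0"):
        dataset = dali_tf.DALIDataset(
            pipeline=image_pipeline("data/images"),  # placeholder directory
            batch_size=BATCH,
            output_dtypes=(tf.float32, tf.int32),
            device_id=0)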
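
Mixed precision and XLA are essentially one-line switches in Keras. A minimal sketch with a placeholder model (the repo's notebooks will differ):

    import tensorflow as tf

    # Mixed precision: compute in float16 on Tensor Cores while keeping
    # variables in float32 for numerical stability.
    tf.keras.mixed_precision.set_global_policy("mixed_float16")

    model = tf.keras.Sequential([  # placeholder model
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        # Keep the final softmax in float32 so the loss is computed safely.
        tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
    ])

    # XLA: jit_compile=True compiles the train step and fuses its ops
    # into fewer, larger GPU kernels.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"],
                  jit_compile=True)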

Usage

  1. Clone this repo with its submodules
    git clone --recursive https://github.com/ReturnToFirst/FastTFWorkflow.git
    
    
  2. Compare the performance of the unoptimized and optimized workflows

For advanced users

after_optimization_multi.ipynb shows the training process with multiple GPUs.
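
For reference, multi-GPU data parallelism in TensorFlow is typically done with tf.distribute.MirroredStrategy; a minimal sketch with a placeholder model (the notebook may differ in details):

    import tensorflow as tf

    # One replica of the model per visible GPU; gradients are all-reduced
    # across replicas automatically during fit().
    strategy = tf.distribute.MirroredStrategy()
    print("Number of replicas:", strategy.num_replicas_in_sync)

    with strategy.scope():
        model = tf.keras.Sequential([  # placeholder model
            tf.keras.layers.Flatten(input_shape=(224, 224, 3)),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    # model.fit(dataset)  # each global batch is split across the GPUs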

DISCLAIMER

Depending on the devices in your computer, performance may be lower.
This optimized code may not show the best possible performance.
Multi-GPU training does not work in the test environment.
Some descriptions or code may be incorrect.