GG-Training-Trick

General Training

Training AI models at a large scale

GETTING STARTED WITH FULLY SHARDED DATA PARALLEL (FSDP)
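The tutorial covers the full setup; as a quick orientation, here is a minimal sketch of the wrapping step, assuming the script is launched with `torchrun` and uses NCCL. The model and hyperparameters are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
# FSDP shards parameters, gradients, and optimizer state across ranks.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```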

Bottleneck Issue

A goldmine about training bottlenecks from the PyTorch discussion forum

WebDataset: Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs
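A small sketch of how WebDataset is typically used: stream samples from sharded tar files instead of millions of small files. The shard pattern and the "jpg"/"cls" key names are placeholders for whatever the tar archives actually contain.

```python
import torch
import webdataset as wds

shards = "data/shard-{000000..000099}.tar"  # hypothetical shard pattern
dataset = (
    wds.WebDataset(shards)
    .shuffle(1000)            # shuffle within a sample buffer
    .decode("torchrgb")       # decode images to CHW float tensors
    .to_tuple("jpg", "cls")   # pick the image and label entries
)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, num_workers=4)
```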

A Reddit post about training bottlenecks

Batched data augmentations using Kornia, since PyTorch doesn't support them natively yet.
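A sketch of batched, GPU-side augmentation with Kornia; the specific transforms are placeholder choices, but the point is that they operate on a whole (B, C, H, W) batch at once rather than per sample inside the DataLoader workers.

```python
import torch
import kornia.augmentation as K

augment = K.AugmentationSequential(
    K.RandomHorizontalFlip(p=0.5),
    K.ColorJitter(0.1, 0.1, 0.1, 0.1, p=0.8),
).cuda()

images = torch.rand(64, 3, 224, 224, device="cuda")  # dummy batch
augmented = augment(images)  # runs on the GPU, no per-sample Python loop
```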

NVIDIA's blog post about data transfer and benchmarking
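The usual pattern that post benchmarks is pinned (page-locked) host memory combined with non-blocking copies, so transfers can overlap with compute. A minimal sketch, with a dummy dataset standing in for real data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.rand(10_000, 3, 32, 32),
                        torch.randint(0, 10, (10_000,)))
loader = DataLoader(dataset, batch_size=256, pin_memory=True, num_workers=2)

device = torch.device("cuda")
for images, labels in loader:
    # non_blocking only helps when the source tensor lives in pinned memory
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward / backward step would go here ...
```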

Write a custom streaming DataLoader when the dataset is too big to fit in memory. Alternatively, periodically recreate a new Dataset object that contains only part of the whole dataset during training. This may require handling the sharding of the dataset yourself, but you can often just skip that: a few duplicated samples may even stabilize training. Make each sub-dataset as large as memory allows, since batching many small transfers between host memory and GPU memory into one larger transfer performs much better because it eliminates most of the per-transfer overhead.
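A minimal sketch of the sub-dataset idea above: rebuild an in-memory TensorDataset from one chunk of a dataset that is too large to load at once. The file layout, chunk contents, and chunk-per-epoch schedule are all hypothetical.

```python
import glob
import random
import torch
from torch.utils.data import DataLoader, TensorDataset

chunk_files = sorted(glob.glob("data/chunk_*.pt"))  # hypothetical pre-split chunks

def load_chunk(path: str) -> TensorDataset:
    # Each chunk holds as many samples as comfortably fit in host memory,
    # so host-to-GPU copies become a few large transfers, not many tiny ones.
    blob = torch.load(path)  # assumed to hold {"images": ..., "labels": ...}
    return TensorDataset(blob["images"], blob["labels"])

for epoch in range(10):
    # Pick one chunk per epoch; without strict sharding some samples may
    # repeat across epochs, which is usually acceptable.
    dataset = load_chunk(random.choice(chunk_files))
    loader = DataLoader(dataset, batch_size=256, shuffle=True, pin_memory=True)
    for images, labels in loader:
        images = images.cuda(non_blocking=True)
        labels = labels.cuda(non_blocking=True)
        # ... training step ...
```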

Paper: Profiling and Improving the PyTorch Dataloader for high-latency Storage: A Technical Report

PyTorch 效能懶人包 (a Chinese-language PyTorch performance cheat sheet)