Model-Parallelism-Workshop

Code and notes associated with the model parallelism workshop delivered by NVIDIA.

Slides

You can access the workshop slides via this link. Slide decks 1 and 2 correspond to Lab 1, and slide deck 3 corresponds to Lab 2.

Lab 1

Lab 1 covers the material needed to scale the training of large neural models to multiple GPUs.

Notebooks summary

  1. In notebook 01, we give an overview of the class environment and introduce some basic Slurm commands. Throughout the workshop, we use two nodes, each containing two GPUs.
  2. In notebook 02, we give an introduction to distributed training strategies and use the PyTorch distributed launcher to scale the pretraining of GPT to multiple GPUs within a single node (a minimal launch sketch follows this list).
  3. In notebook 03, we scale the training to multiple nodes and profile it using the PyTorch profiler (see the profiling sketch after this list). We also introduce the concept of hybrid parallelism by running the pretraining with both tensor parallelism and pipeline parallelism.
  4. In notebook 04, we introduce possible optimizations to the pretraining of GPT. We present concepts such as mixed precision training, activation checkpointing, and gradient accumulation (a training-step sketch follows this list). Moreover, the notebook introduces some useful utilities, such as computing the number of parameters of a model and estimating the peak FLOPS and the amount of time needed to train the model.
  5. In notebook 05, we scale the training of an image classifier using DeepSpeed and the Zero Redundancy Optimizer (ZeRO); a minimal configuration sketch follows this list.
  6. In notebook 06, we introduce the concept of a mixture-of-experts architecture and show how we can add 'expert layers' to a model using DeepSpeed.
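
Notebook 02 launches the pretraining with the PyTorch distributed launcher (torchrun). The sketch below is a minimal single-node DistributedDataParallel example; the toy model, synthetic data, and hyperparameters are placeholders, not the workshop's GPT pretraining script.

```python
# Minimal DistributedDataParallel (DDP) sketch; launch with, e.g.:
#   torchrun --nproc_per_node=2 ddp_minimal.py
# The model and data below are placeholders, not the workshop's GPT setup.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # toy model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")  # synthetic batch
        loss = model(x).pow(2).mean()
        loss.backward()                                     # DDP all-reduces gradients
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```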
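
Notebook 03 profiles the run with the PyTorch profiler. The sketch below shows the general torch.profiler pattern on a toy training step; it is an illustration, not the workshop's multi-node profiling setup.

```python
# torch.profiler sketch for a single training step (toy model, synthetic data).
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(32, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Print the operators that dominate GPU time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```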
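
Notebook 04 covers mixed precision training and gradient accumulation. The sketch below shows the standard torch.cuda.amp pattern combined with gradient accumulation on a toy model; the accumulation factor and batch sizes are illustrative assumptions.

```python
# Mixed precision + gradient accumulation sketch (placeholder model and data).
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()        # scales the loss to avoid fp16 underflow
accumulation_steps = 4                      # effective batch = 4 x micro-batch

for step in range(100):
    x = torch.randn(8, 1024, device="cuda")             # synthetic micro-batch
    with torch.cuda.amp.autocast():                      # forward pass in mixed precision
        loss = model(x).pow(2).mean() / accumulation_steps
    scaler.scale(loss).backward()                        # accumulate scaled gradients
    if (step + 1) % accumulation_steps == 0:
        scaler.step(optimizer)                           # unscale + optimizer step
        scaler.update()
        optimizer.zero_grad()
```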
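
Notebook 05 trains with DeepSpeed and ZeRO. The sketch below shows how a model is typically wrapped with deepspeed.initialize under a ZeRO stage 2 configuration; the config values and the toy model are assumptions, not the workshop's exact settings.

```python
# Minimal DeepSpeed + ZeRO sketch; launch with the deepspeed launcher, e.g.:
#   deepspeed --num_gpus=2 ds_minimal.py
# The config values and toy model are illustrative, not the workshop's settings.
import argparse
import torch
import deepspeed

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)  # injected by the launcher
args = parser.parse_args()

ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},       # partition optimizer states and gradients
}

model = torch.nn.Linear(1024, 1024)          # toy model
engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(10):
    x = torch.randn(8, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).pow(2).mean()
    engine.backward(loss)                    # DeepSpeed handles loss scaling/reduction
    engine.step()
```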

Lab 2

Lab 2 covers the material needed to deploy a GPT model into production using NVIDIA's FasterTransformer library and NVIDIA's Triton Inference Server.

Notebooks summary

  1. In notebook 02, we deployed a 6B-parameter GPT-J model using nothing but PyTorch and the transformers library (a baseline inference sketch follows this list). We used the deployed instance to perform few-shot learning on the task of machine translation. Finally, we measured the inference time to use as a baseline for the next two notebooks.
  2. In notebook 03, we deployed the same model using NVIDIA's FasterTransformer library. We ran the inference on one GPU and then extended it to two GPUs using tensor parallelism.
  3. In notebook 04, we deployed the model into production using NVIDIA's Triton Inference Server; a minimal client sketch follows this list.
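
For the baseline in notebook 02, the sketch below loads GPT-J with the transformers library and times a single generation; the prompt, precision, and generation settings are illustrative assumptions.

```python
# Hugging Face transformers baseline sketch for GPT-J inference timing.
# The prompt and generation settings are illustrative, not the workshop's.
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

# A tiny few-shot translation prompt (placeholder examples).
prompt = (
    "English: How are you?\nFrench: Comment allez-vous ?\n"
    "English: Good morning.\nFrench:"
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=20)
latency = time.perf_counter() - start

print(tokenizer.decode(output[0], skip_special_tokens=True))
print(f"Generation latency: {latency:.2f} s")
```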
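
For notebook 04, the sketch below shows the general pattern of querying a Triton-served model with the tritonclient HTTP API; the model name, tensor names, dtypes, and shapes depend on the deployed model's config.pbtxt and are assumptions here.

```python
# Generic Triton HTTP client sketch. The model name and tensor names/shapes
# are assumptions; they must match the deployed model's config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical tokenized prompt for a GPT-style model.
input_ids = np.array([[818, 257, 1598]], dtype=np.uint32)

inputs = [httpclient.InferInput("input_ids", input_ids.shape, "UINT32")]
inputs[0].set_data_from_numpy(input_ids)

outputs = [httpclient.InferRequestedOutput("output_ids")]

result = client.infer(model_name="fastertransformer", inputs=inputs, outputs=outputs)
print(result.as_numpy("output_ids"))
```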

Papers

  1. Deep Learning Scaling is Predictable, Empirically
  2. Scaling Laws for Autoregressive Generative Modeling
  3. Language Models are Few-Shot Learners
  4. The Power of Scale for Parameter-Efficient Prompt Tuning
  5. Multitask Prompted Training Enables Zero-Shot Task Generalization
  6. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
  7. Training Compute-Optimal Large Language Models
  8. Sequence Parallelism: Long Sequence Training from System Perspective
  9. Reducing Activation Recomputation in Large Transformer Models
  10. ZeRO-Offload: Democratizing Billion-Scale Model Training
  11. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
  12. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

External Links

  1. Megatron LM Documentation From HuggingFace
  2. Microsoft DeepSpeed introduction at KAUST
  3. Megatron LM Repo