AC-SGD

Code associated with the paper **Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees**.

Activation Compression with Guarantees

We explore how to fine-tune language models over slow networks using activation compression with guarantees. This is a research project developed by DS3Lab@ETH Zurich and HazyResearch@Stanford.

Cite Our Paper

@article{jue2022fine,
  title={Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees}, 
  author={Jue Wang and Binhang Yuan and Luka Rimanic and Yongjun He and Tri Dao and Beidi Chen and Christopher Re and Ce Zhang},
  year={2022},
  eprint={2206.01299},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}

Setup:

  • Create environment:

    conda create -n acsgd python=3.8
    conda activate acsgd
  • Install PyTorch env:

    pip3 install torch==1.9.0+cu111 torchtext -f https://download.pytorch.org/whl/torch_stable.html
    
    # Note: cupy-cuda111 does not work here; it appears to use different PTX from
    # torch, so install cupy-cuda110 instead.
    pip3 install cupy-cuda110==8.6.0

  • Other dependencies:

    pip3 install datasets==2.2.2
    pip3 install transformers==4.19.2
    pip3 install sentencepiece==0.1.96 # required by deberta
  • Setup network configuration (replace "ens3" with your machine's network interface name; a snippet to list the available interfaces follows after this list):

    export GLOO_SOCKET_IFNAME=ens3
    export NCCL_SOCKET_IFNAME=ens3
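
The interface name ("ens3" above) differs from machine to machine. If you are unsure which one to use, the following standard-library snippet (not part of this repo) lists the interfaces visible on a Linux machine:

    import socket

    # Print (index, name) pairs for the network interfaces on this machine,
    # e.g. (1, 'lo'), (2, 'ens3'). Use the non-loopback name for
    # GLOO_SOCKET_IFNAME / NCCL_SOCKET_IFNAME.
    for index, name in socket.if_nameindex():
        print(index, name)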

Run Distributed GPipe:

  • Partition the pre-trained model:

    # gpt2
    python convert_gpt2_checkpoint.py --model-name gpt2-xl --save-dir checkpoints/
        
    # or deberta 
    python convert_deberta_checkpoint.py --model-name deberta-v2-xxl --save-dir checkpoints/
  • On each node, run:

    # gpt2
    python dist_lm_runner.py $(echo ${ARGS}) --cuda-id 0 --rank i # (i=0,...,N-1)
        
    # or deberta
    python dist_deberta_runner.py $(echo ${ARGS}) --cuda-id 0 --rank i # (i=0,...,N-1)

    where "ARGS" contains training-related configurations, which should remain the same across nodes. An example could be:

    ARGS="--model-name checkpoints/gpt2-xl \
      --tokenizer-name gpt2-xl \
      --load-pretrained-model true \
      --task-name wikitext --n-epochs 10 --warmup-epochs 1 \
      --num-layers 6 --num-heads 25 --embedding-dim 1600 \
      --num-iters 10000000 --lr 5e-5 --seq-length 1024 --batch-size 32 --micro-batch-size 1 \
      --forward-compress-method delta \
      --forward-bits 4 \
      --backward-compress-method fixpoint \
      --backward-bits 8 \
      --dist-url tcp://XXX.XXX.XXX.XXX:9000 \
      --world-size N --pipeline-group-size N \
      --pp-mode gpipe --profiling no-profiling --do-evaluation true"

    Modify "--dist-url", "--world-size" and "--pipeline-group-size" before running.

    Complete examples can be found in "./run_lm.sh" and "./run_deberta.sh". A minimal single-machine launcher sketch follows below.
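
For a quick smoke test on a single multi-GPU machine, the per-rank commands can also be spawned from a small helper script. The sketch below is hypothetical (the process count, one-GPU-per-rank mapping, and the use of an ARGS environment variable are assumptions, not part of the repo); on a real cluster, run one command per node as described above.

    import os
    import subprocess

    N = 4                      # number of pipeline stages / local GPUs (placeholder)
    ARGS = os.environ["ARGS"]  # the same training configuration string as above

    # Start one dist_lm_runner.py process per rank, each pinned to its own GPU.
    procs = [
        subprocess.Popen(
            f"python dist_lm_runner.py {ARGS} --cuda-id {rank} --rank {rank}",
            shell=True,
        )
        for rank in range(N)
    ]
    for p in procs:
        p.wait()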

Arguments

Distributed Related

  • "--dist-url": tcp://XXX.XXX.XXX.XXX:9000
  • "--world-size": number of nodes that participate in the training.
  • "--pipeline-group-size": number of nodes that perform pipeline parallelism.
  • "--data-group-size": number of nodes that perform data parallelism.
  • "--rank": the rank of the current node. (0, ..., world_size-1)
  • "--profiling": "no-profiling" or "tidy_profiling". If "tidy_profiling", a trace file will be generated in "./trace_json/", which can be visualized with "chrome://tracing/".

Compression Related

  • "--forward-compress-method": "none", "fixpoint", "delta", or "delta-lowbits".
    • "none": do not compress.
    • "fixpoint": direct compress the activations. need to specify `"--forward-bits".
    • "delta": compress and communicate the delta of activations. need to specify "--forward-bits" and "--max-activation-cache-size".
    • "delta-lowbits": in addition to "delta", it also compresses the local cache (previous activations). need to specify "--forward-bits", "--forward-bits-act", and "--max-activation-cache-size".
  • "--backward-compress-method": "none" or "fixpoint".
    • "none": do not compress.
    • "fixpoint": direct compress the gradients. need to specify "--backward-bits".

Training Related

  • "--batch-size": macro-batch size.
  • "--micro-batch-size ": micro-batch-size. The macro-batch size should be divisible by micro-batch-size.
  • "--lr": the peak learning rate.
  • "--n-epochs": number of training epochs.
  • "--warmup-epochs": number of epochs for uncompressed training (transfer full-precision activations and gradients).
  • "--warmup-steps": number of training steps where the learning rate grows from 0 to "--lr". Default to be one training epoch.
  • "--do-evaluation": whether do evaluation during training.

Model Related

  • "--model-name": Name or path of the pretrained checkpoint. Usually should be a path to the checkpoint generated by "convert_xxx_checkpoint.py".
  • "--tokenizer-name": Name or path of the tokenizer.
  • "--load-pretrained-model": whether to load the pretrained checkpoint. The checkpoint should be generated by "convert_xxx_checkpoint.py".
  • "--num-layers", "--num-heads", "--embedding-dim" should be inline with the configuration of "--model-name".