
Foundation Model Gym

What is learned from books will always feel shallow; to truly understand something, you must practice it yourself. (Lu You)

Step 0. PyTorch warmup
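
A minimal warmup sketch of the kind this step implies: a hand-written PyTorch training loop on a toy regression problem (the model, data, and hyperparameters below are illustrative, not part of the plan).

```python
import torch
from torch import nn

# toy regression: recover y = 3x + 1 with a single linear layer
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

x = torch.randn(256, 1)
y = 3 * x + 1

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

print(model.weight.item(), model.bias.item())  # expect roughly 3 and 1
```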

Step 1. Transformer (1 week)

targets

  • a transformer implementation from scratch

requirements

  • the PyTorch Transformer API at a fine-grained level (e.g. nn.MultiheadAttention, nn.TransformerEncoderLayer)
  • benchmarks against PyTorch's built-in implementation (see the sketch at the end of this step)
  • the Transformer paper (Attention Is All You Need)

outputs

  • a Transformer implementation matching the performance reported in the paper
  • implementation notes/wiki
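
As a concrete starting point for the from-scratch implementation and the benchmark requirement, a hedged sketch of the attention core, checked against PyTorch's fused kernel (assumes PyTorch >= 2.0; shapes and tolerances are illustrative):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# sanity check against PyTorch's built-in implementation
torch.manual_seed(0)
q, k, v = (torch.randn(2, 8, 16, 64) for _ in range(3))
ours = scaled_dot_product_attention(q, k, v)
ref = F.scaled_dot_product_attention(q, k, v)
print(torch.allclose(ours, ref, atol=1e-5))  # expect True
```

Full multi-head attention then wraps this core with query/key/value projections, head splitting, and an output projection.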

Step 2. core models (1 week)

targets

  • a BERT implementation
  • optimization skills: training on a single GPU with gradient accumulation (sketched at the end of this step)

requirements

  • data pre-processing pipeline:
    • Hugging Face tokenizers + raw data --> pre-processed data --> truncated samples --> MLM/NSP training data (json); see the sketch after this list
  • a Hugging Face BERT-base retrained on 10 million samples (for reference, full BERT-base pre-training takes roughly 8 V100s × 4 days), together with its learning curves and MLM/NSP accuracies
    • 1 GPU with small batch sizes or gradient accumulation
    • 8 GPUs with the official BERT setting
  • the BERT paper
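
A hedged sketch of the MLM half of the pipeline above, using a pre-trained WordPiece tokenizer and the 80/10/10 masking scheme from the BERT paper; corpus.txt and mlm_train.json are placeholder file names, and the NSP sentence-pairing step is omitted:

```python
import json
import random
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def make_mlm_sample(text, max_len=128, mlm_prob=0.15):
    ids = tokenizer(text, truncation=True, max_length=max_len)["input_ids"]
    labels = [-100] * len(ids)                 # -100 = ignored by the MLM loss
    for i in range(1, len(ids) - 1):           # skip [CLS] and [SEP]
        if random.random() < mlm_prob:
            labels[i] = ids[i]
            r = random.random()
            if r < 0.8:
                ids[i] = tokenizer.mask_token_id                  # 80%: [MASK]
            elif r < 0.9:
                ids[i] = random.randrange(tokenizer.vocab_size)   # 10%: random token
            # remaining 10%: keep the original token
    return {"input_ids": ids, "labels": labels}

with open("corpus.txt") as src, open("mlm_train.json", "w") as out:
    for line in src:
        if line.strip():
            out.write(json.dumps(make_mlm_sample(line.strip())) + "\n")
```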

outputs

  • a BERT-base model matching the reported benchmark performance
  • implementation notes/wiki
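
A hedged sketch of the single-GPU gradient-accumulation setup from the requirements above, on a toy model so it runs as-is; for this step the stand-ins would be a BertForPreTraining model and a DataLoader over the MLM/NSP json:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(128, 2)                                   # stand-in for BERT
loader = DataLoader(TensorDataset(torch.randn(256, 128),
                                  torch.randint(0, 2, (256,))), batch_size=8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 32                      # effective batch size = 8 * 32 = 256
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accumulation_steps   # scale so summed grads average
    loss.backward()                                    # grads accumulate across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```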

Step 3. core peripherals (2 weeks)

targets

  • strategies for building the vocabulary with tokenizers (WordPiece / SentencePiece / BPE); see the tokenizer sketch at the end of this step
  • strategies for positional embeddings (see the sketch after this list)
  • masking, sampling, ...
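
For the positional-embedding target above, a small sketch contrasting the fixed sinusoidal table from the Transformer paper with the learned lookup table BERT uses (the dimensions are illustrative):

```python
import math
import torch

def sinusoidal_positions(max_len, d_model):
    # fixed sinusoidal table from "Attention Is All You Need"
    pos = torch.arange(max_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

fixed = sinusoidal_positions(512, 768)        # no parameters, defined by a formula
learned = torch.nn.Embedding(512, 768)        # trainable table, what BERT uses
```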

requirements

  • raw data
  • the Hugging Face tokenizers API
  • data pre-processing papers

outputs

  • data pre-processing pipelines
  • implementation notes/wiki
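
A hedged sketch of vocabulary building with the Hugging Face tokenizers library (WordPiece here; swapping in the BPE model and trainer follows the same pattern); corpus.txt is a placeholder for the raw data:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30522,                          # BERT-base vocabulary size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("wordpiece.json")

print(tokenizer.encode("foundation model gym").tokens)
```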

Step 4. fine-tuning (1 week)

targets

  • fine-tuning skills

requirements

  • GLUE evaluation toolkits

outputs

  • fine-tuned BERTs matching the reported GLUE performance (sketched below)
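
A hedged sketch of one GLUE task (SST-2) with the Hugging Face Trainer; bert-base-uncased stands in for the model pre-trained in Step 2, and the hyperparameters follow the usual BERT fine-tuning range rather than a tuned recipe:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# tokenize SST-2; the Trainer's default collator pads each batch dynamically
dataset = load_dataset("glue", "sst2").map(
    lambda ex: tokenizer(ex["sentence"], truncation=True, max_length=128), batched=True)

args = TrainingArguments(output_dir="sst2-bert", learning_rate=2e-5,
                         per_device_train_batch_size=32, num_train_epochs=3)
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=dataset["train"], eval_dataset=dataset["validation"])
trainer.train()
print(trainer.evaluate())                      # loss on the GLUE dev split
```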

Step 5. fm with decoders (2 weeks)

targets

  • fm with decoders (GPT, UniLM, BART, T5); see the causal-mask sketch at the end of this step

requirements

  • raw data
  • the Hugging Face API

outputs

  • fm models matching the reported benchmark performance
  • implementation notes/wiki
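
The core difference between these decoder-bearing models and BERT's bidirectional encoder is the causal attention mask; a tiny illustrative sketch:

```python
import torch

def causal_mask(seq_len):
    # position i may attend only to positions <= i
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(1, 4, 4)                              # (batch, query, key) logits
masked = scores.masked_fill(~causal_mask(4), float("-inf"))
weights = torch.softmax(masked, dim=-1)
print(weights[0])                                          # upper triangle is all zeros
```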

Step 6. useful extensions (2 weeks)

more fm models

  • XLNet (a different pre-training objective)
  • TinyBERT, ALBERT (parameter sharing and compression)
  • RoBERTa (data and training scale-up)

more pre-training, fine-tuning tricks

  • learning rates (layer-wise learning rates, warmup); sketched below
  • training under resource constraints (early exit, gradient accumulation for larger effective batch sizes)
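
A hedged sketch of the learning-rate tricks above (layer-wise decay via parameter groups plus linear warmup), shown on a toy layer stack; a real BERT would build one parameter group per encoder layer in the same way:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

layers = torch.nn.ModuleList([torch.nn.Linear(64, 64) for _ in range(4)])

base_lr, decay = 1e-4, 0.9
param_groups = [
    {"params": layer.parameters(), "lr": base_lr * decay ** (len(layers) - 1 - i)}
    for i, layer in enumerate(layers)          # lower layers get smaller learning rates
]
optimizer = AdamW(param_groups)

warmup_steps, total_steps = 100, 1000
scheduler = LambdaLR(optimizer, lambda step: min(
    (step + 1) / warmup_steps,                                        # linear warmup
    max(0.0, (total_steps - step) / (total_steps - warmup_steps))))   # then linear decay

for step in range(total_steps):
    optimizer.step()                           # forward/backward omitted in this sketch
    scheduler.step()
```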

more data pre-processing

  • backdoor injection
