Foundation Model Gym
What is learned from books will always feel shallow; to truly understand a thing, you must practice it yourself. (Lu You)
Step 0. PyTorch warmup
Step 1. Transformer (1 week)
targets
- a Transformer implementation from scratch (see the attention sketch at the end of this step)
requirements
- the PyTorch Transformer API, at the fine-grained, per-module level
- benchmarks (comparison against PyTorch's own implementation)
- the Transformer paper (Attention Is All You Need)
outputs
- a Transformer implementation matching the paper's reported performance
- implementation notes/wiki
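
To make the from-scratch and benchmarking targets above concrete, here is a minimal sketch of scaled dot-product attention checked against PyTorch's fused implementation; the shapes and tolerance are illustrative assumptions, and F.scaled_dot_product_attention requires PyTorch >= 2.0.

```python
# From-scratch scaled dot-product attention, checked against PyTorch's
# fused implementation (F.scaled_dot_product_attention, PyTorch >= 2.0).
import math
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, head_dim); mask: bool, True = keep."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (batch, heads, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    torch.manual_seed(0)
    q, k, v = (torch.randn(2, 8, 16, 64) for _ in range(3))   # illustrative shapes
    ours = scaled_dot_product_attention(q, k, v)
    ref = F.scaled_dot_product_attention(q, k, v)              # PyTorch's built-in
    print(torch.allclose(ours, ref, atol=1e-5))                # expect True
```
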
Step 2. core models (1 week)
targets
- a BERT implementation
- optimization skills (training on 1 GPU with gradient accumulation; see the sketch at the end of this step)
requirements
- data pre-processing pipeline:
- raw data + tokenizers (Hugging Face) --> pre-processed data --> truncated samples --> MLM/NSP training data (JSON)
- a Hugging Face BERT-base retrained on 10 million samples (for reference, full BERT-base training requires 8 V100s for 4 days), with learning curves and MLM/NSP accuracies, under two settings:
- 1 GPU with small batch sizes or gradient accumulation
- 8 GPUs with the official BERT setting
- the BERT paper
outputs
- a BERT-base model matching the reported benchmark performance
- implementation notes/wiki
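
A minimal sketch of the 1-GPU gradient-accumulation setting named in the targets; the model, data loader, and accumulation factor below are placeholders, not part of the official BERT recipe.

```python
# Gradient accumulation on a single GPU: small micro-batches are accumulated
# so the optimizer sees an effective batch close to the official multi-GPU
# setting. The model, data, and accumulation factor below are placeholders.
import torch

ACCUM_STEPS = 32                                   # effective batch = 8 * 32 = 256

model = torch.nn.Linear(128, 2)                    # stand-in for BERT + MLM/NSP heads
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()


def micro_batches():
    """Placeholder loader yielding (features, labels) micro-batches of size 8."""
    for _ in range(ACCUM_STEPS * 4):
        yield torch.randn(8, 128), torch.randint(0, 2, (8,))


optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches(), start=1):
    loss = loss_fn(model(x), y) / ACCUM_STEPS      # scale so gradients average over the large batch
    loss.backward()                                # gradients accumulate in .grad
    if step % ACCUM_STEPS == 0:                    # update once per effective batch
        optimizer.step()
        optimizer.zero_grad()
```
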
Step 3. core peripherals (2 weeks)
targets
- strategies for building vocabularies (WordPiece/SentencePiece/BPE tokenizers; see the tokenizer sketch at the end of this step)
- strategies for positional embeddings
- masking, sampling, ...
requirements
- raw data
- the Hugging Face tokenizers API
- data preprocessing papers
outputs
- data pre-processing pipelines
- implementation notes/wiki
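
A sketch of the vocabulary-building target using the Hugging Face `tokenizers` library; the corpus path, vocabulary size, and special tokens are placeholders (the special tokens below follow BERT).

```python
# Building a WordPiece vocabulary with the Hugging Face `tokenizers` library.
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30522,                              # BERT-base vocabulary size
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)   # corpus.txt: placeholder raw-text file
tokenizer.save("wordpiece.json")

print(tokenizer.encode("foundation models in the gym").tokens)
```

Switching to BPE only changes the model and trainer classes (BPE / BpeTrainer); SentencePiece is a separate library with a similar training interface.
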
Step 4. fine-tuning (1 week)
targets
- fine-tuning skills (see the GLUE sketch at the end of this step)
requirements
- GLUE evaluation toolkits
outputs
- fine-tuned BERTs matching the reported GLUE performance
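
A sketch of fine-tuning on one GLUE task (SST-2) with the Hugging Face Trainer; the checkpoint and hyperparameters are placeholders, and some argument names (e.g. passing the tokenizer to Trainer) vary across transformers versions.

```python
# Fine-tuning BERT on SST-2 (one GLUE task) with the Hugging Face Trainer.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

raw = load_dataset("glue", "sst2")
tok = AutoTokenizer.from_pretrained("bert-base-uncased")


def tokenize(batch):
    return tok(batch["sentence"], truncation=True, max_length=128)


def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}


encoded = raw.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="bert-sst2",
    learning_rate=2e-5,                            # within the range used in the BERT paper
    per_device_train_batch_size=32,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tok,                                 # enables the default padding collator
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())
```
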
Step 5. fm with decoders (2 weeks)
targets
- fm with decoders (GPT, UniLM, BART, T5; see the causal-LM sketch at the end of this step)
requirements
- raw data
- the Hugging Face API
outputs
- fm models matching the reported benchmark performance
- implementation notes/wiki
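
What distinguishes the decoder-style fm objective is causal language modeling; the sketch below uses Hugging Face GPT-2 as a stand-in for the models listed above, with a placeholder input sentence.

```python
# Causal language modeling with GPT-2: labels are the inputs themselves and
# the model shifts them internally so each position predicts the next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tok("paper knowledge is shallow without practice", return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])
print(out.loss)                                    # next-token prediction loss

# The causal mask that makes this a decoder: position i attends only to <= i.
seq_len = batch["input_ids"].size(1)
print(torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1))
```

BART and T5 train with encoder-decoder denoising objectives instead, but their decoder sides rely on the same causal masking.
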
Step 6. useful extensions (2 weeks)
more fm models
- XLNet (different pre-training objectives)
- TinyBERT, ALBERT (parameter sharing and compression; see the sharing sketch after this list)
- RoBERTa (data scale-up)
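
A sketch of the cross-layer parameter sharing idea behind ALBERT, using PyTorch's stock encoder layer; sizes follow BERT-base for illustration only, and this omits ALBERT's factorized embeddings.

```python
# ALBERT-style cross-layer parameter sharing: one encoder layer's weights are
# reused at every depth, cutting encoder parameters roughly by the layer count.
import torch
import torch.nn as nn


class SharedLayerEncoder(nn.Module):
    def __init__(self, d_model=768, nhead=12, num_layers=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):           # same weights applied at every depth
            x = self.layer(x)
        return x


enc = SharedLayerEncoder()
print(sum(p.numel() for p in enc.parameters()))    # ~1/12 of an unshared 12-layer stack
print(enc(torch.randn(2, 16, 768)).shape)          # torch.Size([2, 16, 768])
```
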
more pre-training and fine-tuning tricks
- learning rates (layer-wise learning rates, warmup; see the sketch after this list)
- training with resource constraints (early exit, gradient accumulation to emulate larger batch sizes)
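
A sketch of layer-wise learning rates and linear warmup on a Hugging Face BERT encoder; the base learning rate, decay factor, and step counts are placeholders, and the pooler is omitted for brevity.

```python
# Layer-wise learning-rate decay plus linear warmup for a BERT encoder:
# layers closer to the output keep the base LR, lower layers are scaled down.
import torch
from transformers import AutoModel, get_linear_schedule_with_warmup

model = AutoModel.from_pretrained("bert-base-uncased")
base_lr, decay = 2e-5, 0.95                        # placeholder hyperparameters

num_layers = len(model.encoder.layer)              # 12 for BERT-base
groups = []
for i, layer in enumerate(model.encoder.layer):
    scale = decay ** (num_layers - 1 - i)          # top layer: 1.0; lower layers: smaller
    groups.append({"params": layer.parameters(), "lr": base_lr * scale})
groups.append({"params": model.embeddings.parameters(),
               "lr": base_lr * decay ** num_layers})

optimizer = torch.optim.AdamW(groups)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=100_000)
# During training, call scheduler.step() after every optimizer.step().
```
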
more data pre-processing
- backdoor injection