Exploring finetuning public checkpoints on filtered 8K sequences from the Pile
CUDA_VISIBLE_DEVICES=0 HF_MODULES_CACHE=./cache/ HF_DATASETS_CACHE=./cache/ TRANSFORMERS_CACHE=./cache/ python finetune.py --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --output_dir pythia-1.4b --gradient_accumulation_steps 8 --fp16 --evaluation_strategy "epoch" --max_steps 100000 --model_name_or_path EleutherAI/pythia-1.4b
Note that this self-contained script includes everything you need to run this finetuning, as long as you set up the dependencies, such as flash-attention, correctly. For a model of this size (the 1.4B checkpoint above), it should work on a single 80GB A100.
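Before launching, it can save time to confirm that flash-attention is actually importable in your environment. A minimal sanity check, assuming the standard `flash_attn` package from the official repo:

```bash
# Quick sanity check that flash-attention is importable before kicking off training.
python -c "import flash_attn; print(flash_attn.__version__)"
```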
HF_MODULES_CACHE=./cache/ HF_DATASETS_CACHE=./cache/ TRANSFORMERS_CACHE=./cache/ deepspeed --num_gpus=8 finetune.py --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --output_dir pythia-6.9b --gradient_accumulation_steps 8 --fp16 --evaluation_strategy "epoch" --max_steps 100000 --deepspeed ds_config.json --model_name_or_path EleutherAI/pythia-6.9b
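The command above expects a DeepSpeed config at ds_config.json. If you are not using the one shipped with this repo, a minimal sketch along the following lines usually works with the Hugging Face Trainer; the ZeRO stage and the exact fields here are assumptions, not this repo's actual config:

```bash
# A minimal, hypothetical DeepSpeed config; the repo's actual ds_config.json may differ.
# "auto" lets the Hugging Face Trainer fill in values from its own command-line arguments.
cat > ds_config.json <<'EOF'
{
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF
```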
If you hit "RuntimeError: Tensors must be contiguous" , follow this simple fix and modify your deepSpeed library
sbatch slurm.sh
Note that you can launch up to pythia-20B with 16 80GB A100s, i.e., two nodes. Since the above slurm script relies on OpenMPI, you should be able to generalize it to more than two nodes without problems.
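For orientation, a two-node submission script could look roughly like the sketch below. This is only an illustration built around DeepSpeed's OpenMPI launcher; the resource numbers and hostfile handling are assumptions, and the slurm.sh in this repo remains the source of truth.

```bash
#!/bin/bash
#SBATCH --job-name=pythia-finetune
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1    # the batch script runs once; DeepSpeed's launcher starts the per-GPU workers
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=32     # placeholder; size to your cluster

export HF_MODULES_CACHE=./cache/ HF_DATASETS_CACHE=./cache/ TRANSFORMERS_CACHE=./cache/

# Build a DeepSpeed hostfile from the SLURM allocation (8 slots = 8 GPUs per node).
scontrol show hostnames "$SLURM_JOB_NODELIST" | sed 's/$/ slots=8/' > hostfile

# Launch across both nodes via DeepSpeed's OpenMPI launcher.
deepspeed --launcher openmpi --hostfile hostfile finetune.py \
    --per_device_train_batch_size 1 --per_device_eval_batch_size 1 \
    --output_dir pythia-6.9b --gradient_accumulation_steps 8 \
    --fp16 --evaluation_strategy "epoch" --max_steps 100000 \
    --deepspeed ds_config.json --model_name_or_path EleutherAI/pythia-6.9b
```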
Not much besides a typical PyTorch and Transformers setup; the most likely issue will come from flash-attention, where you should follow the official repo's instructions exactly. If you have the option to use the Docker setup it provides, that will save you many headaches.
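As a concrete, hedged example, these are two common ways to get flash-attention working: the install steps follow the official repo, while the container tag is only an example and not something this repo pins.

```bash
# Option 1: build flash-attention into an existing PyTorch + CUDA environment.
pip install packaging ninja
pip install flash-attn --no-build-isolation

# Option 2: start from an NVIDIA PyTorch container (tag is just an example),
# which ships the CUDA toolchain needed to compile flash-attention.
docker run --gpus all -it --rm -v "$PWD":/workspace nvcr.io/nvidia/pytorch:23.07-py3
```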
- enable multiple GPUs and model parallelism