Info

I needed a more straightforward way to compare FSDP and non-FSDP performance, and to check whether alternate models are easy to use with FSDP or require alterations.

strain.py trains the model with simple sharding (device_map="auto") and without FSDP. train.py trains the model with FSDP.

run it up

To run the Mistral FSDP example:

torchrun --nnodes=1 --nproc-per-node=2 train.py --wandb_mode=online --wandb_group="fsdp/mistral-7b"

To run with a different model, you must also supply that model's decoder layer class via --decoder_layer_import, in the form "module.path,ClassName":

torchrun --nnodes=1 --nproc-per-node=8 train.py --wandb_mode=online --wandb_group="fsdp/fuyu-8b" --model_name="adept/fuyu-8b" --decoder_layer_import="transformers.models.persimmon.modeling_persimmon,PersimmonDecoderLayer"
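
The --decoder_layer_import value appears to be a module path and a class name separated by a comma. How train.py actually parses it is not shown here; the following is only a minimal sketch of one way such a string could be resolved, using a hypothetical helper named resolve_class. A class resolved this way is the kind of object PyTorch's transformer_auto_wrap_policy accepts in its transformer_layer_cls set, which is presumably why FSDP needs it for each new model.

```python
import importlib


def resolve_class(spec: str) -> type:
    """Resolve a "module.path,ClassName" string to the class it names.

    Hypothetical helper: the actual parsing in train.py may differ.
    """
    module_path, sep, class_name = spec.partition(",")
    if not sep or not module_path.strip() or not class_name.strip():
        raise ValueError(f'expected "module.path,ClassName", got {spec!r}')
    # Import the module by its dotted path, then look the class up on it.
    module = importlib.import_module(module_path.strip())
    return getattr(module, class_name.strip())


if __name__ == "__main__":
    # Standard-library stand-in so the sketch runs without transformers installed.
    print(resolve_class("collections,OrderedDict"))
```

With transformers installed, passing the string from the command above ("transformers.models.persimmon.modeling_persimmon,PersimmonDecoderLayer") would return the PersimmonDecoderLayer class.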

other notes

4/25/24
- After a recent transformers/PyTorch update, much of the FSDP functionality now seems to work on a machine with 8 GPUs.

original readme

from https://github.com/abacaj/fine-tune-mistral

fine-tune-mistral

Code used to fine-tune this model: abacaj/mistral-7b-sft. Add your data in the data folder as train.jsonl and validation.jsonl.
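
The record schema that train.py expects is not described in this section, so the sketch below only checks that every non-empty line of a JSONL file parses as a JSON object, without assuming any field names. The helper name check_jsonl is hypothetical and is not part of this repository.

```python
import json
from pathlib import Path


def check_jsonl(path: str) -> int:
    """Return the number of records in a JSONL file, raising on a bad line.

    Only checks that each non-empty line is a JSON object; field names are
    not validated because the expected schema is not known here.
    """
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({err.msg})") from err
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_no}: expected a JSON object")
            count += 1
    return count
```

Calling check_jsonl("data/train.jsonl") and check_jsonl("data/validation.jsonl") before launching a multi-GPU run is a cheap way to catch a malformed line early.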

How to run

Install dependencies:

python -m venv env \
  && source env/bin/activate \
  && pip install -r requirements.txt

Run training code:

torchrun --nnodes=1 --nproc-per-node=<REPLACE_WITH_NUMBER_OF_GPUS> train.py

Tips