Prepare the Vicuna weights
13b
bash make_vicuna_weights.sh
Training
Support
- Just support these GPU systems:
H100
,A100
,RTX 3090
,T4
,RTX 2080
.
Fine-tuning Vicuna-7B with Local GPUs
torchrun
is a utility for running distributed PyTorch jobs.
Options
--nproc_per_node
: Number of GPUs to use per node. Default is 1.--master_port
: The port to use for communication between processes. Default is 20001.FastChat/fastchat/train/train_mem.py
: The script to run using torchrun.--model_name_or_path
: The name or path of the pretrained model to use.--data_path
: The path to the training data.--bf16
: Whether to use bfloat16 precision. Default is False.--output_dir
: The directory to save output files.--num_train_epochs
: The number of training epochs to run. Default is 3.--per_device_train_batch_size
: The batch size per GPU for training. Default is 2.--per_device_eval_batch_size
: The batch size per GPU for evaluation. Default is 2.--gradient_accumulation_steps
: The number of batches to accumulate gradients over. Default is 16.--evaluation_strategy
: The strategy for evaluating the model. Default is "no".--save_strategy
: The strategy for saving the model. Default is "steps".--save_steps
: The number of steps between each save. Default is 1200.--save_total_limit
: The maximum number of checkpoints to keep. Default is 10.--learning_rate
: The learning rate for training. Default is 2e-5.--weight_decay
: The weight decay to use for training. Default is 0.0.--warmup_ratio
: The ratio of steps to use for warming up the learning rate. Default is 0.03.--lr_scheduler_type
: The type of learning rate scheduler to use. Default is "cosine".--logging_steps
: The number of steps between each logging. Default is 1.--fsdp
: The Fully Sharded Data Parallelism (FSDP) configuration. Default is "full_shard auto_wrap".--fsdp_transformer_layer_cls_to_wrap
: The name of the Transformer layer class to wrap in FSDP. Default is "LlamaDecoderLayer".--tf32
: Whether to use tf32 precision. Default is False.--model_max_length
: The maximum length of input sequences for the model. Default is 2048.--gradient_checkpointing
: Whether to use gradient checkpointing. Default is False.--lazy_preprocess
: Whether to lazily preprocess the input data. Default is False.
Example command to train Vicuna-7B with 2 x A100 (40GB).
torchrun --nproc_per_node=2 --master_port=20001 FastChat/fastchat/train/train_lora.py \
--model_name_or_path vicuna_weights/13b \
--data_path ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json \
--bf16 True \
--output_dir output \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 10 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--torch_compile True
Only one machine
python FastChat/fastchat/train/train_lora.py \
--model_name_or_path vicuna_weights/13b \
--data_path ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json \
--bf16 True \
--output_dir output \
--num_train_epochs 3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 10 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--lazy_preprocess True \
--report_to wandb \
--torch_compile True
conda create -n dev -c pytorch-nightly -c nvidia -c pytorch -c conda-forge python=3.8 pytorch torchaudio cudatoolkit pandas numpy
Serving
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.model_worker --model-name 'demandgpt-v1.0' --model-path vicuna_weights/13b/
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md
UI
docker build -t chatgpt-ui .
docker run -e OPENAI_API_KEY=EMPTY -e DEFAULT_MODEL=demandgpt-v1.0 -e OPENAI_API_HOST=<API_HOST> -p 3000:3000 chatgpt-ui
nginx running docker
sudo docker run --name ui_nginx -P -d nginx
sudo apt install nginx
content of 34.31.56.228
server {
listen 80;
server_name 34.31.56.228;
location / {
proxy_pass http://localhost:3000;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection 'upgrade';
proxy_set_header Host $host;
proxy_cache_bypass $http_upgrade;
}
}
sudo mv 34.31.56.228 /etc/nginx/sites-available/
sudo ln -s /etc/nginx/sites-available/34.31.56.228 /etc/nginx/sites-enabled/
Test
sudo nginx -t