OOM Error on 8xA100
Closed this issue · 1 comment
joshmyersdean commented
Hello!
Thank you again for this work. I am trying to fine-tune on some custom data (pretty much region-based GCG) using DeepSpeed with your train_ft.py script. I am getting OOM errors on a node with 8xA100 80GB. Is this similar to your training setup?
This is the command I am using:
#!/bin/bash
PRETRAINED_HF_PATH=MBZUAI/GLaMM-GranD-Pretrained #MBZUAI/GLaMM-FullScope
GROUNDING_ENC_CKPT_PATH=../../SAM-fine-tune/sam_vit_h_4b8939.pth
export MASTER_PORT=$(shuf -i 2000-65000 -n 1)
deepspeed --master_port "$MASTER_PORT" train_ft.py \
  --version "$PRETRAINED_HF_PATH" --dataset_dir /mnt/localssd/PIN \
  --vision_pretrained "$GROUNDING_ENC_CKPT_PATH" --lora_r 8 --lr 3e-4 --pretrained \
  --exp_name ./experiment --epochs 10 --steps_per_epoch 500 --no_eval #mask_validation
Thank you!
Josh
joshmyersdean commented
This turned out to be a problem in my custom dataset rather than the hardware: my questions and conversations were not being returned as a list.
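
For anyone hitting the same symptom, here is a minimal sketch of how a bare string in place of a list can inflate a batch. It assumes the collate step merges per-sample conversations with `list.extend`, which is an assumption about the pipeline and not something confirmed in this thread. The names `collate` and `as_list` and the sample text are hypothetical and only illustrate the idea.

```python
def collate(batch):
    """Hypothetical collate step: merges per-sample conversations with extend()."""
    conversation_list = []
    for sample in batch:
        # extend() iterates over its argument. A list of strings adds one
        # entry per conversation; a bare string adds one entry per character.
        conversation_list.extend(sample["conversations"])
    return conversation_list


def as_list(value):
    """Wrap a bare string in a list; convert other iterables to a list."""
    if isinstance(value, str):
        return [value]
    return list(value)


# Sample whose conversation is a bare string (the suspected bug).
bad_sample = {"conversations": "USER: Describe the region. ASSISTANT: A dog."}
# Same sample with the conversation wrapped in a list (the fix).
good_sample = {"conversations": as_list(bad_sample["conversations"])}

n_bad = len(collate([bad_sample]))    # one entry per character
n_good = len(collate([good_sample]))  # one entry per conversation
print(n_bad, n_good)
```

Under that assumption, a single sample would expand into as many "conversations" as it has characters, which could plausibly exhaust memory even on 80GB cards. Checking the types returned by the dataset's `__getitem__` before launching a multi-GPU run is a cheap way to rule this out.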