OOM Error on 8xA100

Question

Closed this issue 3 months ago · 1 comment

Hello!

Thank you again for this work. I am trying to fine-tune on some custom data (essentially region-based GCG) using DeepSpeed with your train_ft.py script, but I am getting OOM errors on a node with 8xA100 80GB. Is this similar to your training setup?

This is the command I am using:

#!/bin/bash
PRETRAINED_HF_PATH=MBZUAI/GLaMM-GranD-Pretrained #MBZUAI/GLaMM-FullScope
GROUNDING_ENC_CKPT_PATH=../../SAM-fine-tune/sam_vit_h_4b8939.pth
export MASTER_PORT=$(shuf -i 2000-65000 -n 1)

deepspeed --master_port $MASTER_PORT train_ft.py --version $PRETRAINED_HF_PATH --dataset_dir /mnt/localssd/PIN --vision_pretrained $GROUNDING_ENC_CKPT_PATH --lora_r 8 --lr 3e-4 --pretrained \
        --exp_name ./experiment --epochs 10 --steps_per_epoch 500 --no_eval #mask_validation

Thank you!
Josh

Answer 1 · 2024-10-05T16:07:30.000Z

This turned out to be a bug in my own data loading: my questions and conversations were not being returned as lists.
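
A minimal sketch of how this kind of bug can inflate a batch. It assumes the collate step flattens each sample's conversations with list.extend; that is an assumption for illustration, not something confirmed against the repository's collate function. The names collate_conversations and ensure_list and the sample text are hypothetical.

```python
def collate_conversations(batch):
    """Hypothetical collate step: flatten every sample's conversations into one list."""
    conversation_list = []
    for sample in batch:
        # extend() iterates its argument, so a bare string is split into characters
        conversation_list.extend(sample["conversations"])
    return conversation_list


def ensure_list(value):
    """Wrap a bare string in a list; convert any other iterable to a list."""
    if isinstance(value, str):
        return [value]
    return list(value)


text = "USER: Describe the region. ASSISTANT: A dog."

bad_sample = {"conversations": text}                 # bare string (the bug)
good_sample = {"conversations": ensure_list(text)}   # list with one conversation

bad_batch = collate_conversations([bad_sample])
good_batch = collate_conversations([good_sample])

print(len(bad_batch))   # one entry per character of the string
print(len(good_batch))  # one entry per conversation
```

Under that assumption, a single sample with a string in place of a list would yield as many "conversations" as it has characters, which could plausibly push memory use far beyond what the hardware would otherwise need. Checking the types returned by the dataset's item accessor before launching a multi-GPU run is a cheap way to rule this out.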