the device used in training
Opened this issue · 1 comment
tianlianghai commented
What device did you use in training? A batch size of 512 per V100 (16 GB) leads to an OOM error,
but if I use a small batch, the loss goes to NaN.
train_fasternet_m(){
python train_test.py -g 0,1 --num_nodes 1 -n 4 -b 1024 -e 500 \
--pin_memory --wandb_project_name fasternet \
--model_ckpt_dir ./model_ckpt/$(date +'%Y%m%d_%H%M%S') --cfg cfg/fasternet_m.yaml
}
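A common workaround when a large batch OOMs but a small batch makes the loss unstable is gradient accumulation: process several small micro-batches, sum their gradients, and apply a single optimizer step, so the effective batch size matches the stable setting while memory only ever holds one micro-batch. The following is a minimal sketch of the idea with a toy one-weight model in plain Python; the variable names (`micro_batch`, `accum_steps`) are illustrative assumptions, not part of the repo's actual training script, which may expose an equivalent accumulation option through its own config.

```python
import random

random.seed(0)

# Toy data: y = 3x + small noise; we fit a single weight w.
data = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in [i / 50 for i in range(100)]]

w = 0.0
lr = 0.1
micro_batch = 8    # what fits in memory per forward/backward pass
accum_steps = 4    # effective batch = micro_batch * accum_steps = 32

for epoch in range(50):
    random.shuffle(data)
    grad_sum, seen = 0.0, 0
    for x, y in data:
        # gradient of 0.5 * (w*x - y)^2 with respect to w
        grad_sum += (w * x - y) * x
        seen += 1
        # apply one optimizer step only after a full effective batch
        if seen == micro_batch * accum_steps:
            w -= lr * grad_sum / seen  # average gradient over effective batch
            grad_sum, seen = 0.0, 0

print(w)  # w should approach the true slope of 3.0
```

The same principle applies unchanged to a deep-learning training loop: zero the gradients only every `accum_steps` iterations and scale the loss (or the accumulated gradient) by the number of accumulated steps before stepping.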
99-WSJ commented
Hello, did you solve it? It also occurs in my experiments.