When I tried to train the model, there was a bug

Question

Ryanye2000 opened this issue 7 months ago · 6 comments

Ryanye2000 commented 7 months ago

I hit an error when I ran the training code. I only have one GPU, so I set --nproc_per_node to 1, but the error was still triggered. I do not know why.

Answer 1 · 2023-11-24T09:30:44.000Z

What are your PyTorch and CUDA versions? Does model inference run normally?

Answer 2 · 2023-11-25T08:54:30.000Z

I installed PyTorch with this command:
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch

Answer 3 · 2023-11-25T08:55:34.000Z

My CUDA version is 11.2, but I have already used this setup to run other code and it succeeded.
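Since the system CUDA version (11.2) and the cudatoolkit that conda installed (11.3) differ here, it can help to report what PyTorch itself sees. The following is a minimal sketch, not something posted in the thread; the helper name report_versions is made up for illustration, and it only uses standard PyTorch attributes.

```python
import importlib


def report_versions():
    """Return the PyTorch/CUDA details usually asked for in bug reports.

    `report_versions` is a hypothetical helper name, not part of train.py.
    """
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}
    return {
        "torch": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "visible_gpus": torch.cuda.device_count(),
    }


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(f"{key}: {value}")
```

If visible_gpus is 1, launching more than one process per node is a likely source of trouble, which matches the single-GPU setup described in the question.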

Answer 4 · 2023-12-06T01:19:32.000Z

I got the same error.
The demo code works fine and generates the segmented results.
Have you found a solution?

Answer 5 · 2023-12-06T07:08:32.000Z

This worked for me: lower batch_size and nproc_per_node if you only have 1 GPU.


torchrun --nproc_per_node=2 train.py --checkpoint ./pretrained_checkpoint/sam_vit_h_4b8939.pth --batch_size_train 16 --model-type vit_h --output work_dirs/hq_sam_h

torchrun --nproc_per_node=2 train.py --checkpoint ./pretrained_checkpoint/sam_vit_l_0b3195.pth --batch_size_train 16 --model-type vit_l --output work_dirs/hq_sam_l
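
The commands above still use two processes and a batch size of 16. For a single GPU, the same invocation could be assembled as in the sketch below. This is only an illustration: build_train_command is a hypothetical helper, the flags are the ones shown in the commands above, and the batch size of 4 is an arbitrary value chosen for the example, not one recommended in this thread.

```python
import shlex


def build_train_command(nproc_per_node, batch_size_train,
                        checkpoint="./pretrained_checkpoint/sam_vit_h_4b8939.pth",
                        model_type="vit_h",
                        output="work_dirs/hq_sam_h"):
    """Assemble the torchrun invocation as an argument list (hypothetical helper)."""
    return [
        "torchrun", f"--nproc_per_node={nproc_per_node}", "train.py",
        "--checkpoint", checkpoint,
        "--batch_size_train", str(batch_size_train),
        "--model-type", model_type,
        "--output", output,
    ]


# One process for one GPU; batch size 4 is an assumed, illustrative value.
single_gpu_cmd = build_train_command(nproc_per_node=1, batch_size_train=4)
print(shlex.join(single_gpu_cmd))
```

Passing the list to subprocess.run(single_gpu_cmd) would launch training; if memory errors persist, the batch size can be lowered further.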

Answer 6 · 2024-01-22T10:23:59.000Z

I solved this problem on Google Colab:

1. After the library imports in train.py, add the following line:
   local_rank = int(os.environ["LOCAL_RANK"])
2. Remove this line from train.py:
   parser.add_argument('--local_rank', type=int, help='local rank for dist')
3. Change the command from
   python -m torch.distributed.launch train.py TRAIN_ARGS
   to
   torchrun train.py TRAIN_ARGS
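
The idea behind these changes is that torchrun passes the local rank through the LOCAL_RANK environment variable instead of a command-line argument. A minimal sketch of that pattern follows; it is not the actual train.py. The helper names and the single --batch_size_train option are stand-ins, and the fallback to 0 when LOCAL_RANK is unset is an added assumption so the script also starts under plain python.

```python
import argparse
import os


def get_local_rank(default=0):
    """Read the local rank that torchrun exports as LOCAL_RANK."""
    # Falling back to `default` is an assumption made for this sketch;
    # the line quoted above would raise KeyError outside torchrun.
    return int(os.environ.get("LOCAL_RANK", default))


def build_parser():
    parser = argparse.ArgumentParser()
    # Hypothetical subset of the training arguments.
    # Note that no --local_rank option is registered any more.
    parser.add_argument("--batch_size_train", type=int, default=1)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"local_rank={get_local_rank()} batch_size_train={args.batch_size_train}")
```

Any code in the script that previously read args.local_rank would then need to use the value returned by get_local_rank() instead.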