SysCV/sam-hq

When I tried to train the model, there is a bug

Ryanye2000 opened this issue · 6 comments

[Screenshot 2023-11-23 181537] I found a bug when I ran the training code. I only have one GPU, so I set --nproc_per_node to 1, but the bug was still triggered. I do not know why.

What are your PyTorch and CUDA versions? Does model inference run normally?

I installed PyTorch with this command: conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch

And my CUDA is 11.2. But I have already used this setup to run other code, and it succeeded.
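
For anyone answering the version question, a small sketch like the one below prints the relevant details from inside the training environment. The function name `describe_env` is made up for illustration. Note that `torch.version.cuda` reports the CUDA version the PyTorch binary was built with (the `cudatoolkit` from conda), which can differ from the system-wide CUDA install.

```python
def describe_env() -> dict:
    """Collect the PyTorch / CUDA details asked about in this thread."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}
    return {
        "torch": torch.__version__,
        # CUDA version the PyTorch binary was built against, not the system toolkit.
        "cuda_built_with": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        # Number of GPUs visible to this process.
        "gpu_count": torch.cuda.device_count(),
    }


print(describe_env())
```

For GPU training, `gpu_count` is presumably the largest value that makes sense for --nproc_per_node on that machine.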

I got the same error.
The demo code works fine and generates the segmented results.
Have you found a solution?

This works for me: lower batch_size and nproc_per_node if you only have 1 GPU.


torchrun --nproc_per_node=2 train.py --checkpoint ./pretrained_checkpoint/sam_vit_h_4b8939.pth --batch_size_train 16 --model-type vit_h --output work_dirs/hq_sam_h

torchrun --nproc_per_node=2 train.py --checkpoint ./pretrained_checkpoint/sam_vit_l_0b3195.pth --batch_size_train 16 --model-type vit_l --output work_dirs/hq_sam_l

I solved this problem on Google Colab:

  • After the library imports, add the following line:
    local_rank = int(os.environ["LOCAL_RANK"])
  • Remove this line from train.py:
    parser.add_argument('--local_rank', type=int, help='local rank for dist')
  • Change the command from:
    python -m torch.distributed.launch train.py TRAIN_ARGS
    to:
    torchrun train.py TRAIN_ARGS
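
The first two steps above can be sketched roughly as follows. This is a minimal illustration, not the actual train.py: the helper names `get_local_rank` and `build_parser` are hypothetical, and the default for --batch_size_train is an assumption. It relies on the fact that torchrun sets the LOCAL_RANK environment variable for each worker instead of passing a --local_rank argument. The fallback to 0 is an addition here so that a plain `python train.py` run does not fail with a KeyError.

```python
import argparse
import os


def get_local_rank(default: int = 0) -> int:
    # torchrun exports LOCAL_RANK for every worker it starts; the fallback
    # covers a plain `python train.py` run, where the variable is absent.
    return int(os.environ.get("LOCAL_RANK", default))


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, trimmed-down parser: there is no --local_rank argument,
    # because torchrun does not pass one on the command line.
    parser = argparse.ArgumentParser(description="train.py arguments (sketch)")
    # The default value below is assumed for illustration only.
    parser.add_argument("--batch_size_train", type=int, default=4)
    return parser
```

With this in place, `local_rank = get_local_rank()` replaces the value that used to come from `args.local_rank`.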