When I tried to train the model, there is a bug
Ryanye2000 opened this issue · 6 comments
Ryanye2000 commented
lkeab commented
What are your PyTorch and CUDA versions? Does model inference run normally?
Ryanye2000 commented
I installed PyTorch with this command:
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
Ryanye2000 commented
> What are your PyTorch and CUDA versions? Does model inference run normally?
My system CUDA version is 11.2, but I have already run other code with this same setup and it succeeded.
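For anyone comparing setups, here is a minimal sketch that prints the versions PyTorch itself reports. The CUDA version PyTorch was built against (here the conda `cudatoolkit`) can differ from the system CUDA install. The helper name `report_versions` is made up for illustration and is not part of this repo; it returns empty values if `torch` is not installed.

```python
import importlib.util


def report_versions() -> dict:
    """Collect the PyTorch version and the CUDA version PyTorch was built with.

    Hypothetical helper for illustration only; not part of train.py.
    """
    info = {"torch": None, "cuda_build": None, "cuda_available": False}
    # Skip the import entirely when torch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return info
    import torch

    info["torch"] = torch.__version__
    # CUDA version the PyTorch binary was compiled against (None for CPU-only builds).
    info["cuda_build"] = torch.version.cuda
    # Whether PyTorch can actually see a usable GPU right now.
    info["cuda_available"] = torch.cuda.is_available()
    return info


print(report_versions())
```

Posting this output together with the full error message makes it easier to tell whether the problem is in the environment or in the launch command.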
mzg0108 commented
I got the same error.
The demo code works fine and generates the segmentation results.
Have you found a solution?
crapthings commented
This worked for me: lower batch_size and nproc_per_node if you have only 1 GPU.
torchrun --nproc_per_node=2 train.py --checkpoint ./pretrained_checkpoint/sam_vit_h_4b8939.pth --batch_size_train 16 --model-type vit_h --output work_dirs/hq_sam_h
torchrun --nproc_per_node=2 train.py --checkpoint ./pretrained_checkpoint/sam_vit_l_0b3195.pth --batch_size_train 16 --model-type vit_l --output work_dirs/hq_sam_l
halqadasi commented
I solved this problem on Google Colab:
- After the library imports, add the following line:
local_rank = int(os.environ["LOCAL_RANK"])
- Remove this line from train.py:
parser.add_argument('--local_rank', type=int, help='local rank for dist')
- Change the command from:
python -m torch.distributed.launch train.py TRAIN_ARGS
to:
torchrun train.py TRAIN_ARGS
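The steps above can be sketched as follows. `torchrun` exports the `LOCAL_RANK` environment variable to each worker process, so the script reads it from the environment instead of declaring a `--local_rank` argument. This is only a sketch: the helper names `get_local_rank` and `build_parser` are hypothetical, the single `--batch_size_train` argument stands in for the real arguments in train.py, and the fallback to 0 when `LOCAL_RANK` is unset is my own assumption (the line above raises `KeyError` when the script is started without `torchrun`).

```python
import argparse
import os


def get_local_rank(default: int = 0) -> int:
    """Read the local rank that torchrun exports as LOCAL_RANK.

    Assumption: fall back to `default` when the variable is missing,
    e.g. when the script is started with plain `python train.py`.
    """
    return int(os.environ.get("LOCAL_RANK", default))


def build_parser() -> argparse.ArgumentParser:
    """Build an argument parser WITHOUT a --local_rank option.

    Hypothetical stand-in for the parser in train.py; only one
    illustrative argument is declared here.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size_train", type=int, default=4)
    return parser


# The rank now comes from the environment rather than the command line.
local_rank = get_local_rank()
print(f"local_rank={local_rank}")
```

With this in place the script is launched as `torchrun train.py TRAIN_ARGS`, and no `--local_rank` flag needs to be passed.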