baaivision/EVA

Cannot run detection training with multiple GPUs on a single node

lixi92 opened this issue · 3 comments

lixi92 commented

I tried to run the detection training code with multiple GPUs on a single node:

    python tools/lazyconfig_train_net.py --num-gpus 2 \
    --num-machines 1 --machine-rank 0 --dist-url "tcp://127.0.0.1:60900" \
    --config-file projects/ViTDet/configs/fetus/cascade_mask_rcnn_vitdet_eva.py \
    "train.init_checkpoint='eva_o365.pth'" \
    "train.output_dir='output'"

projects/ViTDet/configs/fetus/cascade_mask_rcnn_vitdet_eva.py is the config file for my custom dataset.
But I got no output at all.
What is going wrong?

What was the issue? Knowing would help me.

lixi92 commented

There was no problem with the official code; the problem was in the code I added.

This script worked with the official code.

Okay, great, thank you. Did you manage to run it on 2 nodes at the same time? I tried replacing --num-machines 1 with --num-machines 2, but it does not work. Maybe the script has to be launched on the 2 different nodes at the same time, with the same dist-url set on both for communication?

Also, how did you launch inference?
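
On the two-node question above, here is a minimal sketch of what that launch might look like. It assumes the script uses detectron2's standard launcher arguments (the flags in the original command match them), where --num-gpus counts GPUs per machine and --machine-rank must be unique per machine. It also assumes the dist-url has to point at an address of the rank-0 machine that both nodes can reach, since tcp://127.0.0.1 would make each node look at itself. The address 192.0.2.10 is a placeholder, not a value from this thread, and the train.* overrides from the original command are left out for brevity. The script only prints the commands; it does not start training.

    #!/usr/bin/env bash
    # Dry-run sketch: prints the command each node would run, nothing is launched.
    # MASTER_ADDR is a hypothetical placeholder for the rank-0 machine's address.
    MASTER_ADDR="${MASTER_ADDR:-192.0.2.10}"
    MASTER_PORT="${MASTER_PORT:-60900}"
    NUM_MACHINES=2     # total number of nodes
    GPUS_PER_NODE=2    # --num-gpus is assumed to be per machine

    # Print the launch command for one node, given its rank (0 or 1).
    launch_cmd() {
      local rank="$1"
      printf '%s ' \
        python tools/lazyconfig_train_net.py \
        --num-gpus "$GPUS_PER_NODE" \
        --num-machines "$NUM_MACHINES" \
        --machine-rank "$rank" \
        --dist-url "tcp://${MASTER_ADDR}:${MASTER_PORT}" \
        --config-file projects/ViTDet/configs/fetus/cascade_mask_rcnn_vitdet_eva.py
      printf '\n'
    }

    launch_cmd 0   # command for the first machine
    launch_cmd 1   # command for the second machine

Under these assumptions, the two printed commands differ only in --machine-rank, and each would be started on its own node at roughly the same time, which is close to what the comment above guesses.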