# of GPU issue

Question

elPerro92 opened this issue 5 years ago · 6 comments

Hi, I'm training on my images with this method. I have a PC with 2 GPUs (RTX 2080), and in train_config.py I have set the line:

config.TRAIN.num_gpu = 2

but whenever I start the training, it only uses the first GPU.

Answer 1 · 2019-10-08T10:47:39.000Z

You still need to make the devices visible by setting os.environ["CUDA_VISIBLE_DEVICES"] = "0,1".
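
A minimal sketch of that setup, assuming the variable is set at the top of the training script before TensorFlow is imported (CUDA reads it when it initializes, so changing it later has no effect). The helper name `make_gpus_visible` is hypothetical and not part of this repository:

```python
import os


def make_gpus_visible(num_gpu):
    """Expose the first `num_gpu` GPUs to CUDA (hypothetical helper)."""
    device_ids = ",".join(str(i) for i in range(num_gpu))
    os.environ["CUDA_VISIBLE_DEVICES"] = device_ids
    return device_ids


# Call this before `import tensorflow`; 2 matches config.TRAIN.num_gpu.
visible = make_gpus_visible(2)
print(visible)  # 0,1
```

Keeping the count in one place avoids the mismatch described here, where the config asks for two GPUs but only one is visible to the process.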

Answer 2 · 2019-10-08T11:03:07.000Z

Thanks very much, now it's working. I'm new to TensorFlow ;). Is it possible to resume the training from the last checkpoint?

Answer 3 · 2019-10-08T11:18:52.000Z

You're welcome. You can resume by setting:

config.MODEL.continue_train=True
config.MODEL.pretrained_model='the_pretrained.ckpt'

Also, I suggest you not use this code now, because TF 2.0 has been released. It is better to learn the new version, which is friendlier. I am working on it : )

Answer 4 · 2019-10-08T12:41:53.000Z

Thanks for the fast response. I'm using the 1.14-gpu version because I'm using an Nvidia Jetson for the landmark recognition, and that is the latest version for it. I will use 2.0 when a stable version is released for Jetson.

Answer 5 · 2019-10-10T11:51:09.000Z

Hi, I've trained with 2 GPUs, but the time for one epoch is the same as with one GPU (56 minutes).
When I try to restore the last checkpoint, the terminal shows this error:

tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Tensor name "ShuffleNetV2/Stage2/unit_1/conv1x1_after/BatchNorm/beta" not found in checkpoint files model/epoch_46L2_1e-05.ckpt.index
[[node save/RestoreV2 (defined at /home/USER/path/to/face_landmark-master/lib/core/base_trainer/net_work.py:72) ]]

How can I restore the checkpoint correctly?

Answer 6 · 2019-10-10T12:31:04.000Z

Hi,
it should be config.MODEL.pretrained_model = 'model/epoch_46L2_1e-05.ckpt', without the .index suffix.
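
For context, a TensorFlow 1.x checkpoint is saved as several files sharing one prefix (for example `.index`, `.meta`, and `.data-00000-of-00001`), and the restore path is that shared prefix. A small sketch that strips those suffixes from a file name, assuming the standard naming; `checkpoint_prefix` is a hypothetical helper, not part of this repository:

```python
import re

# Suffixes the TensorFlow saver appends to a checkpoint prefix.
_CKPT_SUFFIX = re.compile(r"\.(index|meta|data-\d{5}-of-\d{5})$")


def checkpoint_prefix(path):
    """Return the checkpoint prefix for any file written by the saver."""
    return _CKPT_SUFFIX.sub("", path)


print(checkpoint_prefix("model/epoch_46L2_1e-05.ckpt.index"))
# model/epoch_46L2_1e-05.ckpt
```

A path that is already a prefix is returned unchanged, so the result can be assigned to config.MODEL.pretrained_model either way.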