# of GPU issue
elPerro92 opened this issue · 6 comments
Hi, I'm training on the images with this method. I have a PC with 2 GPUs (RTX 2080), and in train_config.py I have set the line:
config.TRAIN.num_gpu = 2
but whenever I start training, it only uses the first GPU.
You still need to make the devices visible by setting os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
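A minimal sketch of that setting, assuming it is placed at the very top of the training entry script (the exact file is not named in this thread). The variable has to be set before TensorFlow initializes CUDA, so the safest place is before the TensorFlow import:

```python
import os

# Expose GPUs 0 and 1 to this process. This must run before TensorFlow
# (or any other CUDA library) initializes, so keep it above that import.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Sanity check using only the standard library: list the device ids
# this process is allowed to see.
visible_ids = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print("Visible GPU ids:", visible_ids)
```

The same effect can be had without touching the code by exporting CUDA_VISIBLE_DEVICES in the shell before launching the script.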
Thanks very much, now it's working. I'm new to TensorFlow ;). Is it possible to resume training from the last checkpoint?
You're welcome. You can resume by setting:
config.MODEL.continue_train = True
config.MODEL.pretrained_model = 'the_pretrained.ckpt'
Also, I suggest not using this code for now, because TF 2.0 has been released. It is better to learn the new version, which is friendlier. I am working on it : )
Thanks for the fast response. I'm using the 1.14-gpu version because I'm using an Nvidia Jetson for the landmark recognition, and that is the latest version for it. I will use 2.0 once a stable version is released for the Jetson.
Hi, I've trained with 2 GPUs but the time to do one epoch is the same as with one GPU (56 minutes).
When I try to restore the last checkpoint, the terminal shows this error:
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Tensor name "ShuffleNetV2/Stage2/unit_1/conv1x1_after/BatchNorm/beta" not found in checkpoint files model/epoch_46L2_1e-05.ckpt.index
[[node save/RestoreV2 (defined at /home/USER/path/to/face_landmark-master/lib/core/base_trainer/net_work.py:72) ]]
How can I restore the checkpoint correctly?
Hi,
it should be config.MODEL.pretrained_model = 'model/epoch_46L2_1e-05.ckpt', without the .index suffix.
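As a small illustration of the point above: a TensorFlow 1.x checkpoint is stored as several files (.index, .meta, .data-...) that share one prefix, and the prefix is what gets passed as the model path. The helper below is hypothetical and not part of this repository. It is only a sketch of how a file name could be turned back into that prefix:

```python
import re

def checkpoint_prefix(path):
    """Hypothetical helper: strip a trailing '.index', '.meta' or
    '.data-XXXXX-of-XXXXX' suffix and return the checkpoint prefix."""
    return re.sub(r"\.(index|meta|data-\d+-of-\d+)$", "", path)

# The path from the error message, reduced to the prefix to use in the config.
prefix = checkpoint_prefix("model/epoch_46L2_1e-05.ckpt.index")
print(prefix)
```

A path that is already a bare prefix passes through unchanged, so the helper is safe to apply to either form.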