trackmania-rl/tmrl

Not using GPU in training

Zach3292 opened this issue · 13 comments

As mentioned in the title, the program doesn't seem to use my NVIDIA GPU (RTX 3050 Ti) to train. Instead, CPU usage jumps to 100%.

[Screenshot: TMRL result]

Hi, have you set the trainer to cuda in config.json?

CUDA_TRAINING is set to true, but CUDA_INFERENCE is set to false.

I didn't change the config file; it is still the default.

Strange, the trainer terminal should be using your GPU when running training steps then. Can you try to open another terminal and run nvidia-smi while training steps are being performed?
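
You could also run a quick check in the same Python environment the trainer uses, to see whether PyTorch itself can reach the GPU. A minimal sketch; the config path assumes the default TmrlData folder in your home directory, adjust if yours differs:

```python
import json
from pathlib import Path

import torch

# Assumed default location of the tmrl config (adjust if yours differs)
config_path = Path.home() / "TmrlData" / "config" / "config.json"
config = json.loads(config_path.read_text())

print("CUDA_TRAINING in config:  ", config.get("CUDA_TRAINING"))
print("torch.cuda.is_available():", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU seen by PyTorch:      ", torch.cuda.get_device_name(0))
```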

So this is what I got from nvidia-smi while running the training; still no GPU usage in Task Manager.
[Screenshot: nvidia-smi output, 2022-07-22]

Nvidia-smi says that 50% of your GPU memory is used, but I am not sure whether this is from Trackmania or from the trainer terminal. What happens if you close the worker terminal and the game and execute nvidia-smi while the trainer terminal is still performing training steps?
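
Another way to tell the two apart is to ask PyTorch's own allocator from inside the trainer process: if these numbers stay near zero while training steps run, nothing the trainer owns is on the GPU. This is just a sketch assuming a single default CUDA device; tmrl does not print this by default as far as I know:

```python
import torch

# Memory held by *this* process through PyTorch's caching allocator.
# The game's VRAM usage will not show up here, only the trainer's tensors.
if torch.cuda.is_available():
    dev = torch.cuda.current_device()
    print(f"allocated: {torch.cuda.memory_allocated(dev) / 1e6:.1f} MB")
    print(f"reserved:  {torch.cuda.memory_reserved(dev) / 1e6:.1f} MB")
else:
    print("This process cannot see any CUDA device.")
```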

So after closing the game and the worker, the trainer still uses the CPU at 95%+, but the VRAM usage in nvidia-smi dropped to 1%, so I believe it was only the game using it.
[Screenshot: nvidia-smi output after closing the game and worker]

So weird; I would expect PyTorch to throw an error if it cannot use CUDA for any reason when CUDA_TRAINING is true.
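
One thing to keep in mind: a very common device-selection pattern never raises when CUDA is unusable, it just silently falls back to the CPU, which would match the 100% CPU usage you see. Illustrative sketch only, not necessarily how tmrl picks its device:

```python
import torch

# Silent fallback: no error is raised if CUDA is unavailable or broken,
# the model simply ends up training on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("selected device:", device)

# Forcing CUDA explicitly, by contrast, fails immediately when it cannot be used:
x = torch.zeros(1, device="cuda")  # raises if no usable CUDA device
print("tensor on:", x.device)
```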

This was with the laptop only. I'll try later with my main computer as the server and trainer to see if something similar happens.

I found this https://discuss.pytorch.org/t/nvidia-geforce-rtx-3050-ti-laptop-gpu-with-cuda-capability-sm-86-is-not-compatible-with-the-current-pytorch-installation/143837

Even though I don't really understand everything in it, I thought it might give you a clue as to what the problem may be.

Yes, that is the setup we use for real training. I have never tried CUDA-enabled training locally on my laptop, because I don't even have a CUDA-enabled version of PyTorch on it; I just use the laptop to run the worker. Still, it sounds weird that the worker doesn't saturate your laptop GPU; perhaps the CPU is a huge bottleneck in your setup, IDK.

Yup, that sounds relevant; perhaps your PyTorch installation is not compatible with your CUDA version (11.7 according to nvidia-smi)?
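
One way to check this hypothesis is to see which GPU architectures your PyTorch build was compiled for; an RTX 3050 Ti needs sm_86. Sketch below; the reinstall command is an assumption, check pytorch.org for the exact command for your setup:

```python
import torch

print("torch version:          ", torch.__version__)
print("compiled against CUDA:  ", torch.version.cuda)          # None means a CPU-only build
print("supported architectures:", torch.cuda.get_arch_list())  # should include 'sm_86'

# If 'sm_86' is missing (or torch.version.cuda is None), reinstalling a wheel
# built for a recent CUDA toolkit should fix it, e.g.:
#   pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu117
```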

Hi, could you solve/locate the issue?

Closing for inactivity, as I cannot reproduce the issue; please reopen if you experience something similar.