CUDA on WSL hangs after ~1h training
FremyCompany opened this issue · 8 comments
Windows Build Number
Microsoft Windows [Version 10.0.22458.1000]
WSL Version
- WSL 2
- WSL 1
Kernel Version
5.4.91
Distro Version
Ubuntu 20.04
Other Software
No response
Repro Steps
While training DNN models with an NVIDIA GPU via CUDA on WSL2, the training eventually hangs. There is no crash or error message; the process simply remains stuck indefinitely.
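Since the hang produces no crash or error, one way to make it visible is a watchdog around each training step. A minimal sketch (the step function, timeout value, and error message are illustrative assumptions, not from the original report):

```python
import threading

def run_with_watchdog(step_fn, timeout_s=120.0):
    """Run one training step; raise if it exceeds timeout_s.

    step_fn stands in for whatever executes a single CUDA training
    step. A hung CUDA call never returns, so the watchdog fires
    instead of the process stalling silently.
    """
    done = threading.Event()
    result = {}

    def worker():
        result["value"] = step_fn()
        done.set()

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    if not done.wait(timeout_s):
        raise TimeoutError(f"training step exceeded {timeout_s}s -- likely hung")
    return result["value"]
```

When the TimeoutError fires, that is the moment to capture diagnostics (kernel messages, driver state) before restarting the instance.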
Expected Behavior
Running CUDA code in WSL2 should be stable.
Actual Behavior
Running CUDA code in WSL2 results in hang of the CUDA application.
Diagnostic Logs
I have the issue myself, and noticed others have faced the same issue recently, as evidenced by the following thread on the NVIDIA forums:
https://forums.developer.nvidia.com/t/training-wsl-2-cuda-hangs-over-several-training-steps/176225/6
Windows Build Number
Edition Windows 11 Pro for Workstations Insider Preview
Version Dev
Installed on 25.9.2021.
OS build 22463.1000
Experience Windows Feature Experience Pack 1000.22463.1000.0
Kernel Version
5.10.43.3-microsoft-standard-WSL2
Distro Version
Ubuntu 20.04
Repro Steps
I have the same problem on Windows 11. It appeared after the second-to-last update; I hoped the latest update would fix it, but it still does not work. Training just freezes: GPU usage drops to 0% while VRAM stays at ~90%, RAM usage stays high, and CPU usage drops to roughly 15%.
The freeze happens randomly, not only while training is running.
I tried cuDNN on CPU only and the same thing happens.
When it freezes, the resources are not freed until a PC restart or wsl --shutdown. I tried:
- killing the processes with $ kill -9 -1 (not working)
- closing the terminal (not working)
- opening a new terminal and killing the processes (not working)
- wsl --shutdown (works)
After wsl --shutdown I can start the instance again, but it freezes again within ~1-2 hours of use, sometimes sooner.
Could you try updating to the latest kernel version and then see if you still see this issue? We believe that 5.10.60.1 has a fix that might resolve this.
Please run wsl --update to update, and then verify your kernel version by running uname -a inside a Linux instance.
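For reference, the update-and-verify steps look like this (a sketch: the wsl commands run from Windows, the version check inside the distro):

```shell
# From Windows (PowerShell or cmd):
#   wsl --update
#   wsl --shutdown
# Then, inside the Linux instance, confirm the kernel picked up the fix:
uname -r    # expect 5.10.60.1 or newer
```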
Though this is difficult to be 100% sure in my case (given long MTBF), it does appear updating the kernel fixed this issue for me.
Thanks for the hint :)
Great! Well it seems like this is the likely fix here. I'll close this issue, and we can reopen it if this problem comes up again. Thank you for filing this!
I'm experiencing the hanging issue described above as well, a year later. CUDA 11.7, WSL2, Ubuntu 20.04. I tried wsl --update. Small models are fine, but for larger ones I'm guessing it runs out of memory in an ungraceful way. The same larger model works fine on a Linux box with the same NVIDIA card.
The update fixed the same issue for me, thank you very much. However, I'm worried it will happen again when training larger models. This page lists some limitations CUDA has with WSL when training models:
https://docs.nvidia.com/cuda/wsl-user-guide/index.html
Same here: Win11 + WSL2 Ubuntu, training ResNet-18 on ImageNet, and it randomly freezes and cannot accept new inputs.
Same issue here.
WSL2 stops responding after 40 minutes of training. Windows never went to sleep.
Not even sure how to debug it.
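As a starting point for debugging, collect kernel and driver state right after the freeze, from a second terminal (or via wsl.exe from PowerShell, if the distro still responds). A hedged checklist; note that nvidia-smi may itself hang when the GPU stack is wedged:

```shell
uname -r                      # kernel version (the earlier fix landed in 5.10.60.1)
dmesg | tail -n 50            # recent kernel messages -- look for dxg/GPU errors
command -v nvidia-smi >/dev/null && nvidia-smi || true   # driver's view of the GPU, if available
```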