Execution is stuck until termination when running on CUDA in WSL2
Reproducing the behavior
The issue
Hi,
I am having issues running code with bend run-cu on CUDA inside WSL2. There are no errors, but the code is not executing either. Execution is frozen, similar to while (true) {}.
Code executes without any issues when using bend run or bend run-c. Compiling with bend gen-cu and nvcc has the same result. I've tried both CUDA 12.4 and 12.5, and the result is the same.
What I attempted
I tried different code examples from the repo, but the result is always the same. Since bend allows compiling code to CUDA with gen-cu, I tried to find what is broken inside the generated file (assuming bend run-cu uses the same code). The issue happens inside the gnet_normalize function, where the code can never exit the for loop. The break is never reached (rlen always has the same value).
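For readers who haven't dumped the generated file: the failing shape is roughly the following. This is a hand-written paraphrase, not the actual gen-cu output; only the rlen name is taken from the generated code, and the helper names are illustrative.

```cuda
// Paraphrased sketch of the loop in gnet_normalize that hangs.
// redex_count and normalize_step are hypothetical stand-ins for
// the generated code; only rlen appears in the real output.
for (;;) {
  u32 rlen = redex_count(net);  // stays at the same nonzero value
  if (rlen == 0) break;         // so this break is never taken
  normalize_step(net);          // makes no observable progress
}
```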
CUDA verification
Just to rule CUDA out, I have successfully installed CUDA and can confirm it is recognised by compiling and running this code.
cuda-test.cu:
#include <cuda.h>
#include <stdio.h>

int main(int argc, char** argv) {
  int driver_version = 0, runtime_version = 0;
  cudaDriverGetVersion(&driver_version);
  cudaRuntimeGetVersion(&runtime_version);
  printf(
    "Driver Version: %d\nRuntime Version: %d\n",
    driver_version,
    runtime_version
  );
  return 0;
}
output
~> nvcc cuda-test.cu -o cuda-test && ./cuda-test
Driver Version: 10010
Runtime Version: 12040
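Worth noting: version queries never launch any work on the GPU, so they can succeed even when kernel launches hang. A slightly stronger smoke test (my own sketch, not from the Bend repo) actually launches a trivial kernel and synchronizes, which is where a driver or WSL2 problem would typically surface:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: each thread writes its own global index.
__global__ void iota(int *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = i;
}

int main() {
  const int n = 256;
  int h[n];
  int *d = nullptr;
  if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) {
    printf("cudaMalloc failed\n");
    return 1;
  }
  iota<<<1, n>>>(d, n);
  // cudaDeviceSynchronize surfaces both launch and execution errors,
  // and is where a hang from a broken driver setup would show up.
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    printf("kernel failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
  printf("last element: %d\n", h[n - 1]);
  cudaFree(d);
  return 0;
}
```

If this also hangs at cudaDeviceSynchronize, the problem is below Bend entirely; if it passes, the issue is specific to the generated HVM kernels.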
nvidia-smi
Calling from WSL:
> nvidia-smi.exe
Sun Jun 2 16:32:08 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 552.44 Driver Version: 552.44 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3070 WDDM | 00000000:08:00.0 On | N/A |
| 49% 56C P3 41W / 220W | 6496MiB / 8192MiB | 36% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3832 C+G ...oogle\Chrome\Application\chrome.exe N/A |
| 0 N/A N/A 8292 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 21004 C+G ...\cef\cef.win7x64\steamwebhelper.exe N/A |
| 0 N/A N/A 21416 C+G ...wekyb3d8bbwe\XboxGameBarWidgets.exe N/A |
| 0 N/A N/A 27344 C+G ...ience\NVIDIA GeForce Experience.exe N/A |
....
+-----------------------------------------------------------------------------------------+
System Settings
Example:
- HVM: 2.0.18
- Bend: 0.2.27
- OS: Ubuntu 20.04.6 LTS
- WSL: 2.1.5.0
- CPU: AMD Ryzen 9 5900X
- GPU: RTX 3070
- Cuda Version: 12.5, V12.5.40
Additional context
No response
Is this for any program you try to run?
Also, what compiler version is nvcc using? By default it should be g++, and you can check its version with g++ --version.
Yes, all examples from the examples folder have the same behaviour. For example, I compiled fib and it fails to break out of the same loop, because rlen always has the same value (which is different from run to run, but never zero).
nvcc and g++
~ > nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
~ > g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I would like to also add that I have this same issue on my laptop running Ubuntu Linux. I have tried the sorter.bend script from the README on three machines now, and I have gotten it to run on the other two (one has a laptop 3080 GPU and the other runs dual RTX A4000s); however, it won't run on my laptop with a Quadro P600 (CUDA version 12.2). It does the same thing @wtfil is describing, where it simply hangs.
Looking at the processor usage, it seems as though it might be having an issue allocating GPU memory (?). I can see it running at 100% on a single CPU core, but it never loads either the RAM or VRAM, and there is no processing being run on the GPU either. Here are some specs from the machine in question, in case this is a bug to be fixed in the future. (Already love this programming language, BTW; really hoping I can start switching over to it at work when more support is added.)
Info:
HVM: 2.0.19
Bend: 0.2.27
OS: Ubuntu 22.04.4 LTS
Kernel: 6.5.0-35
CPU: Intel i5-9300H
GPU: Nvidia Quadro P600 Mobile
g++: 11.4.0
nvcc: 12.2
nvidia driver: 535.171.04
I'm getting the same issue with:
HVM: 2.0.19
Bend: 0.2.33
Ubuntu 22.04.4 LTS x86_64
Kernel: 6.5.0-35
CPU: AMD Ryzen 7 5800H
GPU: NVIDIA GeForce RTX 3070 Mobile
g++: 11.4.0
nvcc: 12.5
nvidia driver: 550.67
Same issue here on Ubuntu (not WSL). I tried debugging this issue in the official Discord server here.
As @keaneflynn observed, it doesn't allocate VRAM correctly, and it hangs for a long time until it crashes with the following message:
Failed to launch kernels. Error code: an illegal memory access was encountered.
Errors:
Failed to parse result from HVM.
I waited 45 minutes while another discord server member only waited 30 minutes with his example in a virtual machine. I don't think the time is as important since the program crashed shortly after I launched another application (steam in my case).
A quick summary of the debugging we did in Discord: downgrading from CUDA 12.5 to 12.4 doesn't help. Examples unrelated to Bend compiled by nvcc work just fine, so it's not a (simple) issue with CUDA. Using run and run-c works for the examples provided by Bend, while using run-cu directly, or gen-cu then compiling with nvcc, hangs. A different member of the server mentioned it could be an issue with the smaller L1 cache size of my GTX 1660, but that doesn't seem to be the case, since multiple 3070s listed here also don't work.
My Specs
HVM: 2.0.18
Bend: 0.2.27
Ubuntu 22.04.4 LTS x86_64
Kernel: 6.5.0-35
CPU: AMD Ryzen 7 5800X 8-Core Processor
GPU: NVIDIA GeForce GTX 1660 (ti I think?)
g++: 11.4.0
nvcc: V12.4.131
nvidia driver: 555.42.02
I'm getting the same issue with: HVM: 2.0.19 Bend: 0.2.33 Ubuntu 22.04.4 LTS x86_64 Kernel: 6.5.0-35 CPU: AMD Ryzen 7 5800H GPU: NVIDIA GeForce RTX 3070 Mobile g++: 11.4.0 nvcc: 12.5 nvidia driver: 550.67
This is fascinating, as I have a laptop with nearly identical specs that does manage to use run-cu properly. I have the 5800H on Ubuntu 22.04, except it has a 3080 Mobile. I am pretty sure the caches on these two chips are identical, per @nmay231's inquiry. Hoping to see some bug fixes here soon!
Same issue here
GTX 1050 Ti (Nvidia Studio driver 551.23)
Intel i5-6400
Windows 10 22H2 (OS Build 19045.4474) --> WSL 2.0.14.0 (Kernel version 5.15.133.1-1) --> Ubuntu 24.04 LTS
g++ 13.2.0
nvcc cuda_12.5.r12.5/compiler.34177558_0
hvm 2.0.18
bend 0.2.27
Hey, I am also facing the same issue. All the dedicated GPU memory gets full, and the process is stuck.
bend 0.2.33
hvm 2.0.19
nvcc V12.5.40
wsl-ubuntu 22.04.3 LTS
gpu NVIDIA GeForce RTX 2060
@Imran-S-heikh I'm not certain we have the same issue. My video RAM for the bend/hvm process never went above 100 MiB.
Perhaps we should all make sure we are experiencing the same thing. Here's a very simple program that shouldn't need much memory. It hangs with bend run-cu. I checked my GPU memory usage with nvtop (98 MiB VRAM, 0% GPU, 102 MiB RAM, 100% on a CPU core).
def main:
  return (1 + 1)
Also, I forgot to mention I did try running run-cu --verbose, and this is the output for the program above.
% bend run-cu -v simple.bend
(Map/empty) = Map/Leaf
(Map/get map key) = match map = map { Map/Leaf: (*, map); Map/Node: switch _ = (== 0 key) { 0: switch _ = (% key 2) { 0: let (got, rest) = (Map/get map.left (/ key 2)); (got, (Map/Node map.value rest map.right)); _ _-1: let (got, rest) = (Map/get map.right (/ key 2)); (got, (Map/Node map.value map.left rest)); }; _ _-1: (map.value, map); }; }
(Map/set map key value) = match map = map { Map/Node: switch _ = (== 0 key) { 0: switch _ = (% key 2) { 0: (Map/Node map.value (Map/set map.left (/ key 2) value) map.right); _ _-1: (Map/Node map.value map.left (Map/set map.right (/ key 2) value)); }; _ _-1: (Map/Node value map.left map.right); }; Map/Leaf: switch _ = (== 0 key) { 0: switch _ = (% key 2) { 0: (Map/Node * (Map/set Map/Leaf (/ key 2) value) Map/Leaf); _ _-1: (Map/Node * Map/Leaf (Map/set Map/Leaf (/ key 2) value)); }; _ _-1: (Map/Node value Map/Leaf Map/Leaf); }; }
(IO/MAGIC) = (13683217, 16719857)
(IO/wrap x) = (IO/Done IO/MAGIC x)
(IO/bind a b) = match a = a { IO/Done: (b a.expr); IO/Call: (IO/Call IO/MAGIC a.func a.argm λx (IO/bind (a.cont x) b)); }
(call func argm) = (IO/Call IO/MAGIC func argm λx (IO/Done IO/MAGIC x))
(print text) = (IO/Call IO/MAGIC "PUT_TEXT" text λx (IO/Done IO/MAGIC x))
(get_time) = (IO/Call IO/MAGIC "GET_TIME" * λx (IO/Done IO/MAGIC x))
(sleep hi_lo) = (IO/Call IO/MAGIC "PUT_TIME" hi_lo λx (IO/Done IO/MAGIC x))
(main) = (+ 1 1)
@nmay231 I get the exact same output. I also have the same results- it uses all my CPU but no GPU
Can someone with the issue try running CUDA version 11.x? (sudo apt install nvidia-cuda-toolkit will get version 11.)
With my 980 Ti, on CUDA 11, I would instantly get the Failed to launch kernels error, whereas on CUDA 12.5, the program just hangs.
This issue is most likely the same as #364, where the GPU memory architecture is the cause, because bend was only tested on an RTX 4090.
@TimotejFasiang For me, sudo apt install nvidia-cuda-toolkit installed version 12.0.140~12.0.1-4build4, and I couldn't find any full 11.x version numbers to specify (e.g. E: Unable to locate package nvidia-cuda-toolkit@11.8.0-1). I don't use Linux very often, so I might be doing something silly, though.
I have a little update on the issue; hope this will help to understand it better.
Initially I faced this issue when using ubuntu@20.04 from WSL.
Today I tried my second image, ubuntu@22.04, on the same machine, and it worked! This image was almost fresh, so I ran the installation steps from the README, and bend run-cu worked right away.
Here are the versions of the relevant tools for both images:
name | ubuntu@20.04 | ubuntu@22.04 |
---|---|---|
os | Ubuntu 20.04.6 LTS | Ubuntu 22.04.3 LTS |
gcc | gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 | gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 |
nvcc | release 12.5, V12.5.40 | release 12.5, V12.5.40 |
hvm | 2.0.19 | 2.0.19 |
bend | 0.2.33 | 0.2.33 |
cargo | 1.80.0-nightly | 1.78.0 |
The only major difference between the two is gcc.
I also noticed that nvidia-smi (not nvidia-smi.exe) is failing on ubuntu@20.04 with the following error:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
nvidia-smi on ubuntu@22.04 works fine:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.85 Driver Version: 555.85 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3070 WDDM | 00000000:08:00.0 On | N/A |
| 0% 46C P8 22W / 220W | 2652MiB / 8192MiB | 36% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Ubuntu@22.04
~/www/bend-examples > time bend run bitonic_sort.bend
Result: 16646144
real 0m35.908s
user 0m33.658s
sys 0m2.250s
~/www/bend-examples > time bend run-c bitonic_sort.bend
Result: 16646144
real 0m10.611s
user 1m3.683s
sys 1m44.188s
~/www/bend-examples > time bend run-cu bitonic_sort.bend
Result: 16646144
real 0m2.410s
user 0m1.996s
sys 0m0.080s
I am noticing something similar. When running on WSL2, I noticed that the default parallel_hello_world would not finish (before I got bored of waiting and figured something was wrong). I have a standard RTX 3070, BTW. I rewrote things to play around, and found that it ran plenty fast when running gen(13), just not gen(16). My guess is some issue with using too much GPU memory and getting stuck, as some have mentioned here.
Moreover, when running gen(16), my GPU continued to be fully loaded after Ctrl+C on the command line and attempting to terminate the process. Is this related to having no IO?
Info:
HVM: 2.0.19
Bend: 0.2.27
OS: wsl-ubuntu 22.04.3 LTS
Kernel: 6.5.0-35
CPU: RYZEN 3600X
GPU: RTX 3070
g++: 11.4.0
@evbxll Sounds like a different issue. The issue I and others are describing happens even for a very simple program like the one below
def main:
return (1 + 1)
Also, the issue we're describing results in all our CPU being used but none of our GPU, and Ctrl+C does stop it from using all the CPU for me.
Eh, I feel like the original issue these comments are under is similar to mine: execution stuck when running CUDA on WSL2.
Yeah, one also got this issue, so one followed a few steps:
- The #364 solution (since one had a 1080 Ti), but modified for the current latest version, hvm v2.0.19 (instead of the v2.0.13 @hubble14567 originally mentioned).
- Followed @wtfil's suggestion to install Ubuntu 22.04 (and 24.04 LTS just in case, but both are installed on separate disks, because you can't install both in the same folder on the same disk; otherwise, they'll share a single virtual disk, and that's a huge issue); in the end, both work.
- Updated the driver from 536.xx (forgot the exact version) to 555.99 (the current latest version). If one tries to run nvidia-smi now, one should get a segmentation fault. Restart your computer, and the error should be gone. Now run bend run-cu simple.bend -s and it'll have no problem.
The fix is quite stupid, because it seems to take so long to move data from CPU to GPU that running the simple.bend suggested above took 5-6 seconds.
wabinab@...: $ bend run-cu simple.bend -s
Result: 2
- ITRS: 2
- LEAK: 0
- TIME: 5.83s
- MIPS: 0.00
Edit: Anyway, one tried running it a second time, and the time seems to decrease, although for a program this simple it isn't worth it.
wabinab@...: $ bend run-cu simple.bend -s
Result: 2
- ITRS: 2
- LEAK: 0
- TIME: 0.29s
- MIPS: 0.00
Similarly, if we try to run parallel_sum.bend as a hello world, the results aren't enticing with a 1080 Ti:
bend run-c parallel_sum.bend -s
Result: 5908768
- ITRS: 45999971
- TIME: 0.69s
- MIPS: 66.89
bend run-cu parallel_sum.bend -s
Result: 5908768
- ITRS: 45983587
- LEAK: 37606783
- TIME: 0.83s
- MIPS: 55.62
There's a lot of LEAK, and calculations are slower compared to a 4-core CPU (i5-7400).
Is there a fix for this?
Running into the same issue.