HigherOrderCO/Bend

Execution is stuck until termination when running on CUDA in WSL2

Opened this issue · 18 comments

wtfil commented

Reproducing the behavior

The issue

Hi,
I am having issues running code with bend run-cu on CUDA inside WSL2. There are no errors, but the code is not executing either. Execution is frozen, similarly to while (true) {}.
Code executes without any issues when using bend run or bend run-c
Compiling with bend gen-cu and nvcc has the same result.
I've tried both CUDA 12.4 and 12.5 and the result is the same.

What I attempted

I tried different code examples from the repo, but the result is always the same. Since bend can compile code to CUDA with gen-cu, I tried to find what is broken inside the generated file (assuming bend run-cu uses the same code). The hang happens inside the gnet_normalize function, where the code can never exit the for loop. The break is never reached (rlen always keeps the same value).
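A minimal sketch of the loop shape being described (this is illustrative only, not the actual generated code): the host repeatedly runs a reduction step on the GPU and reads back a remaining-work counter (rlen in the generated file), exiting only when it reaches zero. In the reported bug that counter never changes, so the break is never taken. The step kernel below is a stand-in that simply decrements the counter.

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for one round of reduction: it just decrements the counter.
__global__ void step(unsigned int* rlen) {
  if (*rlen > 0) (*rlen)--;
}

int main() {
  unsigned int* rlen;
  cudaMallocManaged(&rlen, sizeof(unsigned int));
  *rlen = 8;  // pretend there are 8 redexes left

  while (true) {
    step<<<1, 1>>>(rlen);
    cudaDeviceSynchronize();     // wait for the step; makes *rlen visible on the host
    printf("rlen = %u\n", *rlen);
    if (*rlen == 0) break;       // in the reported bug, this break is never hit
  }

  cudaFree(rlen);
  return 0;
}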

CUDA verification

Just to rule CUDA out, I have successfully installed CUDA and can confirm it is recognised by compiling and running this code.

cuda-test.cu:

#include <cuda.h>
#include <stdio.h>

int main(int argc, char** argv) {
  int driver_version = 0, runtime_version = 0;

  cudaDriverGetVersion(&driver_version);
  cudaRuntimeGetVersion(&runtime_version);

  printf(
    "Driver Version: %d\nRuntime Version: %d\n",
    driver_version,
    runtime_version
  );

  return 0;
}

output

~> nvcc cuda-test.cu -o cuda-test && ./cuda-test
Driver Version: 10010
Runtime Version: 12040
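Not from the original report, just a suggested follow-up: the version query above only exercises the runtime API, so a small check that actually launches a kernel may be more telling. If this prints an error, it would suggest the problem lies in the WSL2/driver setup rather than in bend itself.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(int* out) { *out = 42; }

int main() {
  int* out;
  cudaMallocManaged(&out, sizeof(int));
  *out = 0;

  touch<<<1, 1>>>(out);
  cudaError_t launch_err = cudaGetLastError();       // launch-time errors
  cudaError_t sync_err   = cudaDeviceSynchronize();  // execution-time errors

  printf("launch: %s, sync: %s, out = %d\n",
         cudaGetErrorString(launch_err),
         cudaGetErrorString(sync_err),
         *out);

  cudaFree(out);
  return 0;
}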

nvidia-smi

Calling from wsl

>  nvidia-smi.exe
Sun Jun  2 16:32:08 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 552.44                 Driver Version: 552.44         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070      WDDM  |   00000000:08:00.0  On |                  N/A |
| 49%   56C    P3             41W /  220W |    6496MiB /   8192MiB |     36%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3832    C+G   ...oogle\Chrome\Application\chrome.exe      N/A      |
|    0   N/A  N/A      8292    C+G   C:\Windows\explorer.exe                     N/A      |
|    0   N/A  N/A     21004    C+G   ...\cef\cef.win7x64\steamwebhelper.exe      N/A      |
|    0   N/A  N/A     21416    C+G   ...wekyb3d8bbwe\XboxGameBarWidgets.exe      N/A      |
|    0   N/A  N/A     27344    C+G   ...ience\NVIDIA GeForce Experience.exe      N/A      |
....
+-----------------------------------------------------------------------------------------+

System Settings

  • HVM: 2.0.18
  • Bend: 0.2.27
  • OS: Ubuntu 20.04.6 LTS
  • WSL: 2.1.5.0
  • CPU: AMD Ryzen 9 5900X
  • GPU: RTX 3070
  • Cuda Version: 12.5, V12.5.40

Additional context

No response

Is this for any program you try to run?
Also, what compiler version is nvcc using? By default it should be g++ and you can check its version with g++ --version

wtfil commented

Yes, all examples from the examples folder have the same behaviour.
For example, I compiled fib and it fails to break out of the same loop, because rlen always has the same value (which is different from run to run, but never zero).

nvcc and g++

~ > nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
~ > g++ --version
g++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I would like to also add that I have this same issue on my laptop running Ubuntu Linux. I have tried the sorter.bend script from the README on three machines now and I have gotten it to run on the other two (one has a laptop 3080 GPU and the other runs dual RTX A4000s), however it won't run on my laptop with a Quadro P600 (CUDA version 12.2). It does the same thing @wtfil is describing, where it simply hangs.

Looking at the processor usage, it seems as though it might be having an issue allocating GPU memory(?). I can see it running at 100% on a single CPU core, but it never loads either the RAM or the VRAM, and there is no processing being run on the GPU either. Here are some specs from the machine in question, in case this is a bug to be fixed in the future. (Already love this programming language BTW, really hoping I can start switching over to it at work when more support is added.)

Info:
HVM: 2.0.19
Bend: 0.2.27
OS: Ubuntu 22.04.4 LTS
Kernel: 6.5.0-35
CPU: Intel i5-9300H
GPU: Nvidia Quadro P600 Mobile
g++: 11.4.0
nvcc: 12.2
nvidia driver: 535.171.04

I'm getting the same issue with:
HVM: 2.0.19
Bend: 0.2.33
Ubuntu 22.04.4 LTS x86_64
Kernel: 6.5.0-35
CPU: AMD Ryzen 7 5800H
GPU: NVIDIA GeForce RTX 3070 Mobile
g++: 11.4.0
nvcc: 12.5
nvidia driver: 550.67

Same issue here on Ubuntu (not WSL). I tried debugging this issue in the official discord server here.

As @keaneflynn observed, it doesn't allocate VRAM correctly, and it hangs for a long time until it crashes with the following message:

Failed to launch kernels. Error code: an illegal memory access was encountered.
Errors:
Failed to parse result from HVM.

I waited 45 minutes, while another Discord server member only waited 30 minutes with his example in a virtual machine. I don't think the exact time matters much, since the program crashed shortly after I launched another application (Steam in my case).
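An illustrative aside, not from this thread: in CUDA, an illegal memory access inside a kernel is typically only reported at a later synchronizing call, not at launch time, which is consistent with the "hang for a while, then crash" behaviour described above. A tiny demo:

#include <cstdio>
#include <cuda_runtime.h>

// Deliberately writes through a null pointer to trigger an illegal access.
__global__ void bad_write(int* p) { *p = 1; }

int main() {
  bad_write<<<1, 1>>>(nullptr);

  // The launch itself usually reports success...
  printf("after launch: %s\n", cudaGetErrorString(cudaGetLastError()));

  // ...and the illegal access only surfaces once we synchronize.
  printf("after sync:   %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
  return 0;
}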

A quick summary of the debugging we did in Discord: downgrading from CUDA 12.5 to 12.4 doesn't help. Examples unrelated to bend compiled by nvcc work just fine, so it's not a (simple) issue with CUDA. Using run and run-c works for the examples provided by bend, while using run-cu directly, or gen-cu and then compiling with nvcc, hangs. A different member of the server mentioned it could be an issue with the smaller L1 cache size of my GTX 1660, but that doesn't seem to be the case, given that multiple 3070s listed here also don't work.

My Specs

HVM: 2.0.18
Bend: 0.2.27
Ubuntu 22.04.4 LTS x86_64
Kernel: 6.5.0-35
CPU: AMD Ryzen 7 5800X 8-Core Processor
GPU: NVIDIA GeForce GTX 1660 (ti I think?)
g++: 11.4.0
nvcc: V12.4.131
nvidia driver: 555.42.02

I'm getting the same issue with: HVM: 2.0.19 Bend: 0.2.33 Ubuntu 22.04.4 LTS x86_64 Kernel: 6.5.0-35 CPU: AMD Ryzen 7 5800H GPU: NVIDIA GeForce RTX 3070 Mobile g++: 11.4.0 nvcc: 12.5 nvidia driver: 550.67

This is fascinating, as I have a laptop with nearly identical specs that does manage to use run-cu properly. I have the 5800H on Ubuntu 22.04, except it has a 3080 Mobile. I am pretty sure the caches on these two chips are identical, per @nmay231's inquiry. Hoping to see some bug fixes here soon!

Same issue here

GTX 1050 Ti (Nvidia Studio driver 551.23)
Intel i5-6400
Windows 10 22H2 (OS Build 19045.4474) --> WSL 2.0.14.0 (Kernel version 5.15.133.1-1) --> Ubuntu 24.04 LTS
g++ 13.2.0
nvcc cuda_12.5.r12.5/compiler.34177558_0
hvm 2.0.18
bend 0.2.27

Hey, I am also facing the same issue. All the dedicated GPU memory fills up, and the process is stuck.

bend 0.2.33
hvm 2.0.19
nvcc V12.5.40
wsl-ubuntu 22.04.3 LTS
gpu NVIDIA GeForce RTX 2060

@Imran-S-heikh I'm not certain if we have the same issue. My Video RAM for the bend/hvm process never went above 100 MiB.

Perhaps we should all make sure we are experiencing the same thing. Here's a very simple program that shouldn't need much memory. It hangs with bend run-cu. I checked my GPU memory usage with nvtop (98 MiB VRAM, 0% GPU, 102 MiB RAM, 100% on a CPU core).

def main:
  return (1 + 1)

Also, I forgot to mention that I did try running run-cu --verbose; this is the output for the program above.

bend run-cu --verbose simple.bend

% bend run-cu -v simple.bend
(Map/empty) = Map/Leaf

(Map/get map key) = match map = map { Map/Leaf: (*, map); Map/Node: switch _ = (== 0 key) { 0: switch _ = (% key 2) { 0: let (got, rest) = (Map/get map.left (/ key 2)); (got, (Map/Node map.value rest map.right)); _ _-1: let (got, rest) = (Map/get map.right (/ key 2)); (got, (Map/Node map.value map.left rest)); }; _ _-1: (map.value, map); }; }

(Map/set map key value) = match map = map { Map/Node: switch _ = (== 0 key) { 0: switch _ = (% key 2) { 0: (Map/Node map.value (Map/set map.left (/ key 2) value) map.right); _ _-1: (Map/Node map.value map.left (Map/set map.right (/ key 2) value)); }; _ _-1: (Map/Node value map.left map.right); }; Map/Leaf: switch _ = (== 0 key) { 0: switch _ = (% key 2) { 0: (Map/Node * (Map/set Map/Leaf (/ key 2) value) Map/Leaf); _ _-1: (Map/Node * Map/Leaf (Map/set Map/Leaf (/ key 2) value)); }; _ _-1: (Map/Node value Map/Leaf Map/Leaf); }; }

(IO/MAGIC) = (13683217, 16719857)

(IO/wrap x) = (IO/Done IO/MAGIC x)

(IO/bind a b) = match a = a { IO/Done: (b a.expr); IO/Call: (IO/Call IO/MAGIC a.func a.argm λx (IO/bind (a.cont x) b)); }

(call func argm) = (IO/Call IO/MAGIC func argm λx (IO/Done IO/MAGIC x))

(print text) = (IO/Call IO/MAGIC "PUT_TEXT" text λx (IO/Done IO/MAGIC x))

(get_time) = (IO/Call IO/MAGIC "GET_TIME" * λx (IO/Done IO/MAGIC x))

(sleep hi_lo) = (IO/Call IO/MAGIC "PUT_TIME" hi_lo λx (IO/Done IO/MAGIC x))

(main) = (+ 1 1)

@nmay231 I get the exact same output. I also have the same results: it uses all my CPU but no GPU.

Can someone with the issue try running CUDA version 11.x? (sudo apt install nvidia-cuda-toolkit will get version 11.)
With my 980 Ti, on CUDA 11, I would instantly get the Failed to launch kernels error, whereas on CUDA 12.5 the program just hangs.
This issue is most likely the same as #364, where the GPU memory architecture is the cause, since bend was only tested on an RTX 4090.
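As a hedged suggestion (not something done in the thread): anyone hitting this could compare their GPU's limits against the RTX 4090 that bend was tested on, since #364 points at memory architecture differences. A small query program:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  printf("name:                 %s\n", prop.name);
  printf("compute capability:   %d.%d\n", prop.major, prop.minor);
  printf("multiprocessors:      %d\n", prop.multiProcessorCount);
  printf("shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
  printf("shared mem per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
  printf("global memory:        %zu MiB\n", prop.totalGlobalMem >> 20);
  return 0;
}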

@TimotejFasiang For me, sudo apt install nvidia-cuda-toolkit installed version 12.0.140~12.0.1-4build4, and I couldn't find any full 11.x version numbers to specify (e.g. E: Unable to locate package nvidia-cuda-toolkit@11.8.0-1). I don't use Linux very often, so I might be doing something silly, though.

wtfil commented

I have a little update on the issue; hope this will help to understand it better.
Initially I faced this issue when using ubuntu@20.04 from WSL.
Today I tried my second image, ubuntu@22.04, on the same machine and it worked! This image was almost fresh, so I ran the installation steps from the README and bend run-cu worked right away.

Here are the versions of the relevant tools for both images

name  | ubuntu@20.04                              | ubuntu@22.04
os    | Ubuntu 20.04.6 LTS                        | Ubuntu 22.04.3 LTS
gcc   | gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 | gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
nvcc  | release 12.5, V12.5.40                    | release 12.5, V12.5.40
hvm   | 2.0.19                                    | 2.0.19
bend  | 0.2.33                                    | 0.2.33
cargo | 1.80.0-nightly                            | 1.78.0

The only major difference between the two is gcc.

I also noticed that nvidia-smi (not nvidia-smi.exe) is failing on ubuntu@20.04 with the following error:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

nvidia-smi on ubuntu@22.04 works fine

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.85                 Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070      WDDM  |   00000000:08:00.0  On |                  N/A |
|  0%   46C    P8             22W /  220W |    2652MiB /   8192MiB |     36%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Ubuntu@22.04

~/www/bend-examples > time bend run bitonic_sort.bend
Result: 16646144

real    0m35.908s
user    0m33.658s
sys     0m2.250s
~/www/bend-examples > time bend run-c bitonic_sort.bend
Result: 16646144

real    0m10.611s
user    1m3.683s
sys     1m44.188s
~/www/bend-examples > time bend run-cu bitonic_sort.bend
Result: 16646144

real    0m2.410s
user    0m1.996s
sys     0m0.080s

I am noticing something similar. When running on WSL2, I noticed that the default parallel_hello_world would not finish (before I got bored of waiting and figured something was wrong). I have a standard RTX 3070, btw. I rewrote things to play around and found that it ran plenty fast when running gen(13), just not gen(16). My guess is some issue with using too much GPU memory and getting stuck, as some have mentioned here.

Moreover, when running gen(16), my GPU continued to run at full load after I pressed CTRL+C on the command line and attempted to terminate the process. Is this related to having no IO?

Info:
HVM: 2.0.19
Bend: 0.2.27
OS: wsl-ubuntu 22.04.3 LTS
Kernel: 6.5.0-35
CPU: RYZEN 3600X
GPU: RTX 3070
g++: 11.4.0

@evbxll Sounds like a different issue. The issue I and others are describing happens even for a very simple program like the one below

def main:
  return (1 + 1)

Also, the issue we're describing results in all our CPU being used, but none of our GPU, and CTRL+C does stop it from using all the CPU for me

Eh, I feel like the original issue these comments are under is similar to mine: execution stuck when running CUDA on WSL2.

Yeah, one also gets this issue, so one followed a few steps:

  1. Apply the #364 solution (since one has a 1080 Ti), but modified for the current latest version, hvm v2.0.19 (instead of the v2.0.13 @hubble14567 originally mentioned).
  2. Follow @wtfil's suggestion to install Ubuntu 22.04 (and 24.04 LTS just in case, but both are installed on separate disks, because you can't install both in the same folder on the same disk; otherwise they'll share a single virtual disk and that's a huge issue); in the end, both work.
  3. Update the driver from 536.xx (forgot the exact version) to 555.99 (the current latest version). If one tries to run nvidia-smi now, one should get a segmentation fault. Restart the computer and the error should be gone. Now run bend run-cu simple.bend -s and it'll have no problem.

The fix is quite unsatisfying, because it seems to take so long to move data from the CPU to the GPU that running the simple.bend suggested above took 5-6 seconds.

wabinab@...: $ bend run-cu simple.bend -s
Result: 2
- ITRS: 2
- LEAK: 0
- TIME: 5.83s
- MIPS: 0.00

Edit: Anyway, one tried running it a second time and the time seems to decrease, although for something this simple it isn't worth it.

wabinab@...: $ bend run-cu simple.bend -s
Result: 2
- ITRS: 2
- LEAK: 0
- TIME: 0.29s
- MIPS: 0.00

Similarly, if we try running parallel_sum.bend as a hello world, the results aren't enticing with a 1080 Ti:

bend run-c parallel_sum.bend -s
Result: 5908768
- ITRS: 45999971
- TIME: 0.69s
- MIPS: 66.89

bend run-cu parallel_sum.bend -s
Result: 5908768
- ITRS: 45983587
- LEAK: 37606783
- TIME: 0.83s
- MIPS: 55.62

There's a lot of LEAK, and the calculations are slower compared to a 4-core CPU (i5-7400).

Is there a fix for this?
I'm running into the same issue.