OpenCL failed to compile / VectorVisor stalls
Hello, I am looking into this project to learn what you have done in this work.
I tried to run the scrypt benchmark in the repository with the following commands (see build.sh in the attachment):
cd benchmark/scrypt
cargo build --release
cargo run --release -- --ip=0.0.0.0 --heap=3145728 --stack=262144 --hcallsize=131072 --partition=false --serverless=true --vmcount=4096 --vmgroups=1 --interleave=1 --pinput=true --fastreply=true --lgroup=64 --nvidia=true --input=benchmarks/scrypt/target/wasm32-wasi/release/scrypt.wasm
The generated OpenCL code failed to compile, with messages like this (the full log is in compilation_failed.txt):
<kernel>:157873:37: error: redefinition of '__func_func_getenv'
<kernel>:168648:37: error: redefinition of '__func_func__ZN3std5alloc8rust_oom17hb466c6b0b424784eE'
<kernel>:176935:37: error: redefinition of '__func_func__ZN3std3sys4wasi4once4Once4call17h43619e2953d53b25E'
<kernel>:179098:37: error: redefinition of '__func_func__ZN4core6option13expect_failed17h8f72e66e0b3163c7E'
<kernel>:181066:37: error: redefinition of '__func_func__ZN72_$LT$sha2sha256Sha256$u20$as$u20$digestfixedFixedOutputDirty$GT$19finalize_into_dirty17h563df1210a5950c5E'
<kernel>:181743:37: error: redefinition of '__func_func__ZN4sha26sha2569Engine2566update17hcb501717ee07d7caE'
<kernel>:234127:37: error: redefinition of '__func_func__ZN3std3sys4wasi4once4Once4call17hee18ac680eb799ccE'
<kernel>:238120:37: error: redefinition of '__func_func___main_void'
I think this is because some functions appear in more than one partition, and the following modification (cfg_optimizer.rs in the attachment) seems to solve the problem:
diff --git a/src/opencl_writer/cfg_optimizer.rs b/src/opencl_writer/cfg_optimizer.rs
index 9a5e6a5..2bb119c 100644
--- a/src/opencl_writer/cfg_optimizer.rs
+++ b/src/opencl_writer/cfg_optimizer.rs
@@ -254,6 +254,9 @@ pub fn form_partitions(
current_partition.insert(String::from(f_name.clone()));
+ let func_copies = include_limit.get(&f_name).cloned().unwrap_or(0);
+ include_limit.insert(f_name.clone(), func_copies + 1);
+
let (loop_called_fns, called_fns) = get_called_funcs(
writer_ctx,
&indirect_call_mapping_formatted,
I'm not sure if this is what the function is intended to do (it seems to support more than one copy of a function; how would the other components handle that?).
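To make the patch above concrete, here is a minimal, self-contained sketch of the counting pattern it introduces: each time a function is assigned to a partition, its copy count in `include_limit` is bumped so later passes can tell how many times it has been emitted. The helper name `record_copy` and the standalone `main` are hypothetical; only the `include_limit` map and the get/insert logic come from the patch.

```rust
use std::collections::HashMap;

// Hypothetical helper mirroring the patch: bump the per-function copy
// count in `include_limit` and return the new count.
fn record_copy(include_limit: &mut HashMap<String, u32>, f_name: &str) -> u32 {
    let func_copies = include_limit.get(f_name).cloned().unwrap_or(0);
    include_limit.insert(f_name.to_string(), func_copies + 1);
    func_copies + 1
}

fn main() {
    let mut include_limit = HashMap::new();
    // First placement of a function yields count 1, a duplicate yields 2.
    assert_eq!(record_copy(&mut include_limit, "func_getenv"), 1);
    assert_eq!(record_copy(&mut include_limit, "func_getenv"), 2);
    println!("copies of func_getenv: {}", include_limit["func_getenv"]);
}
```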
By the way, does the --partition parameter have something to do with this? Previously, running similar commands without disabling partitioning seemed to pass compilation, but resulted in a memory access violation.
With the patch, the VM started. Then I executed
go run run_scrypt.go 127.0.0.1 8000 1 1 300 256
to start the test, but nothing happened. The log is in the attachment (run.txt).
The line `Set entry point: ["func_strncmp"]` seems strange; I would expect something like __start or main.
With other configurations, I received illegal memory access or unsupported hypercall errors.
The attachment includes the scrypt.wasm generated by cargo.
Do you have some advice? Thank you.
Attach_20230925.tar.gz
When I run the following command to start VectorVisor, with the same testing command:
cargo run --release -- --ip=0.0.0.0 --heap=3145728 --stack=262144 --hcallsize=131072 --partition=false --serverless=true --vmcount=1 --vmgroups=1 --interleave=1 --pinput=true --fastreply=true --lgroup=1 --nvidia=true --input=benchmarks/scrypt/target/wasm32-wasi/release/scrypt.wasm
go run run_scrypt.go 127.0.0.1 8000 1 1 300 256
VectorVisor complains about CL_NV_INVALID_MEM_ACCESS. The full log is in the attachment:
invalid_memory.txt
Do you have any idea why this happens, or any advice on debugging the generated kernel?
Hi,
Thanks for taking the time to check out VectorVisor! Those are all great questions.
First, I should probably explain the "partitions" concept, as it wasn't included in the final paper (and was also excluded from the evaluation). Early on (before the partitioner existed), I ran into problems with register spilling and was trying to find ways to reduce its overhead for the comparatively large programs I was trying to run. The idea is that the overhead of calling into other kernels via the CPS transform was, in some cases, cheaper than the extra register spilling, and we did have positive results.
The solution I came up with was to partition the resulting OpenCL kernels by function size/register usage based on the control flow graph (which functions call which other functions, plus some other heuristics). The partitioner does actually work despite not being in the paper, but when the `--partition=false` flag is present, it needs to be paired with `--maxdup=0` to ensure that functions aren't duplicated (as they would be if partitioning were enabled). Alternatively, you can enable partitioning and experiment with the maxdup value. The default value is "1" (1 extra include per function), which is why this error is popping up.
I think you brought up a good point though: this false/maxdup mismatch could be checked in src/main and an error returned there (I was the only user for quite some time, so I never ran into this, haha).
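The check suggested above could be sketched roughly like this. This is hypothetical code, not VectorVisor's actual argument handling; the function name `validate_args` and its signature are invented for illustration.

```rust
// Hypothetical sketch of the suggested CLI sanity check: if partitioning
// is disabled, maxdup must be 0, or duplicated functions will all land in
// a single kernel and trigger "redefinition" errors at compile time.
fn validate_args(partition: bool, maxdup: u32) -> Result<(), String> {
    if !partition && maxdup != 0 {
        return Err(format!(
            "--partition=false requires --maxdup=0 (got --maxdup={})",
            maxdup
        ));
    }
    Ok(())
}

fn main() {
    assert!(validate_args(true, 1).is_ok());   // partitioning on: any maxdup
    assert!(validate_args(false, 0).is_ok());  // partitioning off, no dup
    assert!(validate_args(false, 1).is_err()); // the mismatch in this issue
    println!("argument validation checks passed");
}
```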
For the invalid memory access, I suspect it is a result of the lack of compiler optimizations applied to the WASM binary, which causes extra memory usage in the final program. It's also possible that commenting those lines out caused weird behavior/bugs elsewhere. The "Set entry point" debug line only prints the expected value when you are running 1 function per partition (an artifact from a debug configuration I used often). You can run with the `debugcallprint` flag enabled and that will correctly log the entry point in any configuration, although it will dump a large volume of data to stdout (all function calls, etc.).
If you are testing locally, you can use the `run_cached_bin.sh` script (poorly named, I admit) in the benchmarks dir, which will run all of the compiler optimizations we used in the final paper and invoke VectorVisor with the correct CLI arguments.
To run the scrypt benchmark locally, you would add this line to the script:
# format is:
# command, benchmark, heap, stack, hypercall buffer, vmcount, ignore last arg
comp "scrypt" "3145728" "131072" "131072" "VMCOUNT" "5120"
and replace VMCOUNT with a VM count value that fits your local GPU (ignore the last argument; the script was copy-pasted from the one I used in the final evaluation).
After the benchmark has run at least once, you can replace the `comp` command with `runbin` and VV should load much faster.
I just reran the benchmark on an RTX 2080 Ti + 2048 VMs and it works for me. Let me know if it still doesn't work after that.
- Sam
With the help of `run_cached_bin.sh`, I could run the scrypt benchmark and get some results.
# added to run_cached_bin.sh
comp "scrypt" "3145728" "131072" "131072" "64" "5120"
go run run_scrypt.go 127.0.0.1 8000 64 1 300
server is active... starting benchmark
Benchmark complete: 232249 requests completed
duration: 300.000000
Total RPS: 774.163333
On device execution time: 39445040.440148
Average request latency: 82649467.696640
queue submit time: 9007.574870
submit count: 1.000000
unique fns: 1.000000
Request Queue Time: 3900.823155
Device Time: 82170664.866686
overhead: 488422.273129
compile time: 326291776113.782898
And I realized that some parameters of the testing script are related to the VV settings.
Thanks for the explanation and suggestions.
No problem. I'll close this issue for now as it seems to be resolved.