enfiskutensykkel/ssd-gpu-dma

Issue when using the CUDA example/benchmark

ZaidQureshi opened this issue · 12 comments

I have been able to run nvm-latency-bench without a GPU. The output is as follows:

./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --blocks=1000  --queue="no=128,location=local" --bw

Resetting controller... DONE
Preparing queues... DONE
Preparing buffers and transfer lists... DONE
Running bandwidth benchmark (reading, sequential, 1000 iterations)... DONE
Calculating percentiles...
Queue #128 read percentiles (1000 samples)
            bandwidth,       adj iops,    cmd latency,    prp latency
  max:       2118.074,     517108.001,         67.191,          2.150
 0.99:       2107.156,     514442.488,         65.464,          2.095
 0.97:       2102.182,     513227.943,         64.984,          2.079
 0.95:       2097.901,     512182.780,         64.776,          2.073
 0.90:       2093.795,     511180.541,         64.481,          2.063
 0.75:       2084.105,     508814.706,         63.536,          2.033
 0.50:       2070.331,     505451.803,         61.828,          1.978
 0.25:       2014.709,     491872.302,         61.419,          1.965
 0.10:       1985.443,     484727.223,         61.136,          1.956
 0.05:       1976.456,     482533.263,         61.015,          1.952
 0.01:       1957.190,     477829.660,         60.771,          1.945
  min:       1905.024,     465093.782,         60.432,          1.934
End percentiles
OK!

But when I try to run it with a GPU, or run the nvm-cuda-bench binary, I get the following error: "Unexpected error: Failed to map device memory: Invalid argument"

./bin/nvm-cuda-bench --ctrl=/dev/libnvm0

CUDA device           : 0 Tesla V100-PCIE-16GB (0000:07:00.0)
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering      : no
Unexpected error: Failed to map device memory: Invalid argument

Hi,

Could you try passing --gpu 0 as an argument to nvm-latency-bench and post the results?

Also, please run dmesg and post any output from the libnvm helper kernel module (if there is any).

It could also be worth building the project in debug mode (passing -DCMAKE_BUILD_TYPE=Debug as an argument to CMake). Have you verified that the IOMMU is disabled?

Regards,
Jonas

Could you also verify that the IOMMU is disabled? (For example, by showing the output of cat /proc/cmdline | grep iommu.) I plan on implementing IOMMU support for newer kernels, but haven't gotten that far yet.
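
The kernel command line alone isn't always conclusive, so (assuming a fairly standard Linux setup) a couple of generic checks that should also work:

dmesg | grep -i -e DMAR -e IOMMU
ls /sys/class/iommu/    # empty if no IOMMU is active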

Regards,
Jonas

When I run nvm-latency-bench with the --gpu flag, I get the following output:

./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --blocks=1000  --queue="no=128,location=local" --gpu 0

Resetting controller... DONE
Preparing queues... DONE
Preparing buffers and transfer lists... FAIL
Unexpected runtime error: Failed to map device memory for controller: Invalid argument

This is the output after recompiling in debug mode:

./bin/nvm-latency-bench --ctrl=/dev/libnvm0 --blocks=1000  --queue="no=128,location=local" --gpu 0

Resetting controller... DONE
Preparing queues... DONE
Preparing buffers and transfer lists... [map_memory] Page mapping kernel request failed: Invalid argument
FAIL
Unexpected runtime error: Failed to map device memory for controller: Invalid argument

The output in the kernel log after running the above program is:

[Aug 3 09:50] Unknown ioctl command from process 28198: 1075347458

The IOMMU should be disabled, as cat /proc/cmdline | grep iommu prints no output.

Hi,

It appears that the kernel module has not been compiled with CUDA support. When you run CMake in a clean directory, the status output should say Using NVIDIA driver found in ${driver_dir} and Configuring kernel module with CUDA.

The build script tries to locate the driver automatically, but the lookup can fail (for example, if the Module.symvers file hasn't been generated), in which case you probably need to point it at the driver path manually. For example:
cmake .. -DNVIDIA=/usr/src/nvidia-384-384.111

The driver directory also needs to contain a file called Module.symvers; if it doesn't, you probably need to run make in that directory so that it is generated.
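
Putting it all together, a minimal sketch of the whole sequence (the driver version and paths below are only examples, substitute your own; and remember to reload the libnvm helper kernel module afterwards so the CUDA-enabled build actually takes effect):

cd /usr/src/nvidia-384-384.111
make                                            # may need root; generates Module.symvers
cd /path/to/ssd-gpu-dma/build
cmake .. -DNVIDIA=/usr/src/nvidia-384-384.111
make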

Let me know if this solves your problem or not. :)

Regards,
Jonas

Thank you for your responses. Your suggestions worked; I believe the problem was the missing Module.symvers file.

By the way, I have tested your code on kernel 4.15 and it seems to be working fine.

This is not really an issue, but I have a question: with the cuda-latency test, or nvm-latency with the --gpu flag, is the CPU completely out of the data and control path, since you are using both GPUDirect RDMA and Async?

Thanks.

Great to hear :)

When you use the --gpu flag with nvm-latency-bench, the CPU is still responsible for submitting commands and processing completions, but the disk writes data directly into (or reads directly from) GPU memory. So this example only uses the GPUDirect RDMA feature.

It's only with nvm-cuda-bench that the CPU is entirely out of the control path and everything is controlled by the GPU. The relevant code is in benchmarks/cuda/main.cu, namely the readSingleBuffered and readDoubleBuffered CUDA kernels. In other words, this example uses both Async and RDMA.

Also note that this benchmark shows lower bandwidth because it copies data from an input buffer into an output buffer (with an offset) in order to emulate a GPU workload.
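
To give a rough idea of what those kernels do, here is a minimal, hypothetical CUDA sketch. Everything in it (QueuePair, submit_read, wait_completion, the parameter names) is a placeholder made up for illustration, not the library's API; the real logic is in the two kernels mentioned above.

#include <cstddef>

// Hypothetical placeholder types and helpers so the sketch is self-contained;
// they do NOT issue real NVMe commands.
struct QueuePair { /* NVMe submission/completion queues mapped into GPU memory */ };

__device__ void submit_read(QueuePair*, size_t chunk, size_t chunkBytes)
{
    // In the real benchmark: build a 64-byte NVMe read command in the
    // submission queue, point its data pointers at the DMA-mapped input
    // buffer, and write the SQ tail doorbell (mapped into GPU address space).
}

__device__ void wait_completion(QueuePair*)
{
    // In the real benchmark: poll the completion queue entry until the
    // controller has finished the command, then update the CQ head doorbell.
}

__global__ void readChunks(QueuePair* qp,
                           const unsigned char* input,  // buffer the SSD DMAs into
                           unsigned char* output,       // destination ("workload") buffer
                           size_t chunkBytes,
                           size_t chunksPerThread,
                           size_t offset)               // output must have room for this offset
{
    const size_t thread = blockIdx.x * blockDim.x + threadIdx.x;

    for (size_t i = 0; i < chunksPerThread; ++i)
    {
        const size_t chunk = thread * chunksPerThread + i;

        submit_read(qp, chunk, chunkBytes);   // the GPU itself enqueues the read
        wait_completion(qp);                  // and waits for the SSD to complete it

        // Emulated workload: copy the chunk into the output buffer at an offset.
        const unsigned char* src = input + chunk * chunkBytes;
        unsigned char* dst = output + chunk * chunkBytes + offset;
        for (size_t b = 0; b < chunkBytes; ++b)
            dst[b] = src[b];
    }
}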

By nvidia-cuda-bench did you mean nvm-cuda-bench?
Thank you. Actually, I am interested in the nvm-cuda-bench example, as it completely removes the CPU from the control path. I will look at the code you pointed me to. Thank you so much for the hints.

Actually, I just decided to test nvm-cuda-bench, and either there is an issue with some of the calculations in the code or something else is broken, because when I run it I get the following output:

./bin/nvm-cuda-bench --ctrl=/dev/libnvm0 

CUDA device           : 0 Tesla V100-PCIE-16GB (0000:09:00.0)
Controller page size  : 4096 B
Namespace block size  : 512 B
Number of threads     : 32
Chunks per thread     : 32
Pages per chunk       : 1
Total number of pages : 1024
Total number of blocks: 8192
Double buffering      : no
Event time elapsed    : 8.192 µs
Estimated bandwidth   : 512000.001 MiB/s

That is an insanely high bandwidth; my SSD is supposed to deliver about 6 GB/s.
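
(For scale: the run touches 1024 pages × 4096 B = 4 MiB, and 4 MiB in 8.192 µs works out to roughly 500 GB/s, far beyond both the SSD and the PCIe link itself, so the kernel can hardly have performed any real I/O.)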

When I run it with stats set to true, all the columns show either 0 or -nan.

I suspect this is caused by me not having tested on GPUs newer than Pascal, and by not compiling SM code for newer architectures.

Please try specifying -Dnvidia_archs=70 to CMake and rebuilding (you might need to do a make clean first).
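
A minimal example of the rebuild, assuming the same build directory as before (options you passed earlier, such as -DNVIDIA=..., should normally be remembered by the CMake cache):

cd /path/to/ssd-gpu-dma/build
cmake .. -Dnvidia_archs=70
make clean
make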

Let me know if that works :)

Regards,
Jonas

Hi @ZaidQureshi,

Just a follow up, did setting the nvidia_archs flag work for you?

Regards,
Jonas

I hope the issue was resolved. Please reopen it or add a comment if there is anything else.

I ran into the same problem, and setting nvidia_archs worked for me too! Thank you for your help :)