MPS has an open bug that we reported in NVIDIA Developer forum with id 3559606. The issue is that when running mulitple clients with MPS is slower than Native CUDA that used time-sharing. Dir kernel_without_inout contains the microbenchmark that will run mulitple times to evaluate the performance improvement provided by MPS compared to Native.
Use the Makefile in kernel_without_inout.
In the kernel_without_inout directory adjust the jenna_conf/execute.sh. To find the optimal cores use nvidia-smi topo --matrix.
It takes as parameters MPS or No MPS and the concurrency.
In the kernel_without_inout there are scripts for starting and stopping MPS. Then run multiple times the jenna_conf/execute.sh.
Concurrent instances | MPS | NATIVE |
---|---|---|
1 | 1786.074 | 1784.629 |
2 | 1958.4375 | 2218.039 |
4 | 10991.53675 | 4412.19625 |