Writing the COORDINATION CV with OpenACC
Opened this issue · 3 comments
As I did with CUDA (#1028), and as I tried to do with ArrayFire (#1049) and PyTorch, I rewrote the COORDINATION CV using OpenACC as the accelerator.
Here's the result, using the new benchmark tool:
It is slower than CUDA, but writing in OpenACC may feel more familiar: it looks like OpenMP, and you can leave it to the compiler to work out how to parallelize the loops, so you do not have to use the `<<<>>>` syntax to launch kernels as in CUDA. And it is far more flexible than the tensor libraries.
On the compilation side I have some mixed feelings, as you can read in the spoiler below.
Details about compilation and script used
I ran everything on my workstation (NVIDIA T1000 8 GB + AMD Ryzen 5 PRO 5650G).
I used nvhpc 24.3, downloaded precompiled from the NVIDIA site.
The environment used is actually slightly complex:
I compiled plumed from master with plain gcc+mpi
Then I compiled the plugin with my wild Makefile, which uses nvc++ for the accelerated part and g++ for the main body of the CV.
Then I ran the benchmark without nvhpc in the environment, because it conflicts with the MPI that I used with plumed:
```sh
nsteps=100
list_of_natoms="500 2000 4000 6000 8000 10000 12000 14000 16000"
export PLUMED_NUM_THREADS=8
useDistr="line sc"
useDistr="sc"
for distr in $useDistr; do
  for natoms in $list_of_natoms; do
    fname="${distr}_wACC_${PLUMED_NUM_THREADS}threads_${natoms}_Steps${nsteps}"
    plumed benchmark --plumed="plumed.dat:cudasingleplumed.dat:accplumed.dat" \
      --natoms=${natoms} --nsteps=${nsteps} --atom-distribution=${distr} >"${fname}.out"
    grep -B1 Comparative "${fname}.out"
  done
done
rm -f bck.*
```
(I still have to try to make everything run compiled with plain nvhpc.
But since nvhpc does not like the keyword `auto` for deducing return types (as used in tools/MergeVectorTools.h:54), that needs some massaging of the plumed source, and I did not want to touch src for this project.)
If you look at the code, I also added a few extra headers:

- LoopUnroller.h
- Tensor.h
- Vector.h

These are variants of the original headers that add the possibility of declaring Tensors and Vectors of any element type.
There are also some splashes of refactoring to C++17 where I did not manage to convince nvc++ to deduce the template arguments as I wanted, plus a Tools_pow.h that templatizes the type in the runtime version of fastpow.
These modifications are a prerequisite for using OpenACC but are completely independent from it, so if you are OK with this, I would like to open a PR with a patch to the original .h files.
Regarding the vector and tensor with a generic type: I tried to do the same a few years ago, and I remember that with the Intel compiler performance was measurably affected (to my surprise). Maybe you can double-check this. If that's still true, maybe we can duplicate the code. Otherwise I am also happy with a more general version; it would be useful in other parts of the code as well.
> (I still have to try to make everything run compiled with plain nvhpc. But since nvhpc does not like the keyword `auto` for deducing return types (as used in tools/MergeVectorTools.h:54), that needs some massaging of the plumed source, and I did not want to touch src for this project.)
If it's limited to this, maybe we can adjust the code. It would be ideal if we could also install nvc++ in one GitHub Actions job to test for this.
> Regarding the vector and tensor with a generic type: I tried to do the same a few years ago, and I remember that with the Intel compiler performance was measurably affected (to my surprise). Maybe you can double-check this. If that's still true, maybe we can duplicate the code. Otherwise I am also happy with a more general version; it would be useful in other parts of the code as well.
OK, so I'll set up the PR as a WIP, and then I will produce some benchmarks.
> If it's limited to this, maybe we can adjust the code. It would be ideal if we could also install nvc++ in one GitHub Actions job to test for this.
I'm trying to do it in #1076.