MPI - Error with VexCL backend but not builtin or CUDA backends
I'm solving a system with 4 MPI ranks. The system solves successfully with MPI using AMGCL's builtin or CUDA backends. However, when I use the VexCL backend in debug mode under gdb, I get
/home/ubuntu/external_libraries/vexcl/vexcl/gather.hpp:56: vex::detail::index_partition<T>::index_partition(const std::vector<vex::backend::cuda::command_queue>&, size_t, std::vector<long unsigned int>) [with T = double; size_t = long unsigned int]: Assertion `std::is_sorted(indices.begin(), indices.end())' failed.
The error occurs on every rank.
I looked at indices, and it's definitely not sorted overall. However, indices has one "chunk" per neighbor, and each chunk is sorted. I'm not sure why VexCL wants indices to be sorted overall (I drew out an example, and it seems like each chunk should be sorted, but the overall list need not be). This error does not occur with 1 rank (obviously) or 2 ranks, but it consistently appears with 4 MPI ranks.
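To illustrate what I mean, here is a small standalone example (not VexCL or AMGCL code; the chunk values are made up) of a list that is sorted within each neighbor chunk but not overall:

```cpp
// Standalone illustration of the layout described above: one chunk of
// indices per neighbor rank, each chunk sorted internally, but the
// concatenation not sorted as a whole. The values are made up; in a 2D
// domain decomposition the chunks for different neighbors can interleave
// in local index space like this.
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    // Chunk for the first neighbor (sorted within the chunk):
    std::vector<std::size_t> chunk_a = {2, 5, 8, 11};
    // Chunk for the second neighbor (also sorted within the chunk),
    // but overlapping the value range of the first chunk:
    std::vector<std::size_t> chunk_b = {9, 10};

    // Concatenate the chunks in neighbor order:
    std::vector<std::size_t> indices;
    indices.insert(indices.end(), chunk_a.begin(), chunk_a.end());
    indices.insert(indices.end(), chunk_b.begin(), chunk_b.end());

    // Each chunk is sorted ...
    assert(std::is_sorted(chunk_a.begin(), chunk_a.end()));
    assert(std::is_sorted(chunk_b.begin(), chunk_b.end()));

    // ... but the overall list {2, 5, 8, 11, 9, 10} is not, which is what
    // the std::is_sorted(indices.begin(), indices.end()) assertion in
    // vex::detail::index_partition trips over.
    std::cout << std::boolalpha
              << std::is_sorted(indices.begin(), indices.end())
              << std::endl; // prints false
}
```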
I saved the local (reordered) system and right-hand side from each of the 4 ranks in Matrix Market format from Eigen; they are attached here. I think the failure should be reproducible; I've reproduced it on multiple machines. I also suspect this may be a legitimate bug, since the system appears to solve fine with the builtin or CUDA backends. But let me know what you think! Thanks!
mat_0.txt
mat_1.txt
mat_2.txt
mat_3.txt
rhs_0.txt
rhs_1.txt
rhs_2.txt
rhs_3.txt
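For reference, per-rank dumps like these can be written with Eigen's Matrix Market support along these lines (a sketch, not my exact code; the matrix A, right-hand side b, and file naming are placeholders):

```cpp
// Sketch of writing one Matrix Market file per MPI rank with Eigen.
// A and b stand in for the local (reordered) system and right-hand side.
#include <string>
#include <Eigen/Sparse>
#include <unsupported/Eigen/SparseExtra> // Eigen::saveMarket, Eigen::saveMarketVector
#include <mpi.h>

void dump_local_system(const Eigen::SparseMatrix<double> &A,
                       const Eigen::VectorXd &b) {
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // One matrix file and one right-hand-side file per rank,
    // matching the attachments above.
    Eigen::saveMarket(A, "mat_" + std::to_string(rank) + ".txt");
    Eigen::saveMarketVector(b, "rhs_" + std::to_string(rank) + ".txt");
}
```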
Thank you for reporting. I can reproduce the issue and will look into it.
So ddemidov/vexcl@034abe9 should deal with the most common case of a single GPU per MPI process.
Things are more complicated when there is more than one compute device per MPI process. VexCL wants the indices to be sorted so that each device can gather a contiguous set of indices from its own memory. Doing this correctly requires sorting the indices before creating the vex::gather object, and then permuting the gathered results back into the originally requested positions (so that the per-neighbor MPI send chunks remain contiguous).
I need to think about how to do this so that there is minimal additional work for the most common case of a single GPU per MPI process.
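Roughly, the idea looks like this (a minimal host-side sketch on std::vector, not the actual VexCL implementation):

```cpp
// Sort-then-permute sketch: sort the requested indices, gather in sorted
// order, then permute the gathered values back into the originally
// requested positions. `source` stands in for device memory, `indices`
// for the per-neighbor (unsorted overall) gather list.
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

std::vector<double> gather_unsorted(const std::vector<double> &source,
                                    const std::vector<std::size_t> &indices) {
    // Permutation that sorts the requested indices.
    std::vector<std::size_t> perm(indices.size());
    std::iota(perm.begin(), perm.end(), 0);
    std::sort(perm.begin(), perm.end(),
              [&](std::size_t i, std::size_t j) { return indices[i] < indices[j]; });

    // Gather in sorted order (this is what lets each device read a
    // contiguous slice of its own memory in the multi-device case).
    std::vector<double> sorted_values(indices.size());
    for (std::size_t k = 0; k < perm.size(); ++k)
        sorted_values[k] = source[indices[perm[k]]];

    // Permute the gathered values back into the originally requested
    // positions, so the per-neighbor send chunks come out contiguous.
    std::vector<double> result(indices.size());
    for (std::size_t k = 0; k < perm.size(); ++k)
        result[perm[k]] = sorted_values[k];

    return result;
}
```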
ddemidov/vexcl@5ec2f10 should work in all cases.
Thank you so much, that commit seems to work!! Merry Christmas!