Crash when using large 3D NArrays
Mietzsch opened this issue · 9 comments
When run on a VM with between 3 and 6 GB of RAM and 4 units, this program raises a bus error and crashes inside the main for-loop. Depending on the exact hardware, sometimes disabling line 20 solves this; sometimes it even crashes when the v array is not declared at all (line 17). Note that this array is not even needed in the calculation. When run with only 2 units, the program gets about twice as far as with 4 units, and serial execution works completely fine. I am using dash-0.4.0.
#include <libdash.h>
#include <vector>

using namespace std;

int main(int argc, char* argv[])
{
  dash::init(&argc, &argv);

  int n1 = 514;
  int n2 = 514;
  int n3 = 514;

  auto distspec2 = dash::DistributionSpec<3>(dash::BLOCKED, dash::NONE, dash::NONE);
  dash::NArray<double, 3> u(n1, n2, n3, distspec2);
  dash::NArray<double, 3> v(n1, n2, n3, distspec2);
  dash::NArray<double, 3> r(n1, n2, n3, distspec2);

  dash::fill(v.begin(), v.end(), 0.0);
  dash::fill(u.begin(), u.end(), 0.0);
  // printf("v.local(0,0,0)=%f\n", (double) v.local(0,0,0));

  if (0 == dash::myid()) {
    u(134,154,13)  = 10;
    u(34,14,139)   = 10;
    u(167,48,165)  = 10;
    u(117,214,35)  = 10;
    u(187,65,241)  = 10;
    u(67,57,158)   = 10;
    u(37,210,179)  = 10;
    u(247,124,138) = 10;
    u(42,189,231)  = 10;
    u(15,124,133)  = 10;
  }
  dash::barrier();

  double a[4];
  a[0] = -8.0/3.0;
  a[1] =  0.0;
  a[2] =  1.0/6.0;
  a[3] =  1.0/12.0;

  std::vector<double> u1(n1);
  std::vector<double> u2(n1);

  int z_ext = u.local.extent(0);
  for (int i3 = 1; i3 < z_ext-1; i3++) {
    printf("Unit %d trying local plane %d of %d\n", (int) dash::myid(), i3, z_ext);
    for (int i2 = 1; i2 < n2-1; i2++) {
      for (int i1 = 0; i1 < n1; i1++) {
        u1[i1] = u.local(i3,i2-1,i1) + u.local(i3,i2+1,i1)
               + u.local(i3-1,i2,i1) + u.local(i3+1,i2,i1);
        u2[i1] = u.local(i3-1,i2-1,i1) + u.local(i3-1,i2+1,i1)
               + u.local(i3+1,i2-1,i1) + u.local(i3+1,i2+1,i1);
      }
      for (int i1 = 1; i1 < n1-1; i1++) {
        r.local(i3,i2,i1) = //v.local(i3,i2,i1)
                          - a[0] * u.local(i3,i2,i1)
          //--------------------------------------------------------------------
          //c Assume a(1) = 0 (Enable 2 lines below if a(1) not= 0)
          //c-------------------------------------------------------------------
          //c > - a(1) * ( u(i1-1,i2,i3) + u(i1+1,i2,i3)
          //c >           + u1(i1) )
          //c-------------------------------------------------------------------
                          - a[2] * ( u2[i1] + u1[i1-1] + u1[i1+1] )
                          - a[3] * ( u2[i1-1] + u2[i1+1] );
      }
    }
  }

  dash::finalize();
  return EXIT_SUCCESS;
}
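As a side note, the per-unit footprint of these arrays can be printed from inside the reproducer. This is only a sketch: it assumes the local view exposes extent(d) for all three dimensions, in the same way u.local.extent(0) is already used above.

```cpp
// Sketch: report how many elements (and bytes) each unit holds per NArray.
// Assumes u.local.extent(d) is valid for d = 0, 1, 2.
size_t local_elems = 1;
for (int d = 0; d < 3; ++d) {
  local_elems *= u.local.extent(d);
}
printf("Unit %d: %zu local elements per array (%.2f MB), 3 arrays = %.2f MB\n",
       (int) dash::myid(), local_elems,
       local_elems * sizeof(double) / 1.0e6,
       3 * local_elems * sizeof(double) / 1.0e6);
```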
@Mietzsch Thanks for the report. Could you please add which MPI version you are using and paste the CMake command you used to build DASH (likely the content of your build.sh)?
One reason for seeing SIGBUS is that the MPI implementation allocates windows in shared memory, and it might be that your VM does not provide sufficient space for them. That is specific to the MPI implementation, though.
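For reference, the exact MPI implementation and version can usually be queried from the launcher itself; this is a generic sketch, not DASH-specific:

```sh
mpiexec --version    # prints the launcher/implementation version (MPICH Hydra, Open MPI, ...)
mpichversion         # MPICH-only helper, if it is in the PATH
ompi_info | head     # Open MPI equivalent
```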
I am using MPI version 3.3a2 and I built using the following build.sh:
#!/bin/sh
BUILD_DIR=./build
FORCE_BUILD=false
if [ "$1" = "-f" ]; then
FORCE_BUILD=true
fi
await_confirm() {
if ! $FORCE_BUILD; then
echo ""
echo " To build using these settings, hit ENTER"
read confirm
fi
}
exit_message() {
echo "--------------------------------------------------------"
echo "Done. To install DASH, run make install in $BUILD_DIR"
}
if [ "${PAPI_HOME}" = "" ]; then
PAPI_HOME=$PAPI_BASE
fi
# To specify a build configuration for a specific system, use:
#
# -DENVIRONMENT_TYPE=<type> \
#
# For available types, see the files in folder ./config.
# To specify a custom build configuration, use:
#
# -DENVIRONMENT_CONFIG_PATH=<path to cmake file> \
#
# To use an existing installation of gtest instead of downloading the sources
# from the google test subversion repository, use:
#
# -DGTEST_LIBRARY_PATH=${HOME}/gtest \
# -DGTEST_INCLUDE_PATH=${HOME}/gtest/include \
#
# To build with MKL support, set environment variables MKLROOT and INTELROOT.
#
# To enable IPM runtime support, use:
#
# -DIPM_PREFIX=<IPM install path> \
# For likwid support, ensure that the likwid development headers are
# installed.
# Configure with default release build settings:
mkdir -p $BUILD_DIR
rm -Rf $BUILD_DIR/*
(cd $BUILD_DIR && cmake -DCMAKE_BUILD_TYPE=Release \
                        -DBUILD_SHARED_LIBS=OFF \
                        -DBUILD_GENERIC=OFF \
                        -DENVIRONMENT_TYPE=default \
                        -DINSTALL_PREFIX=$HOME/opt/dash-0.4.0/ \
                        -DDART_IMPLEMENTATIONS=mpi \
                        -DENABLE_THREADSUPPORT=ON \
                        -DENABLE_DEV_COMPILER_WARNINGS=OFF \
                        -DENABLE_EXT_COMPILER_WARNINGS=OFF \
                        -DENABLE_LT_OPTIMIZATION=OFF \
                        -DENABLE_ASSERTIONS=ON \
                        \
                        -DENABLE_SHARED_WINDOWS=ON \
                        -DENABLE_DYNAMIC_WINDOWS=ON \
                        -DENABLE_UNIFIED_MEMORY_MODEL=ON \
                        -DENABLE_DEFAULT_INDEX_TYPE_LONG=ON \
                        \
                        -DENABLE_LOGGING=OFF \
                        -DENABLE_TRACE_LOGGING=OFF \
                        -DENABLE_DART_LOGGING=OFF \
                        \
                        -DENABLE_LIBNUMA=ON \
                        -DENABLE_LIKWID=OFF \
                        -DENABLE_HWLOC=ON \
                        -DENABLE_PAPI=ON \
                        -DENABLE_MKL=ON \
                        -DENABLE_BLAS=ON \
                        -DENABLE_LAPACK=ON \
                        -DENABLE_SCALAPACK=ON \
                        -DENABLE_PLASMA=ON \
                        -DENABLE_HDF5=ON \
                        -DENABLE_MEMKIND=ON \
                        \
                        -DBUILD_EXAMPLES=ON \
                        -DBUILD_TESTS=ON \
                        -DBUILD_DOCS=ON \
                        \
                        -DIPM_PREFIX=${IPM_HOME} \
                        -DPAPI_PREFIX=${PAPI_HOME} \
                        \
                        -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
                        ../ && \
 await_confirm && \
 make -j 4) && (cp $BUILD_DIR/compile_commands.json .) && \
exit_message
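For context, the script is typically invoked from the DASH source root like this (a sketch assuming the file above is saved as build.sh there; the -f flag defined at the top skips the confirmation prompt):

```sh
# Non-interactive configure + build, then install to the INSTALL_PREFIX set above
./build.sh -f
cd build && make install
```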
Unfortunately, I am not too familiar with MPICH. From https://www.mpich.org/static/downloads/3.3/mpich-3.3-README.txt, it looks like MPICH potentially allocates shared memory under either /dev/shm or /tmp. Can you please check how much space is free in these directories? The command df -h /tmp/ /dev/shm/ should provide that information.
/tmp has 20 GB available and /dev/shm 2.5 GB.
Each matrix you allocate is about 1 GB (514³ × 8 B ≈ 1.09 GB), so the three of them add up to roughly 3 GB, which does not fit into the 2.5 GB available in /dev/shm. I'm not sure how to convince MPICH to use /tmp instead of /dev/shm; the README.envvars in the distribution tarball doesn't seem to have a hint. If you have a way of installing Open MPI, you can give that a try and set the MCA parameter shmem_mmap_backing_file_base_dir to /tmp. Let me know if you need help with that.
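For example (a sketch, not tested here; my_dash_program stands in for the compiled reproducer), the mmap shmem backing directory can be set on the mpirun command line or via the environment:

```sh
# Point Open MPI's mmap shared-memory backing files at /tmp instead of /dev/shm
mpirun --mca shmem mmap \
       --mca shmem_mmap_backing_file_base_dir /tmp \
       -np 4 ./my_dash_program

# or equivalently, via the environment
export OMPI_MCA_shmem_mmap_backing_file_base_dir=/tmp
```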
Then let me come back to you on this as well, after I've checked it. Thank you very much, this was very helpful.
If possible, you could also try to provide the VM with more memory and/or increase the size of /dev/shm manually, e.g., following https://masukkhan.wordpress.com/2015/12/09/resize-devshm-filesystem-in-linux/
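For example (a sketch; 4G is just a value large enough for the three matrices in the reproducer), /dev/shm is a tmpfs and can usually be resized on the fly:

```sh
# Enlarge /dev/shm temporarily; the setting is lost after a reboot
sudo mount -o remount,size=4G /dev/shm
df -h /dev/shm   # verify the new size
```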
I increased the size of /dev/shm as you suggested, and that solved the issue. Thank you very much!
Thanks for reporting back! We should add a KNOWN_ISSUES file that describes this.