dash-project/dash

Crash when using large 3D NArrays

Mietzsch opened this issue · 9 comments

When run on a VM with between 3 and 6 GB of RAM and 4 units, this program raises a bus error and crashes inside the main for loop. Depending on the exact hardware, sometimes disabling line 20 solves this; sometimes it even crashes when the v array is not declared at all (line 17). Note that this array is not even needed in the calculation. When run with only 2 units, the program gets roughly twice as far as with 4 units, and serial execution works completely fine. I am using dash-0.4.0.

#include <libdash.h>

#include <vector>
using namespace std;


int main(int argc, char* argv[])
{
   dash::init(&argc, &argv);
   int n1 = 514;
   int n2 = 514;
   int n3 = 514;

   auto distspec2 = dash::DistributionSpec<3>( dash::BLOCKED, dash::NONE , dash::NONE);
   dash::NArray<double, 3> u(n1,n2,n3, distspec2);
   dash::NArray<double, 3> v(n1,n2,n3, distspec2);
   dash::NArray<double, 3> r(n1,n2,n3, distspec2);

   dash::fill(v.begin(), v.end(), 0.0);
   dash::fill(u.begin(), u.end(), 0.0);

  // printf("v.local(0,0,0)=%f\n", (double) v.local(0,0,0));

   if(0 == dash::myid()) {
      u(134,154,13) = 10;
      u(34,14,139) = 10;
      u(167,48,165) = 10;
      u(117,214,35) = 10;
      u(187,65,241) = 10;
      u(67,57,158) = 10;
      u(37,210,179) = 10;
      u(247,124,138) = 10;
      u(42,189,231) = 10;
      u(15,124,133) = 10;
   }

   dash::barrier();
   double a[4];
   a[0] = -8.0/3.0;
   a[1] =  0.0;
   a[2] =  1.0/6.0;
   a[3] =  1.0/12.0;


  std::vector<double> u1(n1);
  std::vector<double> u2(n1);

  int z_ext = u.local.extent(0);

  for(int i3 = 1; i3 < z_ext-1; i3++) {
    printf("Unit %d trying local plane %d of %d\n", (int) dash::myid(), i3, z_ext);
    for (int i2 = 1; i2 < n2-1; i2++) {
      for (int i1 = 0; i1 < n1; i1++) {
        u1[i1] = u.local(i3,i2-1,i1) + u.local(i3,i2+1,i1) + u.local(i3-1,i2,i1) + u.local(i3+1,i2,i1);
        u2[i1] = u.local(i3-1,i2-1,i1) + u.local(i3-1,i2+1,i1) + u.local(i3+1,i2-1,i1) + u.local(i3+1,i2+1,i1);
      }
      for (int i1 = 1; i1 < n1-1; i1++) {
        r.local(i3,i2,i1) = //v.local(i3,i2,i1)
         - a[0] * u.local(i3,i2,i1)
        //--------------------------------------------------------------------
        //c  Assume a(1) = 0	  (Enable 2 lines below if a(1) not= 0)
        //c-------------------------------------------------------------------
        //c > - a(1) * ( u(i1-1,i2,i3) + u(i1+1,i2,i3)
        //c > + u1(i1) )
        //c-------------------------------------------------------------------
         - a[2] * ( u2[i1] + u1[i1-1] + u1[i1+1] )
         - a[3] * ( u2[i1-1] + u2[i1+1] );
      }
    }
  }

   dash::finalize();

   return EXIT_SUCCESS;
}

@Mietzsch Thanks for the report. Could you please add which MPI version you are using and paste the CMake command you used to build DASH (likely the content of your build.sh)?

One reason for seeing SIGBUS is that the MPI implementation allocates its windows in shared memory, and your VM may not provide sufficient space for them. That is specific to the MPI implementation, though.

I am using MPICH version 3.3a2, and I build using the following build.sh:

#!/bin/sh

BUILD_DIR=./build

FORCE_BUILD=false
if [ "$1" = "-f" ]; then
  FORCE_BUILD=true
fi

await_confirm() {
  if ! $FORCE_BUILD; then
    echo ""
    echo "   To build using these settings, hit ENTER"
    read confirm
  fi
}

exit_message() {
  echo "--------------------------------------------------------"
  echo "Done. To install DASH, run    make install    in $BUILD_DIR"
}

if [ "${PAPI_HOME}" = "" ]; then
  PAPI_HOME=$PAPI_BASE
fi

# To specify a build configuration for a specific system, use:
#
#                    -DENVIRONMENT_TYPE=<type> \
#
# For available types, see the files in folder ./config.
# To specify a custom build configuration, use:
#
#                    -DENVIRONMENT_CONFIG_PATH=<path to cmake file> \
#

# To use an existing installation of gtest instead of downloading the sources
# from the google test subversion repository, use:
#
#                    -DGTEST_LIBRARY_PATH=${HOME}/gtest \
#                    -DGTEST_INCLUDE_PATH=${HOME}/gtest/include \
#

# To build with MKL support, set environment variables MKLROOT and INTELROOT.
#

# To enable IPM runtime support, use:
#
#                    -DIPM_PREFIX=<IPM install path> \

# For likwid support, ensure that the likwid development headers are
# installed.

# Configure with default release build settings:
mkdir -p $BUILD_DIR
rm -Rf $BUILD_DIR/*
(cd $BUILD_DIR && cmake -DCMAKE_BUILD_TYPE=Release \
                        -DBUILD_SHARED_LIBS=OFF \
                        -DBUILD_GENERIC=OFF \
                        -DENVIRONMENT_TYPE=default \
                        -DINSTALL_PREFIX=$HOME/opt/dash-0.4.0/ \
                        -DDART_IMPLEMENTATIONS=mpi \
                        -DENABLE_THREADSUPPORT=ON \
                        -DENABLE_DEV_COMPILER_WARNINGS=OFF \
                        -DENABLE_EXT_COMPILER_WARNINGS=OFF \
                        -DENABLE_LT_OPTIMIZATION=OFF \
                        -DENABLE_ASSERTIONS=ON \
                        \
                        -DENABLE_SHARED_WINDOWS=ON \
                        -DENABLE_DYNAMIC_WINDOWS=ON \
                        -DENABLE_UNIFIED_MEMORY_MODEL=ON \
                        -DENABLE_DEFAULT_INDEX_TYPE_LONG=ON \
                        \
                        -DENABLE_LOGGING=OFF \
                        -DENABLE_TRACE_LOGGING=OFF \
                        -DENABLE_DART_LOGGING=OFF \
                        \
                        -DENABLE_LIBNUMA=ON \
                        -DENABLE_LIKWID=OFF \
                        -DENABLE_HWLOC=ON \
                        -DENABLE_PAPI=ON \
                        -DENABLE_MKL=ON \
                        -DENABLE_BLAS=ON \
                        -DENABLE_LAPACK=ON \
                        -DENABLE_SCALAPACK=ON \
                        -DENABLE_PLASMA=ON \
                        -DENABLE_HDF5=ON \
                        -DENABLE_MEMKIND=ON \
                        \
                        -DBUILD_EXAMPLES=ON \
                        -DBUILD_TESTS=ON \
                        -DBUILD_DOCS=ON \
                        \
                        -DIPM_PREFIX=${IPM_HOME} \
                        -DPAPI_PREFIX=${PAPI_HOME} \
                        \
                        -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
                        ../ && \
 await_confirm && \
 make -j 4) && (cp $BUILD_DIR/compile_commands.json .) && \
exit_message

Unfortunately, I am not too familiar with MPICH. From https://www.mpich.org/static/downloads/3.3/mpich-3.3-README.txt, it looks like MPICH potentially allocates shared memory under either /dev/shm or /tmp. Can you please check how much space is free in these directories? The command df -h /tmp/ /dev/shm/ should provide that information.

/tmp has 20 GB available and /dev/shm 2.5 GB.


Each matrix you allocate is about 1 GB, adding up to roughly 3 GB. I'm not sure how to convince MPICH to use /tmp instead of /dev/shm; the README.envvars in the distribution tarball doesn't seem to have a hint. If you have a way of installing Open MPI, you could give that a try and set the MCA parameter shmem_mmap_backing_file_base_dir to /tmp. Let me know if you need help with that.

Let me get back to you on this after I've checked it. Thank you very much, this was very helpful.

If possible, you could also try to provide the VM with more memory and/or increase the size of /dev/shm manually, e.g., following https://masukkhan.wordpress.com/2015/12/09/resize-devshm-filesystem-in-linux/
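A minimal sketch of both workarounds; note that the 4G size, the unit count, and the binary name ./a.out are placeholders, and the remount requires root:

```shell
# Option 1: enlarge /dev/shm for the running system (tmpfs remount):
sudo mount -o remount,size=4G /dev/shm

# To make the size persistent across reboots, the corresponding
# /etc/fstab entry would be:
#   tmpfs  /dev/shm  tmpfs  defaults,size=4G  0 0

# Option 2 (Open MPI only): back shared memory by files under /tmp
# via the MCA parameter mentioned above:
mpirun --mca shmem_mmap_backing_file_base_dir /tmp -n 4 ./a.out
```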

I increased the size of /dev/shm as you suggested, and that solved the issue. Thank you very much!

Thanks for reporting back! We should add a KNOWN_ISSUES file that describes this.