pecos/tps

Memory error arises when SourceTerm uses a device loop

Closed this issue · 3 comments

This issue is reproduced in the branch gpu_source_issue.
The device loop only touches the right-hand side, as follows:

void SourceTerm::updateTerms_gpu(mfem::Vector &in) {
  const double *h_Up = Up_->Read();
  const double *h_U = U_->Read();
  const double *h_gradUp = gradUp_->Read();
  double *h_in = in.ReadWrite();

  const int nnodes = vfes->GetNDofs();

  MFEM_FORALL(n, nnodes, {
    double upn[gpudata::MAXEQUATIONS];
    double Un[gpudata::MAXEQUATIONS];
    double gradUpn[gpudata::MAXEQUATIONS * gpudata::MAXDIM];
    double srcTerm[gpudata::MAXEQUATIONS];
//
////    for (int eq = 0; eq < num_equation; eq++) {
////      upn[eq] = h_Up[n + eq * nnodes];
////      Un[eq] = h_U[n + eq * nnodes];
////      for (int d = 0; d < dim; d++) gradUpn[eq + d * num_equation] = h_gradUp[n + eq * nnodes + d * num_equation * nnodes];
////    }
////    // TODO(kevin): update E-field with EM coupling.
////    // E-field can have azimuthal component.
////    double Efield[gpudata::MAXDIM];
////    for (int v = 0; v < nvel; v++) Efield[v] = 0.0;
////
////    updateTermAtNode(Un, upn, gradUpn, Efield, srcTerm);
//
    // add source term to buffer
    for (int eq = 0; eq < num_equation; eq++) {
//      h_in[n + eq * nnodes] += srcTerm[eq];
      h_in[n + eq * nnodes] += 0.0;
    }   
  }); 
}

Running test/argon_minimal.binary.test fails with a memory error while running tps:

CUDA error: (cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice)) failed with error:
 --> an illegal memory access was encountered
 ... in function: void* mfem::CuMemcpyHtoD(void*, const void*, size_t)
 ... in file: general/cuda.cpp:116
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0


CUDA error: (cudaFree(dptr)) failed with error:
 --> driver shutting down
 ... in function: void* mfem::CuMemFree(void*)
 ... in file: general/cuda.cpp:86

This error does not occur in SourceTerm::updateTerms_gpu itself. It occurs at the line auto d_temp = loc_print.ReadWrite(); in M2ulPhyS::Check_NaN_GPU. Commenting out h_in[n + eq * nnodes] += 0.0; in SourceTerm::updateTerms_gpu, however, makes the error go away. At this point it is not clear how these two functions interact to cause the issue. It is suspected that the ReadWrite() function is sensitive to how it is used.

A couple more attempts showed that commenting out the line where the memory error occurs only postpones the failure; the same error then appears later at a seemingly random place. (This is consistent with how CUDA reports errors: kernel launches are asynchronous, so an illegal memory access in one kernel is often surfaced at a later, unrelated API call.) It seems that the device pointer for the mfem::Vector in is not being set up properly.

The problem here has to do with the fact that MFEM_FORALL expands into C++ lambdas and with how variables are captured. The relevant code snippet in mfem is:

// The MFEM_FORALL wrapper
#define MFEM_FORALL(i,N,...)                             \
   ForallWrap<1>(true,N,                                 \
                 [=] MFEM_DEVICE (int i) {__VA_ARGS__},  \
                 [&] MFEM_LAMBDA (int i) {__VA_ARGS__})

This defines two lambdas, one with implicit by-copy capture ([=]) for use as a device kernel, and one with implicit by-reference capture ([&]) for use on the host, and passes them to ForallWrap (see here), which launches the kernel on the device if a device is detected.

The problem with the source term code here (ignoring everything that is commented out) is that it creates a lambda that uses data members of the SourceTerm class, specifically num_equation, which requires capturing this. However, even with by-copy capture, this is always captured by pointer: only the this pointer is copied, and member accesses still dereference it. See here for more details. On the device, that pointer is not valid.

The only workaround I know of that doesn't require changing mfem is to avoid using this. This approach seems to be what @dreamer2368 is working toward (e.g., see b692f67).

This issue is now resolved in gpu_source.
As in the other MFEM_FORALL loops, all variables are pre-defined before the loop to avoid using this.