pnnl/ExaGO

Incline Test Failures

Issue type

  • New feature
  • Bug
  • Discussion
  • Other

Relates to

  • OPFLOW
  • SOPFLOW
  • SCOPFLOW
  • TCOPFLOW
  • CMake build system
  • Spack configuration
  • Manual
  • Web docs
  • Other

Summary

Two isolated tests fail on Incline: one with a segmentation fault and one with a timeout. Neither failure occurs on Deception or Newell; other AMD platforms have not been checked yet. The failures were possibly introduced with hiop@1.0.0.

I am creating a separate issue for these failures to isolate them from #3 and #43, and to let #84 continue without these tests blocking it.

Exact commands to reproduce, if applicable

  • These tests are currently skipped in CI. To reproduce, either run the tests manually or delete the incline-skip tag from those tests in the CMake configuration.
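As a rough sketch of the "run the tests manually" route, the two test names from the logs below can be passed to `ctest` as a regular expression. The variable `INCLINE_TESTS` and the helper `run_incline_tests` are hypothetical names used only for illustration. The sketch assumes it is run from an ExaGO build directory on Incline; the issue does not say how the incline-skip tag is wired in, so it may still need to be removed for the tests to execute.

```shell
# Hypothetical helper: the two failing test names, joined as a regex alternation.
INCLINE_TESTS='FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE|FUNCTIONALITY_TEST_SOPFLOW_SCENARIO_RAJA_GPU_TOML'

# Run only those two tests and print their output when they fail.
# Assumes the current directory is the ExaGO build directory.
# Extra arguments are forwarded to ctest unchanged.
run_incline_tests() {
  ctest -R "^(${INCLINE_TESTS})\$" --output-on-failure "$@"
}
```

Calling `run_incline_tests` from the build directory runs both tests; `run_incline_tests --timeout 600` would additionally cap each test at 600 seconds, which may help when investigating the timeout.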

Relevant logs and/or screenshots, if applicable

  1. FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE
20/57 Test #20: FUNCTIONALITY_TEST_OPFLOW_RAJAHIOP_SPARSE_GPU_TOML_TESTSUITE .................***Failed    2.76 sec
[ExaGO] Creating OPFlow Functionality Test
Test Description: datafiles/case9/case9mod.m base case
[Warning] Hiop does not understand option 'dualsInitialization' and will ignore its value 'zero'.
[Warning] Detected 1 fixed variables out of a total of 24.
===============
Hiop SOLVER
===============
Using 1 MPI ranks.
---------------
Problem Summary
---------------
Total number of variables: 24
     lower/upper/lower_and_upper bounds: 16 / 16 / 16
Total number of equality constraints: 18
Total number of inequality constraints: 18
     lower/upper/lower_and_upper bounds: 18 / 18 / 18
iter    objective     inf_pr     inf_du   lg(mu)  alpha_du   alpha_pr linesrch
   0  1.0318125e+04 1.800e+00  4.460e+03  -1.00  0.000e+00  0.000e+00  -(-)
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
  2. FUNCTIONALITY_TEST_SOPFLOW_SCENARIO_RAJA_GPU_TOML

Is this issue only on Ascent OR does this happen on other platforms too?

This behavior occurs only on Incline (not on Deception, Newell, or Ascent). @nkoukpaizan also saw similar failures on Frontier in #89, so it is likely AMD-related.