LLNL/STAT

Error using stat-cl: "terminate called without an active exception"

Closed this issue · 19 comments

Hello,

I am not sure if this error is related to STAT, Launchmon, or something to do with Slurm. I have installed STAT 4.0.1 with Spack release 0.14. The build completed successfully. However, I did not specify Slurm as the resource manager, as I did previously with the manual Launchmon build.

When using stat-cl to launch an application on our cluster, I see the following error reported from Slurm:

terminate called without an active exception

And the output shows the date/time stamp and the daemon launching statement from STAT:

STAT started at 2020-03-04-06:50:07
Launching application and tool daemons...

Has anyone seen this before or have an idea on what might be causing this error?

Best regards,

-Rashawn

@rashawnLK, someone reported a similar error in #15. Can you get me some more debug information? First, can you send me the full command you run for your parallel Slurm job and the full output? I am assuming that the "terminate..." message is coming from srun?

Also, can you try enabling logging in STAT? You can run stat-cl -l FE -l BE -L$HOME/statlogs and this will create some log files in the $HOME/statlogs directory. I'd be curious to see how far STAT is making it. On the surface, though, it looks like srun is dying due to STAT/LaunchMON poking at it. Which version of SLURM is this?

@lee218llnl, thank you for the pointer to #15, and for the debugging/logging options for stat-cl.

The full command for the parallel Slurm job, enclosed in a sbatch script is:

stat-cl -C srun -n16 -N16 --time=5 ./amg2013 -pooldist 1 -r 4 4 4 -P 1 2 1 -printstats

without debugging/logging, and with it:

stat-cl -l FE -l BE -L $statlogdir -C srun -n16 -N16 --time=5 ./amg2013 -pooldist 1 -r 4 4 4 -P 1 2 1 -printstats

The sbatch script is called using sbatch in the following manner:

sbatch --reservation=rashawn -N16 rlk_amg2013_stat_gcc-8.3.0_mpich_16N2t_1ppn.sbatch.sh

The full output from the debugging/logging run as printed in the slurm *.out file is:

STAT started at 2020-03-04-09:08:55
Launching application and tool daemons...

There is not a Slurm error file with this, but the logging output contains this:

<Mar 04 09:08:55> <STAT.C:142> STAT started at 2020-03-04-09:08:55
<Mar 04 09:08:55> <STAT_FrontEnd.C:460> Launching application and tool daemons...
<Mar 04 09:08:55> <STAT_FrontEnd.C:558> Initializing LaunchMON
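
To see how far STAT got before the job ended, the last entry in the log can be pulled out programmatically. The following is only a minimal sketch: it assumes every log line follows the `<timestamp> <file:line> message` layout seen in the excerpts in this thread, and the function names `parse_stat_log` and `last_step` are made up for illustration, not part of STAT.

```python
import re

# Assumed layout (taken from the excerpts in this thread):
#   <Mar 04 09:08:55> <STAT_FrontEnd.C:460> message text
LOG_LINE = re.compile(
    r"^<(?P<time>[^>]+)>\s+<(?P<file>[^:>]+):(?P<line>\d+)>\s+(?P<msg>.*)$"
)


def parse_stat_log(text):
    """Return (time, source_file, line_number, message) for each matching line."""
    entries = []
    for raw in text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match:
            entries.append(
                (match["time"], match["file"], int(match["line"]), match["msg"])
            )
    return entries


def last_step(text):
    """Return the last parsed log entry, or None if nothing matched."""
    entries = parse_stat_log(text)
    return entries[-1] if entries else None


# Two lines copied from the log excerpt in this thread.
sample = (
    "<Mar 04 09:08:55> <STAT_FrontEnd.C:460> Launching application and tool daemons...\n"
    "<Mar 04 09:08:55> <STAT_FrontEnd.C:558> Initializing LaunchMON"
)
print(last_step(sample))
```

Here the last step reported is the LaunchMON initialization at STAT_FrontEnd.C:558, which is consistent with the job dying while LaunchMON is starting up.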

I do not see the "terminate called without an active exception" error in any of the output, but the job does end prematurely: the nodes are returned as available almost immediately after I receive the job id.

My version of SLURM is 18.08.7

Thank you,

-Rashawn

@rashawnLK if you built STAT with spack can you try adding ~fgfs to the spec (i.e., spack install stat~fgfs)?

@lee218llnl, I added ~fgfs to my spack install statement:
spack -C ${confs} install stat~fgfs %gcc@8.3.0 ^ncurses+termlib 2>&1 | tee logs/2020_0304_stat-4.0.1_gcc-8.3.0_python3_mpich-gnu32_spack-01.out

That change is the only change I made to the installation statement.

This installation did not progress very far and exited without a useful error message or a log file written by Spack:
...
[+] /home/rlknapp/builds/Spack/stat-4.0.1/install8-repo8-stat-poc/linux-sles15-skylake_avx512/gcc-8.3.0/libxslt-1.1.33-4yaural
==> 46946: Installing launchmon
==> Fetching https://github.com/LLNL/LaunchMON/releases/download/v1.0.2/launchmon-v1.0.2.tar.gz
==> Staging archive: /home/rlknapp/sources/spack-repo8-stat-poc/spack/var/spack/stage/spack-stage-launchmon-1.0.2-4467lbvega4s5osuaagcsyns64xwxmrb/launchmon-v1.0.2.tar.gz
==> Created stage in /home/rlknapp/sources/spack-repo8-stat-poc/spack/var/spack/stage/spack-stage-launchmon-1.0.2-4467lbvega4s5osuaagcsyns64xwxmrb
==> Applied patch /home/rlknapp/sources/spack-repo8-stat-poc/spack/var/spack/repos/builtin/packages/launchmon/launchmon-char-conv.patch
==> 46946: launchmon: Building launchmon [AutotoolsPackage]
==> 46946: launchmon: Executing phase: 'autoreconf'
==> 46946: launchmon: Executing phase: 'configure'
==> 46946: launchmon: Executing phase: 'build'
==> 46946: launchmon: Executing phase: 'install'
==> 46946: launchmon: Successfully installed launchmon
Fetch: 4.17s. Build: 47.27s. Total: 51.44s.
[+] /home/rlknapp/builds/Spack/stat-4.0.1/install8-repo8-stat-poc/linux-sles15-skylake_avx512/gcc-8.3.0/launchmon-1.0.2-4467lbv
==> Error: Installation of stat failed. Review log for details

I am not really sure where this log file is as the path and name of it are not mentioned in the output. I am using Spack 0.14, and I will inquire about this on the Spack github tomorrow morning. I have attached my output from the installation.

Thank you,

-Rashawn

2020_0304_stat-4.0.1_gcc-8.3.0_python3_mpich-gnu32_spack-02.out.txt

@rashawnLK I generally parse the output looking for any reference to "error". There seem to be a bunch of curl errors, so it looks like it was having trouble downloading some of the source tar balls. This can often be an intermittent issue with the download site, so I first suggest retrying the exact same spack install command. You can then also check to make sure you have access to the tar file being referenced in the error message. For instance, the first one I can see is https://www.x.org/archive/individual/lib/libXdmcp-1.1.2.tar.gz, which I am able to access from my site from a web browser. Please add me as a watcher to any issue you report on GitHub to Spack too.
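
For anyone who wants to automate the "look for any reference to error" step on a long install log, here is a minimal sketch. The function name `find_error_lines` is made up for illustration, and the sample text is a stand-in: the `==>` lines are copied from the output earlier in this thread, while the lowercase line is invented to show that matching is case-insensitive.

```python
import re


def find_error_lines(text, pattern=r"error"):
    """Return (line_number, line) pairs for lines matching pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [
        (number, line.rstrip())
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


# Stand-in log text; the second line is invented for illustration only.
sample = "\n".join(
    [
        "==> 46946: launchmon: Successfully installed launchmon",
        "made-up line mentioning an error in lowercase",
        "==> Error: Installation of stat failed. Review log for details",
    ]
)

for number, line in find_error_lines(sample):
    print(f"{number}: {line}")

# To scan a saved install log instead (the path is a placeholder):
#   with open("path/to/spack-install.out") as handle:
#       hits = find_error_lines(handle.read())
```

Passing a different `pattern`, for example `r"curl"`, narrows the scan to the download failures mentioned above.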

Hi Greg @lee218llnl, I looked at the curl errors and discovered that I am able to easily retrieve the libXdmcp package. I suspect something happened during my installation the other day which caused this package to not install. The good news is that I re-installed STAT successfully earlier today with the same command as before:
spack -C ${confs} install stat~fgfs %gcc@8.3.0 ^ncurses+termlib 2>&1 | tee logs/2020_0306_stat-4.0.1_gcc-8.3.0_python3_mpich-gnu32_spack-01.out

However, I encounter the same error as before:
terminate called without an active exception
is printed to the Slurm error file when stat-cl logging is not enabled; when it is enabled, the log contains:
<Mar 06 14:04:01> <STAT.C:142> STAT started at 2020-03-06-14:04:01
<Mar 06 14:04:01> <STAT_FrontEnd.C:460> Launching application and tool daemons...
<Mar 06 14:04:01> <STAT_FrontEnd.C:558> Initializing LaunchMON

I attached the installation output.

Thank you,

-Rashawn

2020_0306_stat-4.0.1_gcc-8.3.0_python3_mpich-gnu32_spack-01.out.txt

Hmm, I'm not sure what is going wrong here. What version of SLURM are you using? Would you be able to try compiling with Intel MPI and using the Intel MPI's mpirun? Another option would be to do similarly with OpenMPI. Also, do you have totalview or ddt that you can use to try and debug your SLURM-launched job? That might provide another interesting data point.

@rashawnLK any updates on this? Please see my suggestions in my previous comment.

@lee218llnl, I compiled with Intel MPI and tested using mpirun instead of srun, and I see the same behavior: the job does not appear to launch. I did a test last week and re-ran it today. For a 16-node run of AMG2013, the stat-cl call with logging enabled, using mpirun, is:

stat-cl -l FE -l BE -L /home/rlknapp/sources/A21-workloads/amg2013/gcc-8.3.0/impi-2020.0.166/AMG2013/test/statlogs -C mpirun -np 16 -host c002n[0020-0035] ./amg2013 -pooldist 1 -r 4 4 4 -P 1 2 1 -printstats

The only output printed to the Slurm *.out file is:
STAT started at 2020-03-23-15:40:14
Launching application and tool daemons...

The stat log file shows the following:
<Mar 23 15:40:14> <STAT.C:142> STAT started at 2020-03-23-15:40:14
<Mar 23 15:40:14> <STAT_FrontEnd.C:460> Launching application and tool daemons...
<Mar 23 15:40:14> <STAT_FrontEnd.C:558> Initializing LaunchMON

I do not have access to an installation of TotalView or DDT.

The version of Slurm we are using is 18.08.7.

Thank you,

-Rashawn

I don't know if he has time to help, but I'm going to rope in @dongahn. @rashawnLK you may want to try downloading LaunchMON source https://github.com/llnl/launchmon and follow the build instructions in the README, being sure to configure with "--with-test-rm=slurm" and to run make check after make install. This will build some tests in test/src and I suggest trying to run test.launch_1 and if that works, then try test.attach_1. This will help determine if the LaunchMON-SLURM interactions are working properly.
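
The "try test.launch_1 first, and only then test.attach_1" order can be scripted. This is a rough sketch, not something that ships with LaunchMON: the helper `run_in_order` is hypothetical, and it assumes that each test script reports success through an exit status of 0, which should be checked against the scripts themselves.

```python
import subprocess


def run_in_order(scripts, runner=None):
    """Run each script in turn and stop at the first failure.

    Returns a list of (script, passed) pairs for the scripts that were attempted.
    """
    if runner is None:
        # Assumption: a test passes when its process exits with status 0.
        def runner(script):
            return subprocess.run([script]).returncode == 0

    results = []
    for script in scripts:
        passed = bool(runner(script))
        results.append((script, passed))
        if not passed:
            break  # later tests are not informative if an earlier one fails
    return results


# Dry run with a stand-in runner so nothing is actually launched here.
demo = run_in_order(
    ["./test.launch_1", "./test.attach_1"],
    runner=lambda script: script.endswith("launch_1"),
)
print(demo)

# Real use, from the directory holding the built tests (test/src in the
# LaunchMON source tree, per the comment above):
#   run_in_order(["./test.launch_1", "./test.attach_1"])
```

If test.launch_1 already fails, the problem is in the LaunchMON and Slurm interaction itself rather than in STAT.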

@lee218llnl. Thanks for the build from source tip. I will do this.

-Rashawn

rashawn, I put in a PR to launchmon to fix some of the test building so that it'll work with spack. If you modify your launchmon spack package.py, then it can install tests too:

[lee218@rzwiz2:spack]$ git diff var/spack/repos/builtin/packages/launchmon/package.py
diff --git a/var/spack/repos/builtin/packages/launchmon/package.py b/var/spack/repos/builtin/packages/launchmon/package.py
index 4b80687..5c24863 100644
--- a/var/spack/repos/builtin/packages/launchmon/package.py
+++ b/var/spack/repos/builtin/packages/launchmon/package.py
@@ -14,6 +14,7 @@ class Launchmon(AutotoolsPackage):
     git      = "https://github.com/llnl/launchmon.git"
 
     version('master', branch='master')
+    version('1.0.3b', commit='ae8dde53a1e8d851a6e449cb8e87001e9d5d7534')
     version('1.0.2', sha256='1d301ccccfe0873efcd66da87ed5e4d7bafc560b00aee396d8a9365f53b3a33a')
 
     depends_on('autoconf', type='build', when='@master')
@@ -28,6 +29,18 @@ class Launchmon(AutotoolsPackage):
 
     patch('launchmon-char-conv.patch', when='@1.0.2')
 
+    def build(self, spec, prefix):
+        make()
+        make('check')
+        make('install')
+
+    def configure_args(self):
+        spec = self.spec
+        args = []
+        if spec.satisfies('@master') or spec.satisfies('@1.0.3b'):
+            args += ['--with-test-rm=slurm', '--with-test-ncore-per-CN=2', '--with-test-nnodes=2', '--with-test-rm-launcher=/usr/bin/srun', '--with-test-installed=install-tests']
+        return args
+
     def setup_build_environment(self, env):
         if self.spec.satisfies('@master'):
             # automake for launchmon requires the AM_PATH_LIBGCRYPT macro

The tests will install in /share/launchmon/tests. You should first try the test.launch_1 script. Note that you may have to edit the script: it specifies "pdebug" as the partition to launch under Slurm, which is an LLNL-specific partition.

@rashawnLK have you tried the launchmon tests?

@lee218llnl, I got this working as you described. Thank you!

@rashawnLK at what level is this working? Did it install? Were you able to successfully run the launchmon tests? Were you able to get STAT to work?

I ran the launchmon tests. My cluster will be offline tomorrow, so I will resume STAT incorporation the day after that.

@rashawnLK: Last year I had a lot of trouble using STAT in #15. As reported in #24, I am able to build and run 4.1.0 perfectly fine, so maybe you can try the same.

@rashawnLK It looks like we haven't visited this issue in a while. Are you still in need of assistance running STAT?

@lee218llnl, we are okay at this time. The issue, I think, was not identifying the PID correctly for stat-cl. We will be venturing into the 4.2* series soon.