choderalab/yank

T4 lysozyme tutorial with implicit solvent appears to hang at the first call to _propagate_replica()

therealchrisneale opened this issue · 0 comments

Hello,

The following command takes about 4 minutes to reach “execute _propagate_replica(0)”, then produces no further output for more than 10 minutes while still consuming CPU resources:

mpiexec.hydra -np 8 yank script --yaml=p-xylene-implicit.yaml

bash-4.2$ head -n 20 nohup.out 
Running simulation...
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
Warning: importing 'simtk.openmm' is deprecated.  Import 'openmm' instead.
2022-05-20 13:01:11,339: Setting 'CpuThreads' to 1 because MPI is active.
2022-05-20 13:01:11,629: Node 1/8: executing <function ExperimentBuilder._check_resume at 0x2b4fbddf40d0>
2022-05-20 13:01:11,631: Node 1/8: waiting for barrier after <function ExperimentBuilder._check_resume at 0x2b4fbddf40d0>
2022-05-20 13:01:11,727: Group 1/8 Node 1/1: execute _setup_molecules(p-xylene)
2022-05-20 13:01:12,583: Fixing net charge from -2.000000000015878e-06 to 4.163336342344337e-17
2022-05-20 13:01:12,595: Node 1/8: waiting for barrier after _setup_molecules
2022-05-20 13:01:12,604: Group 1/8 Node 1/1: execute get_system(t4-xylene)
2022-05-20 13:01:12,606: Setting up the systems for t4-lysozyme and p-xylene using solvent GBSA
2022-05-20 13:01:12,606: Setting up solvent phase
2022-05-20 13:01:13,047: Setting up complex phase
2022-05-20 13:01:13,831: Node 1/8: waiting for barrier after get_system

bash-4.2$ tail -n 20 nohup.out 
2022-05-20 13:05:09,532: on stmt: size = arg(0, name=size)
2022-05-20 13:05:09,532: on stmt: $0.1 = global(np: <module 'numpy' from '/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/numpy/__init__.py'>)
2022-05-20 13:05:09,532: on stmt: $0.2 = getattr(value=$0.1, attr=random)
2022-05-20 13:05:09,532: on stmt: $0.3 = getattr(value=$0.2, attr=random)
2022-05-20 13:05:09,532: on stmt: $0.4 = call $0.3(func=$0.3, args=[], kws=(), vararg=None)
2022-05-20 13:05:09,533: on stmt: $0.5 = cast(value=$0.4)
2022-05-20 13:05:09,533: on stmt: return $0.5
2022-05-20 13:05:09,533: defs defaultdict(<class 'list'>,
            {'$0.1': [<numba.core.ir.Assign object at 0x2b4fc24cd908>],
             '$0.2': [<numba.core.ir.Assign object at 0x2b4fc24cd9e8>],
             '$0.3': [<numba.core.ir.Assign object at 0x2b4fc24cdac8>],
             '$0.4': [<numba.core.ir.Assign object at 0x2b4fc24cdba8>],
             '$0.5': [<numba.core.ir.Assign object at 0x2b4fc24cdc88>],
             'size': [<numba.core.ir.Assign object at 0x2b4fc24cd828>]})
2022-05-20 13:05:09,533: SSA violators set()
2022-05-20 13:05:09,862: Mixing of replicas took    0.595s
2022-05-20 13:05:09,862: Accepted 31250/31250 attempted swaps (100.0%)
2022-05-20 13:05:09,862: Node 1/8: waiting for broadcast of <function ReplicaExchangeSampler._mix_replicas at 0x2b4fbbf810d0>
2022-05-20 13:05:09,863: Propagating all replicas...
2022-05-20 13:05:09,863: Node 1/8: execute _propagate_replica(0)

No more output is produced after this point, though top shows that the processes are still consuming CPU resources:

bash-4.2$ date; tail -n 1 nohup.out
Fri May 20 13:17:51 MDT 2022
2022-05-20 13:05:09,863: Node 1/8: execute _propagate_replica(0)
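
To put numbers on the timings, the gaps can be worked out from the timestamps above. This is only a small sketch: the `parse_log_time` helper is a name made up for illustration, and it assumes that the first timestamped log line (13:01:11,339) is a fair proxy for the start of the run and that the log timestamps and the `date` output are in the same local time zone.

```python
from datetime import datetime

# Timestamp layout used in the log lines above, e.g. "2022-05-20 13:05:09,863".
LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(stamp: str) -> datetime:
    """Parse a log timestamp (hypothetical helper, not part of YANK)."""
    return datetime.strptime(stamp, LOG_FORMAT)


# First and last timestamped lines in nohup.out.
first_line = parse_log_time("2022-05-20 13:01:11,339")
last_line = parse_log_time("2022-05-20 13:05:09,863")

# Wall-clock time reported by `date` (assumed to be the same time zone as the log).
checked_at = datetime(2022, 5, 20, 13, 17, 51)

startup = last_line - first_line   # time to reach _propagate_replica(0)
silent = checked_at - last_line    # time with no further output

print(f"startup: {startup.total_seconds():.1f} s")  # 238.5 s, about 4 minutes
print(f"silent:  {silent.total_seconds():.1f} s")   # 761.1 s, about 12.7 minutes
```

So the run had been silent for roughly 12 minutes 41 seconds at the time of the check.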

Thank you,
Chris.