T4 lysozyme example with implicit solvent runs out of memory when lots of memory appears to be available
therealchrisneale opened this issue · 0 comments
With 8 processes the run works. Output of free while running mpiexec.hydra -np 8 yank script --yaml=p-xylene-implicit.yaml:
bash-4.2$ free
              total        used        free      shared  buff/cache   available
Mem:      131934588     5161320   114950568     1014712    11822700   124444344
Swap:             0           0           0
With 20 processes the run fails. Output of free while running mpiexec.hydra -np 20 yank script --yaml=p-xylene-implicit.yaml, just before the failure:
bash-4.2$ free
              total        used        free      shared  buff/cache   available
Mem:      131934588     6578724   113531156     1019564    11824708   123022088
Swap:             0           0           0
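For scale, the two snapshots above can be compared directly. This is only a small arithmetic sketch: it assumes the numbers are in KiB (the default unit of free), and the names SNAPSHOTS and kib_to_gib are made up for illustration.

```python
# Values copied from the two `free` snapshots above.
# Assumption: units are KiB, which is the default for `free`.
SNAPSHOTS = {
    8: {"total": 131934588, "used": 5161320, "available": 124444344},
    20: {"total": 131934588, "used": 6578724, "available": 123022088},
}

KIB_PER_GIB = 1024 ** 2


def kib_to_gib(kib):
    """Convert a size in KiB to GiB."""
    return kib / KIB_PER_GIB


# Extra memory reported as "used" when going from 8 to 20 processes.
used_delta_kib = SNAPSHOTS[20]["used"] - SNAPSHOTS[8]["used"]

for nproc, snap in sorted(SNAPSHOTS.items()):
    used = kib_to_gib(snap["used"])
    available = kib_to_gib(snap["available"])
    print(f"-np {nproc}: used {used:.2f} GiB, available {available:.2f} GiB")

print(f"extra used with 20 processes: {kib_to_gib(used_delta_kib):.2f} GiB")
```

Under that assumption, both runs leave well over 100 GiB reported as available, so the physical memory shown by free does not obviously explain the failure.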
The first error message and surrounding text were:
<…snip…>
2022-05-20 13:23:05,043: WARNING - openmmtools.multistate.multistatesampler - Warning: The openmmtools.multistate API is experimental and may change in future releases
Traceback (most recent call last):
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/schema/validator.py", line 411, in call_constructor
    obj = subcls(**constructor_kwargs)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/openmmtools/multistate/replicaexchange.py", line 217, in __init__
    super(ReplicaExchangeSampler, self).__init__(**kwargs)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/openmmtools/multistate/multistatesampler.py", line 203, in __init__
    self._display_cuda_devices()
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/openmmtools/multistate/multistatesampler.py", line 1772, in _display_cuda_devices
    cuda_query_output = os.popen("nvidia-smi --query-gpu=index,gpu_name --format=csv,noheader").read().strip()
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/os.py", line 980, in popen
    bufsize=buffering)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/subprocess.py", line 1295, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/bin/yank", line 10, in <module>
    sys.exit(main())
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/cli.py", line 73, in main
    dispatched = getattr(commands, command).dispatch(command_args)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/commands/script.py", line 155, in dispatch
    yaml_builder.run_experiments(write_status=write_status)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/experiment.py", line 747, in run_experiments
    group_size = self._get_experiment_mpi_group_size(all_experiments)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/experiment.py", line 2862, in _get_experiment_mpi_group_size
    sampler_names = {self._create_experiment_sampler(exp[1], []).__class__.__name__ for exp in experiments}
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/experiment.py", line 2862, in <setcomp>
    sampler_names = {self._create_experiment_sampler(exp[1], []).__class__.__name__ for exp in experiments}
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/experiment.py", line 2990, in _create_experiment_sampler
    return schema.call_sampler_constructor(constructor_description)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/schema/validator.py", line 470, in call_sampler_constructor
    special_conversions=special_conversions)
  File "/usr/projects/mrmdesign/MCMD/CONDA_ENVS/yank-badger/lib/python3.6/site-packages/yank/schema/validator.py", line 413, in call_constructor
    raise RuntimeError('Attempt to initialize failed with: {}'.format(str(e)))
RuntimeError: Attempt to initialize failed with: [Errno 12] Cannot allocate memory
2022-05-20 13:23:05,054: CRITICAL - mpiplus.mpiplus - MPI node 1/20 raised an exception and called Abort()! The exception traceback follows
<…snip…>
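The OSError is raised when os.popen tries to spawn nvidia-smi, not during an ordinary allocation. On Linux, creating a child process can be refused with errno 12 because of the kernel's commit accounting, which free does not report. I have not confirmed that this is what happens here; the following is only a diagnostic sketch, assuming a Linux node with /proc mounted. The helpers parse_meminfo and commit_headroom_kib are hypothetical names, not part of YANK or openmmtools.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def commit_headroom_kib(info):
    """Return CommitLimit - Committed_AS in KiB, or None if either is missing.

    Note: the kernel only enforces CommitLimit strictly when
    vm.overcommit_memory is 2.
    """
    if "CommitLimit" not in info or "Committed_AS" not in info:
        return None
    return info["CommitLimit"] - info["Committed_AS"]


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        meminfo = parse_meminfo(handle.read())
    print("CommitLimit  (KiB):", meminfo.get("CommitLimit"))
    print("Committed_AS (KiB):", meminfo.get("Committed_AS"))
    print("headroom     (KiB):", commit_headroom_kib(meminfo))

    # 0 = heuristic, 1 = always allow, 2 = strict accounting
    mode_path = "/proc/sys/vm/overcommit_memory"
    if os.path.exists(mode_path):
        with open(mode_path) as handle:
            print("vm.overcommit_memory:", handle.read().strip())
```

Running this on the compute node while the 20-process job is starting, next to free, would show whether committed memory is close to the limit even though plenty of memory is listed as available.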
For what it's worth, I get an entirely different error with -np 25 (so perhaps I am just running things incorrectly since I count 25 lambda values for the complex system):
<...snip...>
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 6939 RUNNING AT ba173
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions