MCMC step stuck using ESPEI version 0.8.3
Dear Brandon,
I ran into some problems when using version 0.8.3. The MCMC step takes a very long time and ultimately gets stuck at the beginning, and the log stays empty except for a few warnings. Would you help me with this problem?
condalist.txt
/lustre/home/acct-msezbb/msezbb/.conda/envs/espei2021/lib/python3.9/site-packages/ipopt/__init__.py:13: FutureWarning: The module has been renamed to 'cyipopt' from 'ipopt'. Please import using 'import cyipopt' and remove all uses of 'import ipopt' in your code as this will be deprecated in a future release.
warnings.warn(msg, FutureWarning)
/lustre/home/acct-msezbb/msezbb/.conda/envs/espei2021/lib/python3.9/site-packages/cyipopt/utils.py:43: FutureWarning: The function named 'setLoggingLevel' will soon be deprecated in CyIpopt. Please replace all uses and use 'set_logging_level' going forward.
warnings.warn(msg, FutureWarning)
Those two warnings are safe to ignore and will go away when pycalphad 0.8.5 is released.
In 0.8 and later, the initial MCMC startup time will likely be a little longer, but each iteration should take the same time or be slightly faster. Can you provide a comparison of the time to call the likelihood function (printed when the verbosity is set to 2) between 0.8.3 and the latest 0.7.X release that you had working?
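For reference, verbosity is set in the output section of the ESPEI input YAML. A minimal sketch (the rest of your input file stays whatever you already have) would be:

output:
  verbosity: 2   # 2 enables TRACE-level messages, including the likelihood timing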
It takes about 3 hours to finish running with 0.7.9+3.gd4625e7=dev_0.
I have set the verbosity to 2, but the log shows nothing after almost 3 days of running.
INFO:espei.espei_script - espei version 0.8.2
INFO:espei.espei_script - If you use ESPEI for work presented in a publication, we ask that you cite the following paper:
B. Bocklund, R. Otis, A. Egorov, A. Obaied, I. Roslyakova, Z.-K. Liu, ESPEI for efficient thermodynamic database development, modification, and uncertainty quantification: application to Cu-Mg, MRS Commun. (2019) 1-10. doi:10.1557/mrc.2019.59.
TRACE:espei.espei_script - Loading and checking datasets.
TRACE:espei.espei_script - Finished checking datasets
TRACE:espei.espei_script - Loading and checking datasets.
TRACE:espei.espei_script - Finished checking datasets
After these steps, the dask scheduler usually starts. Maybe your dask scheduler is not starting correctly. Can you make progress by setting scheduler: null?
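For reference, that setting goes in the mcmc section of the input YAML; a minimal sketch (keep the rest of your mcmc settings as they are) is:

mcmc:
  iterations: 1000   # keep whatever value you already use
  scheduler: null    # run without dask parallelization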
I run on my school's high-performance computing center, so I don't know how to set this up. Could you show me how to set it up? All previous versions ran on this platform before.
Can you first check that turning off the scheduler works? I want to make sure everything else is working correctly. https://espei.org/en/latest/writing_input.html#scheduler
According to the solution you provided, MCMC has started to run normally. Thank you for your help. May I ask what caused this problem?
log2.txt
According to the solution you provided, MCMC has started to run normally.
Great, so it looks like starting dask for parallelization was indeed the issue.
May I ask what caused this problem?
I'm not sure yet, but I think we can figure it out 🙂. ESPEI is intended to work on HPCs and works well when using one compute node without any special configuration.
- Are you trying to use scheduler: dask on your cluster, or a scheduler file with MPI? (See the sketch after this list.)
- Have you tried again with dask as the scheduler to verify that it's still not working?
- Are you trying to run on one node or on multiple nodes? Any other relevant details from your HPC setup or batch submission file (if relevant) would be helpful.
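For concreteness, here is a rough sketch of the two configurations the first question refers to. The scheduler-file name and the dask-mpi launch line are illustrative assumptions, not taken from your setup:

# Option A: ESPEI starts a local dask cluster on the node it runs on
mcmc:
  scheduler: dask

# Option B (illustrative; check the scheduler docs linked above for your version):
# point ESPEI at an externally started dask scheduler, for example one launched
# with dask-mpi writing a scheduler file
# (roughly: mpirun -np 40 dask-mpi --scheduler-file my-scheduler.json)
mcmc:
  scheduler: my-scheduler.json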
- Since my own computer keeps reporting errors after installing conda, I have been using the school's HPC with scheduler: dask. The stuck MCMC run was also on the school cluster.
- Today I tested with scheduler: dask again for 12 hours, but it is still stuck, with no calculation and no log output.
- I used 40 cores on one node when running on the HPC.
This is the distributed.yaml file on the cluster.
ESPEI basically starts a dask cluster this way:
import multiprocessing
from dask.distributed import LocalCluster, Client

# One single-threaded worker process per core, with no memory limit on the workers
cores = multiprocessing.cpu_count()
cluster = LocalCluster(n_workers=cores, threads_per_worker=1, processes=True, memory_limit=0)
client = Client(cluster)
print(client.scheduler_info())
Can you run a Python script containing this and see if it successfully starts? The dask documentation may be helpful to review. This may require help from your HPC administrator.
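As a starting point, here is a minimal, self-contained version of such a test script. The square task and the __main__ guard are just additions for testing on your end (the guard is needed on some systems when processes=True); they are not part of ESPEI:

import multiprocessing

from dask.distributed import Client, LocalCluster


def square(x):
    return x * x


if __name__ == "__main__":
    # Mirror ESPEI's setup: one single-threaded worker process per core
    cores = multiprocessing.cpu_count()
    cluster = LocalCluster(n_workers=cores, threads_per_worker=1,
                           processes=True, memory_limit=0)
    client = Client(cluster)
    print(client.scheduler_info())

    # Submit a trivial task to confirm the workers actually run work
    future = client.submit(square, 4)
    print("Result from a worker:", future.result())

    client.close()
    cluster.close()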