Computing resources for large numbers of simulations

I'd like to poll the community and ask: what computing resources do folks use for their ED2 simulations? Supercomputing clusters? The cloud? Massive numbers of desktop machines?

I have 500 simulations planned, all single-site. Individually, they can easily run on a desktop, but at about 100 hours per run. I am trying to decide the best way to get them all done.

My assumption is that because these are single-site runs, they would not benefit from MPI and would not run any faster on a supercomputer. The main benefit there would be running several at once, if I could reserve a node for that long.

I have done a test simulation on Amazon AWS, and it definitely works, but it will be pretty expensive to do them all that way.

I'd love to hear any advice or anecdotes you have.

MPI can definitely help speed up single-site runs -- the degree to which it does so depends on how many patches your run has, since patches are what get farmed out to different threads. For my various large ensemble work, it's a tradeoff between how many instances you want running at once versus how fast you want each instance to run, and I typically end up with a compromise of the two depending on which system I run on. At this point, I pretty much only run with MPI turned on, but am probably an outlier in that.
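As a generic sketch of what such a launch can look like (assuming the executable was built with MPI and/or OpenMP enabled; the rank and thread counts here are placeholders, not recommendations):

# Hypothetical hybrid launch: 4 MPI ranks, each with 6 OpenMP threads
export OMP_NUM_THREADS=6
mpirun -np 4 ./ed_2.2-opt -f ED2IN_mysite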

Others can chime in with what they do, but I've done a combination of a computing cluster with a managed queue (which is really nice because you can just set up the list and it will crank through them) and manually batch-starting jobs on a local machine/server (tedious, but easier to monitor/debug). I almost always beta-test the workflows on a fresh (semi-)local machine instance, where it's easier to monitor progress and output, before farming out to a cluster where things typically have to be more hands-off. I always find compiling issues when setting up on a new cluster, so it helps to be able to separate compilation problems from workflow problems.
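For the "set up the list and let the queue crank through it" approach, a sketch of how a SLURM job array could farm out the 500 runs (the job name, naming scheme, paths, and walltime are placeholders; adjust the time limit around your ~100 h estimate):

#!/bin/bash
#SBATCH --job-name=ed2-ensemble
#SBATCH --array=1-500              # one array task per simulation
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1          # raise this if the build uses OpenMP
#SBATCH --time=120:00:00           # padded around the ~100 h per-run estimate
#SBATCH --output=logs/ed2_%a.out

# Hypothetical naming scheme: ED2IN_site001 ... ED2IN_site500
ED2IN=$(printf "ED2IN_site%03d" "${SLURM_ARRAY_TASK_ID}")
./ed_2.2-opt -f "${ED2IN}"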

If your runs might be sensitive to conditions that cause failures (integration step not converging, potentially wonky met, buggy code), I have some bash scripts that monitor run status and either do a restart or send an email depending on whether the run finished, timed out, or failed; happy to share them if they would be helpful.
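To sketch the idea (this is not those actual scripts; the completion string, log file pattern, job naming, email address, and restart script below are placeholders you would need to adapt):

#!/bin/bash
# Rough monitor sketch: inspect each run's log and decide what to do next
for log in logs/ed2_*.out; do
    run=$(basename "${log}" .out)
    if grep -q "ED execution ends" "${log}"; then      # assumed completion string
        echo "${run}: finished"
    elif squeue -h --name "${run}" | grep -q .; then   # assumes matching job names
        echo "${run}: still running"
    else
        echo "${run}: stopped without finishing" \
            | mail -s "ED2 run ${run} needs a restart" you@example.org
        # sbatch restart_${run}.sh                     # placeholder resubmission
    fi
done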

We do jobs like that on the cluster by sending individual runs into the queue so that they run on different nodes at the same time. This sort of execution is naively parallel -- no need for MPI. If your cluster doesn't have the capacity, I'd recommend trying to get an XSEDE or CyVerse allocation before paying for AWS. For any cloud work (CyVerse, AWS, etc.) you can leverage the ED2 Docker container the PEcAn team built rather than trying to build ED2 on each instance: https://hub.docker.com/r/pecan/model-ed2-2.2.0
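For what it's worth, a sketch of pulling and running that container (the mount point, working directory, and the command name inside the image are assumptions, not verified against the image; check the PEcAn documentation for the actual invocation):

docker pull pecan/model-ed2-2.2.0
# Assumed usage: mount a directory containing ED2IN, the met drivers, and output space,
# then call the ED2 executable inside the container (the executable name here is a guess)
docker run --rm -v /path/to/run_dir:/data -w /data pecan/model-ed2-2.2.0 ed2.2.0 -f /data/ED2IN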

Just a minor clarification: for single-site runs, OpenMP is what helps speed up the simulations (if compiling with ifort, the option is -qopenmp, and if compiling with gfortran, the option is -fopenmp). Depending on the job scheduler, you also need to specify how many CPUs per task to use.
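To illustrate where the compiler flag goes (a generic compile/link example with placeholder file names, not ED2's actual build configuration):

# With gfortran the OpenMP flag must be present at both compile and link time:
gfortran -O3 -fopenmp -c some_module.f90
gfortran -O3 -fopenmp -o ed_2.2-opt *.o
# With ifort, the equivalent flag is -qopenmp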

If you use SLURM, you must set --cpus-per-task=12 (or however many cores are available per socket) and set the OMP_NUM_THREADS environment variable, something like this:

OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK} ed_2.2-opt -f ED2IN_myrun
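Expanded into a minimal batch script (the job name, walltime, and file names are placeholders):

#!/bin/bash
#SBATCH --job-name=ed2-single-site
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12         # match the cores available per socket/node
#SBATCH --time=120:00:00
#SBATCH --output=ed2_%j.out

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
./ed_2.2-opt -f ED2IN_myrun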

Regardless of the job scheduler, to make sure the model is actually using OpenMP, you can check the beginning of the standard output. Look for the OMP parallel info: the maximum thread count should match the number of cpus-per-task you requested.