GLM modeling of lake temperatures

This repository is for running uncalibrated GLM models of lake temperatures.


Dependent files

  • GLM 3 template
    • '1_prep/in/glm3_template.nml' (committed to repo)
  • NLDAS driver files (stored on Caldera)
    • e.g., '1_prep/in/NLDAS_time[0.379366]_x[231]_y[167].csv'

Files from the lake-temperature-model-prep pipeline that will eventually be transferred using Globus (location in lake-temperature-model-prep --> location in this pipeline; a sketch of reading these inputs in R follows the list):

  • List of lake-specific attributes for nml modification: '7_config_merge/out/nml_list.rds' --> '1_prep/in/nml_list.rds'
  • Temperature observations: '7b_temp_merge/out/merged_temp_data_daily.feather' --> '1_prep/in/merged_temp_data_daily.feather'
  • Lake-to-state crosswalk: '2_crosswalk_munge/out/lake_to_state_xwalk.rds' --> '1_prep/in/lake_to_state_xwalk.rds'
  • Lake - GCM cell tile crosswalk: '7_drivers_munge/out/lake_cell_tile_xwalk.csv' --> '1_prep/in/lake_cell_tile_xwalk.csv'
  • Munged GCM netCDF files (one per GCM): '7_drivers_munge/out/GCM_{gcm name}.nc' --> '1_prep/in/GCM_{gcm name}.nc'
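
Once transferred, these inputs are ordinary R-readable files. The following is a minimal sketch of reading them outside the pipeline (illustrative only; the pipeline itself reads these inside targets in 1_prep, and the exact file names must match what was transferred):

# Sketch: read the transferred inputs (assumes they already sit in '1_prep/in/')
nml_list <- readRDS('1_prep/in/nml_list.rds')                                  # lake-specific nml attributes
temp_obs <- arrow::read_feather('1_prep/in/merged_temp_data_daily.feather')    # temperature observations
lake_to_state <- readRDS('1_prep/in/lake_to_state_xwalk.rds')                  # lake-to-state crosswalk
lake_cell_tile <- readr::read_csv('1_prep/in/lake_cell_tile_xwalk.csv')        # lake - GCM cell tile crosswalk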

Running the pipeline on HPC, in parallel

Tallgrass quickstart

ssh tallgrass.cr.usgs.gov
cd /caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models

# Change user permissions for collaboration.
# Best practice is to add this line to your `~/.bashrc` on tallgrass, so you don't forget!
umask 002

Singularity

Singularity is a program for running code in containers. It is fundamentally similar to Docker and can build containers from Docker images. It is the containerization technology used in the Tallgrass and Yeti HPC environments. For more information, see here.

For the following applications, you'll need to load the singularity and slurm modules:

module load singularity slurm

Here's how to pull the image that Jesse built from Docker Hub and convert it to a Singularity image (this has already been done for the image listed below, and should only need to be re-done if a new image is built):

cd /caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models
singularity pull docker://jrossusgs/glm3r:v0.7
# now you can see the singularity image: it is a file called glm3r_v0.7.sif

Running the pipeline in the Singularity container

Here's how to build targets in parallel within the Singularity container using targets::tar_make_clustermq(target_name, workers = n_workers), with a specified number of workers (up to 72, since Tallgrass has 72 cores per node). The srun command allocates a node and then runs the specified command, in this case Rscript. targets then delegates work to the n_workers cores for any parallelizable step that you don't specifically tell it to run in serial (a sketch of pinning a step to the main process follows the commands below).

# Build GCM-driven GLM models, in parallel
srun --pty -c 72 -t 7:00:00 -A watertemp singularity exec glm3r_v0.7.sif Rscript -e 'targets::tar_make_clustermq(p2_gcm_glm_uncalibrated_runs, workers=60)'
# Build NLDAS-driven GLM models, in parallel
srun --pty -c 72 -t 1:00:00 -A watertemp singularity exec glm3r_v0.7.sif Rscript -e 'targets::tar_make_clustermq(p2_nldas_glm_uncalibrated_runs, workers=60)'
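
The commands above let targets farm work out to the clustermq workers. To keep a particular step in serial on the main process instead, targets supports a per-target deployment setting. A minimal sketch of what that could look like in '_targets.R' (the target and function names here are hypothetical, shown only to illustrate the pattern):

# Fragment of _targets.R (hypothetical target; illustrates deployment = "main")
library(targets)
tar_target(
  p0_example_summary,                              # hypothetical target name
  summarize_runs(p2_gcm_glm_uncalibrated_runs),    # summarize_runs() is a hypothetical function
  deployment = "main"                              # run on the main R process, not on a clustermq worker
)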

Running the pipeline interactively

Here's how to run the Singularity container interactively on an allocated job:

srun --pty -c 72 -t 10:00:00 -A watertemp singularity exec glm3r_v0.7.sif bash
R
library(targets)
tar_make_clustermq(p2_gcm_glm_uncalibrated_runs, workers=72)
# etc

Example interactive workflow that launches model runs on Tallgrass

In git bash window:

[hcorson-dosch@tg-login1 ~] ssh tallgrass.cr.usgs.gov
[hcorson-dosch@tg-login1 ~] cd /caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models
[hcorson-dosch@tg-login1 lake-temperature-process-models] umask 002
[hcorson-dosch@tg-login1 lake-temperature-process-models] screen # set up screen so that if you lose the Pulse Secure connection, the run continues
[hcorson-dosch@tg-login1 lake-temperature-process-models] module load singularity slurm
[hcorson-dosch@tg-login1 lake-temperature-process-models] srun --pty -c 72 -t 10:00:00 -A watertemp singularity exec glm3r_v0.7.sif bash # Here I'm requesting 72 cores (1 node) for 10 hours

Once the resources have been allocated, you'll immediately be transferred to the allocated node, and will be in the container environment.

To access R, simply type R:

hcorson-dosch@ml-0008:/caldera/projects/usgs/water/iidd/datasci/lake-temp/lake-temperature-process-models$ R

R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

Once in R, you could immediately launch the model runs (here, the NLDAS model runs):

> library(targets) # load targets
> tar_make_clustermq(p3_nldas_glm_uncalibrated_output_zips, reporter='summary', workers=60) # To run the NLDAS models and extract and package the output

Or build other targets (e.g., the model configuration) before launching the model runs (here, the GCM model runs):

> library(targets)
> tar_make_clustermq(p1_gcm_model_config, reporter='summary', workers=60) # Typically I build the config first so that I can check it before launching the model run - here I'm building the GCM model config
> tar_load(p1_gcm_model_config)
> tar_load(p1_site_ids)
> nrow(p1_gcm_model_config) == (length(p1_site_ids)*6*3) # Check that the # of model runs is correct; for GCMs that's # lakes * 6 GCMs * 3 time periods
> Sys.time() # I find it helpful to have a console record of the time when I launch a run
> tar_make_clustermq(p2_gcm_glm_uncalibrated_runs, reporter='summary', workers=60) # To launch just the model runs
> tar_make_clustermq(p3_gcm_glm_uncalibrated_output_zips, reporter='summary', workers=50) # To launch the GCM model runs *and* extract and package the output
> library(tidyverse)
> tar_load(p2_gcm_glm_uncalibrated_runs)
> nrow(filter(p2_gcm_glm_uncalibrated_runs, glm_success==FALSE)) # check how many runs failed
> failed_runs <- p2_gcm_glm_uncalibrated_runs %>% filter(glm_success==FALSE) %>% group_by(site_id) %>% summarize(n_failed_runs = n()) # get summary of # of failed runs per lake
> nrow(failed_runs) # check how many lakes had failed runs
> tar_load(p2_gcm_glm_uncalibrated_run_groups) 
> length(unique(p2_gcm_glm_uncalibrated_run_groups$site_id)) # check for how many lakes all 18 runs (6 GCMs * 3 time periods) succeeded and therefore for how many lakes results will be extracted in 3_extract

Note: I've been using a number of workers < 72 in my tar_make_clustermq() call (despite having an allocated node with 72 cores) because with workers=72 the pipeline would sometimes hit an error: Error in tar_throw_run(target$metrics$error) : Resource temporarily unavailable, with warnings about 'unclean shutdown for PIDs', particularly when building the output feather files. The pipeline seems to run more smoothly when tar_make_clustermq() is given fewer workers than the number of available cores; for generating the output files I had to drop to workers = 50.

Editing the pipeline in RStudio on Tallgrass

You can also get an interactive RStudio session on Tallgrass. The best documentation for this is currently here. The tl;dr is:

# Launch the session
sbatch launch-rstudio-container.slurm
# Make sure the session is running on a compute node
squeue -u jross
# Now read the generated instructions for how to access the session
cat tmp/rstudio_jross.out

Caution

RStudio may not be as good an environment for running parallelized targets pipelines as running them through Rscript -e. The clustermq user guide says that the multicore scheduler sometimes causes problems in RStudio. I haven't run into this, but if it happens, you might need to switch to the multiprocess scheduler, which uses more RAM. That might not be a problem, just something to be aware of!
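
If you do run into that issue, the scheduler is controlled by an R option. A minimal sketch of the switch (set it before calling tar_make_clustermq(), e.g., near the top of '_targets.R' or in your RStudio session; check '_targets.R' for where the option is actually configured in this repo):

# Switch clustermq from the multicore scheduler to multiprocess (uses more RAM)
options(clustermq.scheduler = "multiprocess")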


Building the Docker image

Rebuilding the image is mostly a matter of editing the Dockerfile and running a few commands. What follows is a teaser: it won't be quite this simple, because the image is currently hosted on Jesse's Docker Hub. We should put the image on the CHS docker server instead, but that can wait until (or unless) the image needs to be rebuilt.

cd docker
docker-compose build   # maybe change version tag in docker-compose.yml first
docker-compose up      # test it
docker-compose push    # push the updated image to the server

Running the pipeline locally, in serial

You can simply build targets as normal using tar_make(), and targets will ignore the clustermq.scheduler option set in '_targets.R'.
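
For example, a serial local build of the GCM model configuration, or of the model runs themselves (slow in serial), looks like this:

# Serial, local build -- no clustermq workers involved
library(targets)
tar_make(p1_gcm_model_config)             # build just the GCM model configuration
tar_make(p2_gcm_glm_uncalibrated_runs)    # or build through the uncalibrated GCM model runs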

Running the pipeline locally, in parallel

The pipeline can be run in parallel locally through docker, just as it can be run through Singularity on tallgrass.

Simple command-line R interface:

docker pull jrossusgs/glm3r:v0.7
cd ~/lake-temperature-process-models
docker run -v '/home/jross/lake-temperature-process-models/:/lakes' -it jrossusgs/glm3r:v0.7 R
# Now you have an R prompt in the container, with the project directory mounted at `/lakes/`.
# You can `setwd("/lakes")` and start working.

Alternatively, you can run RStudio in the container and access it through your browser (the user is rstudio; the password is set in the startup command, here mypass).

docker pull jrossusgs/glm3r:v0.7
cd ~/lake-temperature-process-models
docker run -v '/home/jross/lake-temperature-process-models/:/lakes' -p 8787:8787 -e PASSWORD=mypass -e ROOT=TRUE -d jrossusgs/glm3r:v0.7
setwd("/lakes") 
# Do a lot of work at once and test your computer's fan
targets::tar_make_clustermq(p2_gcm_glm_uncalibrated_runs, workers = 32)