/datatool

datatool: Dynamical Data Processing Tool for HPCs

Primary LanguageShellGNU General Public License v3.0GPL-3.0

Description

This repository contains scripts to process meteorological datasets in NetCDF file format. The general usage of the script (i.e., ./extract-dataset.sh) is as follows:

Usage:
  extract-dataset [options...]

Script options:
  -d, --dataset                     Meteorological forcing dataset of interest
  -i, --dataset-dir=DIR             The source path of the dataset file(s)
  -v, --variable=var1[,var2[...]]   Variables to process
  -o, --output-dir=DIR              Writes processed files to DIR
  -s, --start-date=DATE             The start date of the data
  -e, --end-date=DATE               The end date of the data
  -l, --lat-lims=REAL,REAL          Latitude's upper and lower bounds;
                                    optional; within the [-90, +90] limits
  -n, --lon-lims=REAL,REAL          Longitude's upper and lower bounds;
                                    optional; within the [-180, +180] limits
  -a, --shape-file=PATH             Path to the ESRI shapefile; optional
  -m, --ensemble=ens1,[ens2,[...]]  Ensemble members to process; optional
                                    Leave empty to extract all ensemble members
  -M, --model=model1,[model2,[...]] Models that are part of a dataset,
                                    only applicable to climate datasets, optional
  -S, --scenario=scn1,[scn2,[...]]  Climate scenarios to process, only applicable
                                    to climate datasets, optional
  -j, --submit-job                  Submit the data extraction process as a job
                                    on the SLURM system; optional
  -k, --no-chunk                    No parallelization, recommended for small domains
  -p, --prefix=STR                  Prefix  prepended to the output files
  -b, --parsable                    Parsable SLURM message mainly used
                                    for chained job submissions
  -c, --cache=DIR                   Path of the cache directory; optional
  -E, --email=user@example.com      E-mail user when job starts, ends, or               
                                    fails; optional
  -u, --account                     Digital Research Alliance of Canada's sponsor's
                                    account name; optional, defaults to 'rpp-kshook' 
  -L, --list-datasets               List all the available datasets and the
                                    corresponding keywords for '--dataset' option
  -V, --version                     Show version 
  -h, --help                        Show this screen and exit

Available Datasets

# Dataset Time Period DOI Description
1 GWF-NCAR WRF-CONUS I Hourly (Oct 2000 - Dec 2013) 10.1007/s00382-016-3327-9 link
2 GWF-NCAR WRF-CONUS II1 Hourly (Jan 1995 - Dec 2015) 10.5065/49SN-8E08 link
3 ECMWF ERA52 Hourly (Jan 1950 - Dec 2020) 10.24381/cds.adbb2d47 and link link
4 ECCC RDRSv2.1 Hourly (Jan 1980 - Dec 2018) 10.5194/hess-25-4917-2021 link
5 CCRN CanRCM4-WFDEI-GEM-CaPA 3-Hourly (Jan 1951 - Dec 2100) 10.5194/essd-12-629-2020 link
6 CCRN WFDEI-GEM-CaPA 3-Hourly (Jan 1979 - Dec 2016) 10.20383/101.0111 link
7 ORNL Daymet Daily (Jan 1980 - Dec 2022)3 10.3334/ORNLDAAC/2129 link
8 Alberta Gov Climate Dataset Daily (Jan 1950 - Dec 2100) 10.5194/hess-23-5151-201 link
9 Ouranos ESPO-G6-R2 Daily (Jan 1950 - Dec 2100) 10.1038/s41597-023-02855-z link
10 Ouranos MRCC5-CMIP6 hourly (Jan 1950 - Dec 2100) TBD link
11 NASA NEX-GDDP-CMIP6 Daily (Jan 1950 - Dec 2100) 10.1038/s41597-022-01393-4 link

General Example

As an example, follow the code block below. Please remember that you MUST have access to Digital Research Alliance of Canada (DRA) clusters (specifically Graham) and have access to RDRSv2.1 model outputs. Also, remember to generate a Personal Access Token with GitHub in advance. Enter the following codes in your Graham shell as a test case:

foo@bar:~$ git clone https://github.com/kasra-keshavarz/datatool # clone the repository
foo@bar:~$ cd ./datatool/ # move to the repository's directory
foo@bar:~$ ./extract-dataset.sh -h # view the usage message
foo@bar:~$ ./extract-dataset.sh  \
  --dataset="rdrs" \
  --dataset-dir="/project/rpp-kshook/Climate_Forcing_Data/meteorological-data/rdrsv2.1" \
  --output-dir="$HOME/scratch/rdrs_outputs/" \
  --start-date="2001-01-01 00:00:00" \
  --end-date="2001-12-31 23:00:00" \
  --lat-lims=49,51  \
  --lon-lims=-117,-115 \
  --variable="RDRS_v2.1_A_PR0_SFC,RDRS_v2.1_P_HU_09944" \
  --prefix="testing_";

See the examples directory for real-world scripts for each meteorological dataset included in this repository.

Logs

The datasets logs are generated under the $HOME/.datatool directory, only in cases where jobs are submitted to clusters' schedulers. If processing is not submitted as a job, then the logs are printed on screen.

New Datasets

If you are considering any new dataset to be added to the data repository, and subsequently the associated scripts added here, you can open a new ticket on the Issues tab of the current repository. Or, you can make a Pull Request on this repository with your own script.

Support

Please open a new ticket on the Issues tab of this repository for support.

License

Meteorological Data Processing Workflow - datatool
Copyright (C) 2022-2023, University of Saskatchewan
Copyright (C) 2023-2024, University of Calgary

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Footnotes

  1. For access to the files on Graham cluster, please contact Stephen O'Hearn.

  2. ERA5 data from 1950-1979 are based on ERA5 preliminary extenion and 1979 onwards are based on ERA5 1979-present.

  3. For the Peurto Rico domain of the dataset, data are available from January 1950 until December 2022.