Software to train/evaluate models to reconstruct missing values in climate data (e.g., HadCRUT4) based on a U-Net with partial convolutions.
- pytorch>=1.11.0
- tqdm>=4.64.0
- torchvision>=0.12.0
- torchmetrics>=0.11.2
- numpy>=1.21.6
- matplotlib>=3.5.1
- tensorboardX>=2.5
- tensorboard>=2.9.0
- xarray>=2022.3.0
- dask>=2022.7.0
- netcdf4>=1.5.8
- setuptools==59.5.0
- xesmf>=0.6.2
- cartopy>=0.20.2
- numba>=0.55.1
An Anaconda environment with all the required dependencies can be created using environment.yml
:
conda env create -f environment.yml
To activate the environment, use:
conda activate crai
environment-cuda.yml
should be used when working with GPUs using CUDA.
climatereconstructionAI
can be installed using pip
in the current directory:
pip install .
The software can be used to:
- train a model (training)
- infill climate datasets using a trained model (evaluation)
The directory containing the climate datasets should have the following sub-directories:
data
andval
for trainingtest
for evaluation
The climate datasets should be in netCDF format and placed in the corresponding sub-directories.
The missing values can be defined separately as masks containing zeros (for the missing values) and ones (for the valid values). These masks should be in netCDF format and have the same dimension as the climate dataset. For the training, it is possible to shuffle the sequence of masks by using the "shuffle-masks" option.
A PyTorch model is required for the evaluation.
Once installed, the package can be used as:
- a command line interface (CLI):
- training:
crai-train
- evaluation:
crai-evaluate
- a Python library:
- training:
from climatereconstructionai import train train()
- evaluation:
from climatereconstructionai import evaluate evaluate()
For more information about the arguments:
usage: crai-train [-h] [--data-root-dir DATA_ROOT_DIR] [--mask-dir MASK_DIR] [--log-dir LOG_DIR] [--data-names DATA_NAMES] [--mask-names MASK_NAMES] [--data-types DATA_TYPES]
[--n-target-data N_TARGET_DATA] [--device DEVICE] [--shuffle-masks] [--channel-steps CHANNEL_STEPS] [--lstm-steps LSTM_STEPS] [--gru-steps GRU_STEPS]
[--encoding-layers ENCODING_LAYERS] [--pooling-layers POOLING_LAYERS] [--conv-factor CONV_FACTOR] [--weights WEIGHTS] [--steady-masks STEADY_MASKS]
[--loop-random-seed LOOP_RANDOM_SEED] [--cuda-random-seed CUDA_RANDOM_SEED] [--deterministic] [--attention] [--channel-reduction-rate CHANNEL_REDUCTION_RATE] [--disable-skip-layers]
[--disable-first-bn] [--masked-bn] [--lazy-load] [--global-padding] [--normalize-data] [--n-filters N_FILTERS] [--out-channels OUT_CHANNELS] [--dataset-name DATASET_NAME]
[--min-bounds MIN_BOUNDS] [--max-bounds MAX_BOUNDS] [--profile] [--val-names VAL_NAMES] [--snapshot-dir SNAPSHOT_DIR] [--resume-iter RESUME_ITER] [--batch-size BATCH_SIZE]
[--n-threads N_THREADS] [--multi-gpus] [--finetune] [--lr LR] [--lr-finetune LR_FINETUNE] [--max-iter MAX_ITER] [--log-interval LOG_INTERVAL]
[--lr-scheduler-patience LR_SCHEDULER_PATIENCE] [--save-model-interval SAVE_MODEL_INTERVAL] [--n-final-models N_FINAL_MODELS] [--final-models-interval FINAL_MODELS_INTERVAL]
[--loss-criterion LOSS_CRITERION] [--eval-timesteps EVAL_TIMESTEPS] [-f LOAD_FROM_FILE] [--vlim VLIM] [--lambda-loss LAMBDA_LOSS] [--val-metrics VAL_METRICS]
[--tensor-plots TENSOR_PLOTS] [--early-stopping-delta EARLY_STOPPING_DELTA] [--early-stopping-patience EARLY_STOPPING_PATIENCE] [--n-iters-val N_ITERS_VAL]
options:
-h, --help show this help message and exit
--data-root-dir DATA_ROOT_DIR
Root directory containing the climate datasets
--mask-dir MASK_DIR Directory containing the mask datasets
--log-dir LOG_DIR Directory where the log files will be stored
--data-names DATA_NAMES
Comma separated list of netCDF files (climate dataset) for training/infilling
--mask-names MASK_NAMES
Comma separated list of netCDF files (mask dataset). If None, it extracts the masks from the climate dataset
--data-types DATA_TYPES
Comma separated list of variable types, in the same order as data-names and mask-names
--n-target-data N_TARGET_DATA
Number of data-names (from last) to be used as target data
--device DEVICE Device used by PyTorch (cuda or cpu)
--shuffle-masks Select mask indices randomly
--channel-steps CHANNEL_STEPS
Comma separated number of considered sequences for channeled memory:past_steps,future_steps
--lstm-steps LSTM_STEPS
Comma separated number of considered sequences for lstm: past_steps,future_steps
--gru-steps GRU_STEPS
Comma separated number of considered sequences for gru: past_steps,future_steps
--encoding-layers ENCODING_LAYERS
Number of encoding layers in the CNN
--pooling-layers POOLING_LAYERS
Number of pooling layers in the CNN
--conv-factor CONV_FACTOR
Number of channels in the deepest layer
--weights WEIGHTS Initialization weight
--steady-masks STEADY_MASKS
Comma separated list of netCDF files containing a single mask to be applied to all timesteps. The number of steady-masks must be the same as out-channels
--loop-random-seed LOOP_RANDOM_SEED
Random seed for iteration loop
--cuda-random-seed CUDA_RANDOM_SEED
Random seed for CUDA
--deterministic Disable cudnn backends for reproducibility
--attention Enable the attention module
--channel-reduction-rate CHANNEL_REDUCTION_RATE
Channel reduction rate for the attention module
--disable-skip-layers
Disable the skip layers
--disable-first-bn Disable the batch normalization on the first layer
--masked-bn Use masked batch normalization instead of standard BN
--lazy-load Use lazy loading for large datasets
--global-padding Use a custom padding for global dataset
--normalize-data Normalize the input climate data to 0 mean and 1 std
--n-filters N_FILTERS
Number of filters for the first/last layer
--out-channels OUT_CHANNELS
Number of channels for the output data
--dataset-name DATASET_NAME
Name of the dataset for format checking
--min-bounds MIN_BOUNDS
Comma separated list of values defining the permitted lower-bound of output values
--max-bounds MAX_BOUNDS
Comma separated list of values defining the permitted upper-bound of output values
--profile Profile code using tensorboard profiler
--val-names VAL_NAMES
Comma separated list of netCDF files (climate dataset) for validation
--snapshot-dir SNAPSHOT_DIR
Parent directory of the training checkpoints and the snapshot images
--resume-iter RESUME_ITER
Iteration step from which the training will be resumed
--batch-size BATCH_SIZE
Batch size
--n-threads N_THREADS
Number of workers used in the data loader
--multi-gpus Use multiple GPUs, if any
--finetune Enable the fine tuning mode (use fine tuning parameterization and disable batch normalization
--lr LR Learning rate
--lr-finetune LR_FINETUNE
Learning rate for fine tuning
--max-iter MAX_ITER Maximum number of iterations
--log-interval LOG_INTERVAL
Iteration step interval at which a tensorboard summary log should be written
--lr-scheduler-patience LR_SCHEDULER_PATIENCE
Patience for the lr scheduler
--save-model-interval SAVE_MODEL_INTERVAL
Iteration step interval at which the model should be saved
--n-final-models N_FINAL_MODELS
Number of final models to be saved
--final-models-interval FINAL_MODELS_INTERVAL
Iteration step interval at which the final models should be saved
--loss-criterion LOSS_CRITERION
Index defining the loss function (0=original from Liu et al., 1=MAE of the hole region)
--eval-timesteps EVAL_TIMESTEPS
Sample indices for which a snapshot is created at each iter defined by log-interval
-f LOAD_FROM_FILE, --load-from-file LOAD_FROM_FILE
Load all the arguments from a text file
--vlim VLIM Comma separated list of vmin,vmax values for the color scale of the snapshot images
--lambda-loss LAMBDA_LOSS
Comma separated list of lambda factors (key) followed by their corresponding values.Overrides the loss_criterion pre-setting
--val-metrics VAL_METRICS
Comma separated list of metrics that are evaluated on the val dataset at log-interval
--tensor-plots TENSOR_PLOTS
Comma separated list of 2D plots to be added to tensorboard (error, distribution, correlation)
--early-stopping-delta EARLY_STOPPING_DELTA
Mean relative delta of the val loss used for the termination criterion
--early-stopping-patience EARLY_STOPPING_PATIENCE
Number of log-interval iterations used for the termination criterion
--n-iters-val N_ITERS_VAL
Number of batch iterations used to average the validation loss
usage: crai-evaluate [-h] [--data-root-dir DATA_ROOT_DIR] [--mask-dir MASK_DIR] [--log-dir LOG_DIR] [--data-names DATA_NAMES] [--mask-names MASK_NAMES] [--data-types DATA_TYPES]
[--n-target-data N_TARGET_DATA] [--device DEVICE] [--shuffle-masks] [--channel-steps CHANNEL_STEPS] [--lstm-steps LSTM_STEPS] [--gru-steps GRU_STEPS]
[--encoding-layers ENCODING_LAYERS] [--pooling-layers POOLING_LAYERS] [--conv-factor CONV_FACTOR] [--weights WEIGHTS] [--steady-masks STEADY_MASKS]
[--loop-random-seed LOOP_RANDOM_SEED] [--cuda-random-seed CUDA_RANDOM_SEED] [--deterministic] [--attention] [--channel-reduction-rate CHANNEL_REDUCTION_RATE] [--disable-skip-layers]
[--disable-first-bn] [--masked-bn] [--lazy-load] [--global-padding] [--normalize-data] [--n-filters N_FILTERS] [--out-channels OUT_CHANNELS] [--dataset-name DATASET_NAME]
[--min-bounds MIN_BOUNDS] [--max-bounds MAX_BOUNDS] [--profile] [--model-dir MODEL_DIR] [--model-names MODEL_NAMES] [--evaluation-dirs EVALUATION_DIRS] [--eval-names EVAL_NAMES]
[--use-train-stats] [--create-graph] [--plot-results PLOT_RESULTS] [--partitions PARTITIONS] [--maxmem MAXMEM] [--split-outputs] [-f LOAD_FROM_FILE]
options:
-h, --help show this help message and exit
--data-root-dir DATA_ROOT_DIR
Root directory containing the climate datasets
--mask-dir MASK_DIR Directory containing the mask datasets
--log-dir LOG_DIR Directory where the log files will be stored
--data-names DATA_NAMES
Comma separated list of netCDF files (climate dataset) for training/infilling
--mask-names MASK_NAMES
Comma separated list of netCDF files (mask dataset). If None, it extracts the masks from the climate dataset
--data-types DATA_TYPES
Comma separated list of variable types, in the same order as data-names and mask-names
--n-target-data N_TARGET_DATA
Number of data-names (from last) to be used as target data
--device DEVICE Device used by PyTorch (cuda or cpu)
--shuffle-masks Select mask indices randomly
--channel-steps CHANNEL_STEPS
Comma separated number of considered sequences for channeled memory:past_steps,future_steps
--lstm-steps LSTM_STEPS
Comma separated number of considered sequences for lstm: past_steps,future_steps
--gru-steps GRU_STEPS
Comma separated number of considered sequences for gru: past_steps,future_steps
--encoding-layers ENCODING_LAYERS
Number of encoding layers in the CNN
--pooling-layers POOLING_LAYERS
Number of pooling layers in the CNN
--conv-factor CONV_FACTOR
Number of channels in the deepest layer
--weights WEIGHTS Initialization weight
--steady-masks STEADY_MASKS
Comma separated list of netCDF files containing a single mask to be applied to all timesteps. The number of steady-masks must be the same as out-channels
--loop-random-seed LOOP_RANDOM_SEED
Random seed for iteration loop
--cuda-random-seed CUDA_RANDOM_SEED
Random seed for CUDA
--deterministic Disable cudnn backends for reproducibility
--attention Enable the attention module
--channel-reduction-rate CHANNEL_REDUCTION_RATE
Channel reduction rate for the attention module
--disable-skip-layers
Disable the skip layers
--disable-first-bn Disable the batch normalization on the first layer
--masked-bn Use masked batch normalization instead of standard BN
--lazy-load Use lazy loading for large datasets
--global-padding Use a custom padding for global dataset
--normalize-data Normalize the input climate data to 0 mean and 1 std
--n-filters N_FILTERS
Number of filters for the first/last layer
--out-channels OUT_CHANNELS
Number of channels for the output data
--dataset-name DATASET_NAME
Name of the dataset for format checking
--min-bounds MIN_BOUNDS
Comma separated list of values defining the permitted lower-bound of output values
--max-bounds MAX_BOUNDS
Comma separated list of values defining the permitted upper-bound of output values
--profile Profile code using tensorboard profiler
--model-dir MODEL_DIR
Directory of the trained models
--model-names MODEL_NAMES
Model names
--evaluation-dirs EVALUATION_DIRS
Directory where the output files will be stored
--eval-names EVAL_NAMES
Prefix used for the output filenames
--use-train-stats Use mean and std from training data for normalization
--create-graph Create a Tensorboard graph of the NN
--plot-results PLOT_RESULTS
Create plot images of the results for the comma separated list of time indices
--partitions PARTITIONS
Split the climate dataset into several partitions along the time coordinate
--maxmem MAXMEM Maximum available memory in MB (overwrite partitions parameter)
--split-outputs Do not merge the outputs when using multiple models and/or partitions
-f LOAD_FROM_FILE, --load-from-file LOAD_FROM_FILE
Load all the arguments from a text file
An example can be found in the directory demo
.
The instructions to run the example are given in the README.md file.
CRAI
is licensed under the terms of the BSD 3-Clause license.
CRAI
is maintained by the Climate Informatics and Technology group at DKRZ (Deutsches Klimarechenzentrum).
- Previous contributing authors: Naoto Inoue, Christopher Kadow, Stephan Seitz
- Current contributing authors: Johannes Meuer, Maximilian Witte, Étienne Plésiat.