/deep-early-warnings-pnas

Repository to accompany the publication 'Deep learning for early warning signals of tipping points', PNAS (2021)

Primary LanguagePythonOtherNOASSERTION

Deep learning for early warning signals of tipping points

This repository contains code to accompany the publication

Deep learning for early warning signals of tipping points. Thomas M. Bury, R. I. Sujith, Induja Pavithran, Marten Scheffer, Timothy M. Lenton, Madhur Anand, Chris T. Bauch. Proceedings of the National Academy of Sciences 2021, 118 (39) e2106140118; DOI: 10.1073/pnas.2106140118

Requirements

Python 3.7 is required. To install python package dependencies, use the command

pip install -r requirements.txt

within a new virtual environment.

The bifurcation continuation software AUTO-07P is required. Installation instructions are provided at http://www.macs.hw.ac.uk/~gabriel/auto07/auto.html

Directories

./dl_train: Code to train the DL algorithm

./figures_pnas: Code to generate figures used in manuscript.

./test_models: Code to simulate and compute early warning signals for the test models. These include the Consumer-Resource model, May's harvesting model and an SEIRx model.

./test_empirical: Code to pre-process and compute early warning signals for the empirical datasets (see below).

./training_data: Code to generate training data for the deep learning algorithm.

Workflow

The results in the paper are obtained from the following workflow:

  1. Generate the training data. We generate two sets of training data. One for the 500-classifier and the other for the 1500-classifier. The 500-classifier (1500-classifier) is trained on 500,000 (200,000) time series, each of length 500 (1500) data points. Run

    bash training_data/run_single_batch.sh $batch_num $ts_len

    where $batch_num is a batch number (integer) and $ts_len is a time series length (500 or 1500). This generates 4,000 time series, consisting of 1000 time series for each possible outcome (fold, Hopf, transcritical, Null). Each time series is saved as a csv file. This alone can take up to 1000 minutes (~17 hours) on a single CPU. We therefore run multiple batches in parallel on a CPU cluster at the University of Waterloo. This cluster uses the Slurm workload manager. The script to submit the 125 batches (125x4000=500k time series) for the 500-classifier is submit_multi_batch_500.py. The script to submit the 50 batches (50x4000=200k time series) for the 1500-classifier is submit_multi_batch_1500.py.

    Once every batch has been generated, the output data from each batch is combined using

    bash combine_batches.sh $num_batches $ts_len

    where $num_batches is the total number of batches being used. This also stacks the labels.csv and groups.csv files, and compresses the folder containing the time series data.

    The final compressed output comes out at 5.87GB for the 500-classifier and 5.38GB for the 1500-classifier. Both datasets are archived on Zenodo at https://zenodo.org/record/5527154#.YU9SuGZKhqs.

  2. Train the DL classifiers. 20 neural networks are trained using the training data. 10 networks use time series that are padded on both the left and the right with zeros (model_type=1). 10 networks use time series that are padded only on the left with zeros (model_type=2). Results reported in the paper take the average prediction from these 20 networks. To train a single neural network of a given model_type and index kk, run

    python ./dl_train/DL_training.py $model_type $kk

    This will export the trained model (including weights, biases and architecture) to the directory ./dl_train/best_models/. We run this for model_type in [1,2] and kk in [1,2,3,...,10]. Taking model_type and kk as command line parameters allows training of multiple neural networks in parallel if one has access to mulitple GPUs. The code currently trains networks using the 1500-length time series. To train using the 500-length time series, adjust the parameters lib_size and ts_len, as indicated in the comments of the code. Time to train a single neural network using a GPU is approx. 24 hours.

  3. Generate model data for testing. Simulate the models used to test the DL algorithm. Code to do this is available in ./test_models/. For example, to simulate trajectories of May's harvesting model going through a Fold bifurcation, run

    python ./test_models/may_fold_1500/sim_may_fold.py

    Null trajectories can be generated by running sim_may_fold_null.py. The same file notation is used for the other models. These scripts also detrend the time series to get residuals, and compute the variance and lag-1 autocorrelation.

  4. Process empirical data. Scripts to process the empirical data (including sampling and detrending) are availalble in the directory ./test_empirical/. Residuals are obtained and variance and lag-1 autocorrleation are computed.

  5. Generate DL predictions. We now feed the residual data into the neural networks to make predictions. This may be done for a single residual time series by running

    python ./dl_train/DL_apply.py
    

    This code takes in a residual time series at filepath, and exports the averaged predictions as a csv file to filepath_out. The variables filepath, filepath_out and ts_len should be modified to suit the residual data and the type of classifier being used (1500 or 500-classifier). The script currently computes predictions made by the 1500-classifier on residual data from May's harvesting model. If using your own data, it is important to detrend it using a Lowess filter with span 0.20, as is done for the training data.

  6. Compute ROC statistics. To compare the performace of the DL classifiers vs. variance and lag-1 autocorrelation, we compute ROC statistics. Each directory within ./test_models/ and ./test_empirical/ has a script called compute_roc.py which does this.

Data sources

The empirical data used in this study are available from the following sources:

  1. Sedimentary archive data from the Mediterranean Sea are available at the PANGAEA data repository. Data were preprocessed according to the study Hennekam, Rick, et al. "Early‐warning signals for marine anoxic events." Geophysical Research Letters 47.20 (2020): e2020GL089183.
  2. Thermoacoustic instability data are available in this repository here. Data were collected by Induja Pavithran and R. I. Sujith and were first published in Pavithran, I. and Sujith, R.I. "Effect of rate of change of parameter on early warning signals for critical transitions" Chaos: An Interdisciplinary Journal of Nonlinear Science 31.1 (2021): 013116.
  3. Paleoclimate transition data are available from the World Data Center for Paleoclimatology, National Geophysical Data Center, Boulder, Colorado. Data were preprocessed according to the study Dakos, Vasilis, et al. "Slowing down as an early warning signal for abrupt climate change." Proceedings of the National Academy of Sciences 105.38 (2008): 14308-14312.

License

Shield: CC BY-NC-SA 4.0

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

CC BY-NC-SA 4.0

Copyright © 2021, Thomas Bury (McGill University), Chris Bauch (University of Waterloo), Madhur Anand (University of Guelph)