
Hydro-analytics Paper Reproducibility Project

This is a clone of the repository that hosts the code for a hydrology research paper by Sadler et al. 2018 (https://doi.org/10.1016/j.jhydrol.2018.01.044). This repo serves as a use case for setting up and running a research reproducibility workflow.

The following changes & additions were made:

  • Debug and modified python, Jupyter Notebooks, and R codes to determine dependencies and fix broken links to data.
  • Addition of docker containers for 4 processes (2 parallel for data analytics and 2 subsequent serial ones for final model generation).
  • Automated the entire workflow process using dockers.
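Method 1 below drives this automation through all.sh. As a minimal sketch of how such a script could chain the four containers, assuming hypothetical image names (group6/prepare-floods, group6/daily-obs, group6/combine-events, group6/final-models; the repo's all.sh is authoritative):

    #!/bin/sh
    # Hypothetical orchestration sketch; the real all.sh in the repo may differ.
    REPO=$(pwd)   # run from the main repository directory

    # The two data-analytics containers are independent, so launch them in parallel.
    docker run -v "$REPO":/group6 group6/prepare-floods &
    docker run -v "$REPO":/group6 group6/daily-obs &
    wait   # block until both background containers finish

    # The two model-generation containers depend on the outputs above, so run them serially.
    docker run -v "$REPO":/group6 group6/combine-events
    docker run -v "$REPO":/group6 group6/final-models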

The goals of this reproducibility exercise were the following:

  • Containerize tools and determine system interdependencies
  • Utilize GitHub and share resources
  • Parallelize the workflow
  • Achieve reproducibility across platforms

Reproducibility Instructions

Starting a Jetstream instance is recommended: after logging in, select an Ubuntu 18.04 Devel and Docker instance at the m1.medium size (CPU: 6, Mem: 16 GB, Disk: 60 GB) and launch it. When your instance is active, you can use the web shell or run $ ssh <VM's IP> from your command line. If you do this, you can skip the Docker installation step (step 1) of Methods 1 and 2.

There are 3 methods that can be followed to reproduce the Sadler et al. 2018 paper's results:

  • All 3 methods require cloning this GitHub repository to your local machine:
    $ git clone https://github.com/cyber-carpentry/group6.git
  • These methods refer to PATH TO GITHUB REPOSITORY, which is the path from your home directory to the repository on your local machine. Enter $ pwd on the command line while inside the repository to see the path.

Method 1: Automated with pre-built Docker Images

  1. This method requires Docker to be installed on your machine. Docker's getting started guide has detailed instructions for setting up Docker on Mac/Windows/Linux.
  2. Run all the containers using $ sh all.sh on the command line.
  • This command should be run in the main repository directory (PATH TO GITHUB REPOSITORY).
  3. When the run completes, flood_events.csv, nor_daily_observations.csv, and for_model_avgs.csv should be created in the db_scripts folder. Additionally, 4 files, poisson_out_test.csv, poisson_out_train.csv, rf_out_test.csv, and rf_out_train.csv, will be generated in the /models folder.
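To confirm the run succeeded, you can list the expected outputs (the file names are those from step 3 above):

    $ ls db_scripts/flood_events.csv db_scripts/nor_daily_observations.csv db_scripts/for_model_avgs.csv
    $ ls models/poisson_out_test.csv models/poisson_out_train.csv models/rf_out_test.csv models/rf_out_train.csv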

Method 2: Build Docker Image

  1. This method requires Docker to be installed on your machine.
  2. Build the Docker image:
    $ docker build -t flood_pred .
  • This command should be run in the main GitHub directory (PATH TO GITHUB REPOSITORY).
  • There is a file called Dockerfile in this directory (use the command $ ls to check); a sketch of what such a Dockerfile might contain appears after this list.
  3. Run the Docker image using
    $ docker run -v PATH TO GITHUB REPOSITORY/group6:/group6 flood_pred
  4. When the run completes, flood_events.csv, nor_daily_observations.csv, for_model_avgs.csv, and other files should be created in the db_scripts and models folders.
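A minimal sketch of what such a Dockerfile could contain, assuming a Python 2.7 base image and the R packages listed under Method 3 (the Dockerfile shipped in the repo is authoritative):

    # Hypothetical sketch; the repo's actual Dockerfile may differ.
    FROM python:2.7.16
    # Install wget, R, and the R packages used by the modeling script (see Method 3).
    RUN apt-get update && apt-get install -y wget r-base && \
        Rscript -e "install.packages(c('caret','ggfortify','ggplot2','dplyr','RSQLite','DBI','class','randomForest'), repos='http://cran.rstudio.com/')"
    # Install the pinned Python modules from the repo's requirements.txt.
    COPY requirements.txt /tmp/requirements.txt
    RUN pip install -r /tmp/requirements.txt
    # Mirror Method 3: the input database sits one level above the repository.
    RUN wget -P / https://www.hydroshare.org/resource/9e1b23607ac240588ba50d6b5b9a49b5/data/contents/hampt_rd_data.sqlite
    # The scripts arrive via the volume mounted at /group6 (see the docker run step).
    WORKDIR /group6/db_scripts
    CMD python prepare_flood_events_table.py && \
        python make_dly_obs_table.py && \
        python by_event_for_model.py && \
        cd models && Rscript final_model_output_script.R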

Method 3: Running Manually

  1. This method requires creating both the Python and R environments for running the scripts.
  • Python 2.7.16 was used; the required Python modules with their versions are listed in requirements.txt.
  • R 3.5.1 was used with the caret, ggfortify, ggplot2, dplyr, RSQLite, DBI, class, and randomForest packages. All packages were installed from the http://cran.rstudio.com/ repo.
  • We recommend using conda to set up the required environment; instructions for doing so are below.
  2. Change to the parent directory of the repository directory:
    $ cd PATH TO REPOSITORY/..
  3. Get the database file from HydroShare:
    $ wget https://www.hydroshare.org/resource/9e1b23607ac240588ba50d6b5b9a49b5/data/contents/hampt_rd_data.sqlite
  • This downloads the file into the current directory (outside the repository directory).
  4. Go to the database scripts directory:
    $ cd group6/db_scripts
  5. Run the Python script that processes the street flooding data:
    $ python prepare_flood_events_table.py
  • This creates flood_events.csv.
  6. Run the Python script that processes the environmental data:
    $ python make_dly_obs_table.py
  • This creates nor_daily_observations.csv.
  7. Run the Python script that combines the flood and environmental data:
    $ python by_event_for_model.py
  • This requires flood_events.csv and nor_daily_observations.csv and creates for_model_avgs.csv.
  8. Change to the model directory:
    $ cd models
  9. Run the R script for the analysis:
    $ Rscript final_model_output_script.R
  10. Four files, poisson_out_test.csv, poisson_out_train.csv, rf_out_test.csv, and rf_out_train.csv, will be generated in the /models folder.
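If you prefer not to type each step, the sequence above can be wrapped in one shell script. A sketch, assuming it is saved under a hypothetical name such as run_manual.sh in the parent directory of the repository, with the Python/R environment from step 1 already active:

    #!/bin/sh
    set -e   # stop at the first failing step
    # Step 3: fetch the input database into the current (parent) directory.
    wget https://www.hydroshare.org/resource/9e1b23607ac240588ba50d6b5b9a49b5/data/contents/hampt_rd_data.sqlite
    # Steps 4-7: build the three CSV tables.
    cd group6/db_scripts
    python prepare_flood_events_table.py   # -> flood_events.csv
    python make_dly_obs_table.py           # -> nor_daily_observations.csv
    python by_event_for_model.py           # -> for_model_avgs.csv
    # Steps 8-9: fit the models.
    cd models
    Rscript final_model_output_script.R    # -> poisson_out_*.csv, rf_out_*.csv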

A smaller but more time-consuming alternative is to create a conda environment and run the entire workflow inside that fixed environment.

Instructions to create CONDA environment:

  1. Starting a Jetstream instance is recommended: after logging in, select an Ubuntu 18.04 Devel and miniconda instance at the m1.medium size (CPU: 6, Mem: 16 GB, Disk: 60 GB) and launch it. If miniconda is not available, you can download it from https://docs.conda.io/en/latest/miniconda.html.
  2. Clone the git repository:
    $ git clone https://github.com/cyber-carpentry/group6.git
  3. Create a new environment from the provided file:
    $ conda env create --name hydro --file hydro_make.yml
  4. Activate the environment:
    $ source activate hydro
  5. This environment can be used to run Method 3.
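The contents of hydro_make.yml are what pin this environment. A rough sketch of what such a file could look like, assuming the conda-forge and r channels provide Python 2.7 and the R packages from Method 3 (the file shipped in the repo is authoritative):

    # Hypothetical sketch of a conda environment file; see hydro_make.yml in the repo.
    name: hydro
    channels:
      - conda-forge
      - r
    dependencies:
      - python=2.7.16
      - r-base=3.5.1
      - r-caret
      - r-ggfortify
      - r-ggplot2
      - r-dplyr
      - r-rsqlite
      - r-dbi
      - r-class
      - r-randomforest
      # plus the Python modules pinned in requirements.txt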

When complete, flood_events.csv, nor_daily_observations.csv, and for_model_avgs.csv should be created in db_scripts. Additionally, 4 files, poisson_out_test.csv, poisson_out_train.csv, rf_out_test.csv, and rf_out_train.csv, will be generated in /models.

The team members who led this effort were participants in the Cyber Carpentry 2019 workshop at the University of North Carolina at Chapel Hill. Note: you can access the original repository here.

We welcome any suggestions, and please open an issue if you find any problems. - Thank you, the Group 6 team