not released to public yet
Input data, process data and final data can all be found on AWS S3:
s3://wri-projects/Aqueduct30
The Water Risk Atlas final data (Annual, Monthly):
s3://wri-projects/Aqueduct30/finalData/Y2019M01D14_RH_Aqueduct_Results_V01
Country rankings final data:
s3://wri-projects/Aqueduct30/finalData/Y2019M04D15_RH_GA_Aqueduct_Results_V01
Q:
In the file, some geo units (by string_id) have an indicator labeled as “awr”. Could you explain what that is?
A:
awr in Aqueduct 3.0 stands for aggregated water risk. There are four options: tot (total), qan (quantity), qal (quality) and rrr (regulatory and reputational). In combination with an industry weighting scheme (see technical note), these represent the aggregated water risk. awr tot is also referred to as "overall water risk".
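As an illustrative sketch, an aggregated water risk label can be thought of as the awr prefix combined with an industry weighting scheme and one of the four groups. The column-name pattern and the "def" weighting scheme below are assumptions for illustration, not the exact Aqueduct 3.0 schema:

```python
# Hypothetical label construction; "def" (default) is an assumed weighting
# scheme name and the underscore pattern is illustrative only.
groups = ["tot", "qan", "qal", "rrr"]  # total, quantity, quality, reg. & reputational

def awr_label(weighting_scheme, group):
    """Build an aggregated water risk label, e.g. 'awr_def_tot'."""
    assert group in groups
    return "awr_{}_{}".format(weighting_scheme, group)

labels = [awr_label("def", g) for g in groups]
```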
Q:
If we look at unique string_ids, why does master_geom.shp have 68,511 units while annual_normalized.csv only has 68,365? For example, the unit (string_id: None-ALA.13_1-None) is not in the csv file.
A:
"None-ALA.13_1-None" means that it's not part of a hydroBasin nor a goundwater aquifer. It's part of the GADM level 0 (usually country) of Åland. For Åland, we don't have any country information either. We used an inner join leading to the different shapes of the data.
Q:
The number of indicators each geo unit (by string_id) has is not always the same. Some of them have 14 (e.g., 434823-CHN.16_1-1626), some 13 (e.g., 296905-SAU.13_1-None), 12 (e.g., 524050-None-2096), 4 (e.g., None-AGO.2_1-2691)… Could you explain why that's the case?
A:
This depends on data availability. The string_id uses the format hydrobasinID-GADM0ID-WHYMAPID; "None" is used when a geometry is not part of the associated unit. The numbers that you specify are, however, different from what I found. For each string_id and weighting_scheme (industry), there is a maximum of 13 indicators + 3 grouped aggregated water risk scores + 1 total aggregated water risk score. Hence the maximum is 17.
434823-CHN.16_1-1626 (17, 10)
296905-SAU.13_1-None (16, 10)
None-AGO.2_1-2691 (7, 10)
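A small helper (ours, not part of the Aqueduct codebase) shows how a string_id of the form hydrobasinID-GADM0ID-WHYMAPID can be decoded, with "None" marking a missing geometry:

```python
def parse_string_id(string_id):
    """Split a string_id into its hydrobasin, GADM level 0, and WHYMAP parts.

    Illustration only; assumes the three parts themselves contain no
    hyphens, as in the examples above.
    """
    parts = string_id.split("-")
    keys = ["hydrobasin_id", "gadm0_id", "whymap_id"]
    # Map the literal string "None" to an actual missing value.
    return {k: (None if p == "None" else p) for k, p in zip(keys, parts)}

parse_string_id("296905-SAU.13_1-None")
# {'hydrobasin_id': '296905', 'gadm0_id': 'SAU.13_1', 'whymap_id': None}
```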
PCR-GLOBWB 2 on S3 (Geotiff): s3://wri-projects/Aqueduct30/processData/Y2017M07D31_RH_Convert_NetCDF_Geotiff_V02/output_V02
PCR-GLOBWB 2 on earthengine (ImageCollection): projects/WRI-Aquaduct/PCRGlobWB20V0/global_historical_PDomWN_month_m_5min_1960_2014
on S3: s3://wri-projects/Aqueduct30/processData/Y2018M12D11_RH_Master_Weights_GPD_V02
on BigQuery: aqueduct30:aqueduct30v01.y2018m12d11_rh_master_weights_gpd_v02_v10
Throughout the readme, variables that you need to replace with your own value are indicated with angle brackets: <variableYouNeedToReplace>
If you are not viewing this document on Github, please find a stylized version here
The coding environment uses Docker images that can be found here
This document explains each and every step of the data processing for Aqueduct 3.0. Everything is here, from raw data to code to explanation. We also explain how you can replicate the calculations on your local machine or in a cloud environment.
The overall structure is as follows:
- Data is stored on WRI's Amazon S3 Storage
- Code and versioning are stored on Github
- The Python environment description is stored in a Docker Image
- Coding and data operations are done in Jupyter Notebooks
A link to the flowchart: https://docs.google.com/drawings/d/1IjTVlQUHNYj2w0zrS8SKQV1Bpworvt0XDp7UE2tPms0/edit?usp=sharing
Each data source (pristine data), indicated with the open cylinder on the right side, is stored in the rawData folder on our S3 drive: wri-projects/Aqueduct30/rawData
The pristine data is also copied to step 0 in the data processing folder: wri-projects/Aqueduct30/processData
A link to edit the technical setup drawing: https://docs.google.com/drawings/d/1UR62IEQwQChj2SsksMsYGBb5YnVu_VaZlG10ZGowpA4/edit?usp=sharing
There are two options to setup your working environment:
- Locally
- In the cloud (recommended)
Both options are based on Docker and Jupyter. Although you might be able to do the lion's share of the data processing on your local machine, there are good reasons to work with a cloud-based solution:
- mount a large harddrive to store the data; you will need approximately 300GB
- easy to pick an appropriate instance size (number of CPUs and RAM)
There are also downsides:
- Additional security steps required
- Account(s) needed
- Costs
Requirements:
- The Docker image requires approximately 12GB of Storage and is not a lightweight solution.
- If you want to replicate the Aqueduct data processing steps, you will need approximately 300GB of disk space.
If you are on a Windows machine, the standard command prompt is limited. I found it useful to install a custom application, such as conEmu, to replace the command line.
- Install Docker Community Edition (instructions)
For Windows, it requires some additional steps and might require enabling Hyper-V virtualization. In some cases you have to enable this in your BIOS. For WRI Windows 10 Dell Latitude E7250 laptops, the following links are helpful:
Manually enable Hyper-V
Troubleshoot
- Add your user to Docker
- Start Docker
You can check if Docker is installed by typing `docker -v` in your terminal or command prompt. If you ever get stuck in one of the next steps or close your terminal window, it is important to understand some basic Docker commands. First, you need to understand the concept of an image and a container. You can list your images using `docker images`, your active containers using `docker ps`, and all your containers using `docker ps -a`. If your container is still running, you can bash (terminal) into it using `docker exec -it <container name> bash`. To shut down a container, use `exit`. Furthermore, you can delete containers using `docker rm -f <ContainerName>` and images using `docker rmi <imageName>`. I also created a couple of cheatsheets for various tools.
Run a Docker Container:
docker run --name aqueduct -it -p 8888:8888 rutgerhofste/docker_gis:stable bash
This will download the Docker image and run a container named aqueduct in -it mode (interactive, tty), forward port 8888 on the container to localhost port 8888, and execute a bash script. Understanding the basics of Docker helps to understand what you are doing here. Docker will automatically put your terminal or command prompt in your container: it will say root@containerID instead of your normal user. You can tell that you are in a container by the first characters in your terminal; it will state something like "root@240c3eb5620e:/#", indicating you are a root user on the virtual machine named "240c3eb5620e". The code will be different in your case.
Set up security certificates:
in your container create a certificate by running:
openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout /.keys/mykey.key -out /.keys/mycert.pem
You are asked some questions, like country name etc., which you can leave blank. Just press return a couple of times.
Clone the Git repository. You have two options here: 1) clone the Aqueduct repository, or 2) create a so-called fork of the Aqueduct project and work in the fork. The first option requires you to be added as a collaborator in order to push your edits to the repo. The latter option allows you to work independently from the official Aqueduct repo; you will need to make a pull request to have your edits incorporated into the main repo of Aqueduct 3.0.
- Option 1) Clone original Aqueduct3.0 repository:
While in your Docker container (root@... $):
mkdir /volumes/repos
(might already exist)
git clone https://github.com/rutgerhofste/Aqueduct30Docker.git /volumes/repos/Aqueduct30Docker/
- Option 2) Fork repository first
Fork repository on Github
Learn more about how forking works here
mkdir /volumes/repos
(might already exist)
git clone https://github.com/<Replace with your Github username>/Aqueduct30Docker.git /volumes/repos/Aqueduct30Docker/
Create a TMUX session before spinning up your Jupyter Notebook server
Although this is an extra step, it allows you to have multiple windows open and to detach and attach in case you lose a connection:
tmux new -s aqueducttmux
Split your session window into two panes using `ctrl-b "`. The way TMUX works is that you press `ctrl-b`, release it, and then press `"`. More info on TMUX. You can change panes using `ctrl-b o` (opposite).
In one of your panes, launch a Jupyter Notebook server
jupyter notebook --no-browser --ip=0.0.0.0 --allow-root --certfile=/.keys/mycert.pem --keyfile=/.keys/mykey.key --notebook-dir=/volumes/repos/Aqueduct30Docker/ --config=/volumes/repos/Aqueduct30Docker/jupyter_notebook_config.py
Open your browser and go to https://localhost:8888
The standard password for your notebooks is Aqueduct2017!; you can change this later.
Congratulations, you can start running code in your browser. This tutorial continues in the section Additional Steps After Starting your Jupyter Notebook server.
Get familiar with how to use Amazon (EC2) or Google Cloud (CE) virtual instances:
for this I recommend the tutorials available on Amazon's and Google's websites.
Amazon tutorial
Use the specifics below when setting up your EC2 instance. If you miss one step, your instance will likely not work.
- In step 1) select Ubuntu Server 16.04 LTS (HVM), SSD Volume Type
- In step 2), if your budget allows, choose T2.Medium (calculate costs)
- In step 3), make sure:
  - Auto-assign Public IP = enable
  - If you are within a VPC, allow IP addresses to be set
  - Under advanced details, set user data as a file and upload the startup.sh script from the /other folder on Github
- In step 4), add storage: depending on the steps in the data process, we recommend setting the size to 200GB (calculate costs)
- In step 5), add tags: optionally you can set a name for your instance
- In step 6), set the appropriate security rules. This is a crucial step. Eventually we will communicate over SSH (port 22) and HTTPS (port 443). You can whitelist your IP address or allow traffic from everywhere. As a minimum you need to allow SSH and HTTPS from your IP address. If you want to do testing with HTTP, you can temporarily allow HTTP (port 80) traffic.
- Launch your instance
Connect to your instance using SSH.
For Windows, PuTTY is recommended; for Mac and Linux you can use your terminal.
Once logged in to your system, check if Docker is installed:
docker version
Download the latest Docker image for Aqueduct (check https://hub.docker.com/search/?isAutomated=0&isOfficial=0&page=1&pullCount=0&q=rutgerhofste&starCount=0) and run your container:
docker run --name aqueduct -it -p 8888:8888 rutgerhofste/docker_gis:stable bash
(Recommended) Set up HTTPS access. In your container, create a certificate by running:
openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout /.keys/mykey.key -out /.keys/mycert.pem
and answer some questions needed for the certificate.
Optional: Set up SSH access keys:
https://help.github.com/articles/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent/
ssh-keygen -t rsa -b 4096 -C "rutgerhofste@gmail.com"
cat /root/.ssh/id_rsa.pub
Clone your repo in a new folder:
mkdir /volumes/repos
cd /volumes/repos
If you set up Github SSH (see above):
git clone git@github.com:rutgerhofste/Aqueduct30Docker.git
otherwise:
git clone https://github.com/<Replace with your Github username>/Aqueduct30Docker.git /volumes/repos/Aqueduct30Docker/
You might have to specify credentials.
Create a TMUX session before spinning up your Jupyter Notebook server.
Although this is an extra step, it allows you to have multiple windows open and to detach and attach in case you lose a connection:
tmux new -s aqueducttmux
Split your session window into two panes using `ctrl-b "`. The way TMUX works is that you press `ctrl-b`, release it, and then press `"`. More info on TMUX. You can change panes using `ctrl-b o` (opposite).
Start your notebook with the certificates
jupyter notebook --no-browser --ip=0.0.0.0 --allow-root --certfile=/.keys/mycert.pem --keyfile=/.keys/mykey.key --notebook-dir=/volumes/repos/Aqueduct30Docker/ --config=/volumes/repos/Aqueduct30Docker/jupyter_notebook_config.py
In your browser, go to:
https://<your public IP address>:8888
You can find your public IP address on the overview page of Amazon EC2. Your browser will give you a warning because you are using a self-created certificate. Do you trust your self-created certificate?
If you trust yourself, click Advanced (Chrome) and proceed to the site. The current config file is password protected; I will change it to something generic in the future. If you want to change this password, please see this link.
The standard password for your notebooks is Aqueduct2017!; you can change this later.
Congratulations, you are up and running. To make the most of these notebooks, you will need to authenticate for a couple of services, including AWS and Google Earth Engine.
Let's check what we've done so far. You are now able to connect to a Jupyter notebook server that runs either locally or in the cloud. In addition to your browser, you have a terminal (or command prompt) window open with two TMUX panes. One is logging what is happening on your Jupyter notebook server; the other is idle but connected to your container. You can tell that you are in a container by the username and machine name in your window: it should say something like root@240c3eb5620e:. Remember that you can switch panes with ctrl-b o.
- Authenticate for AWS
In your tmux pane, type `aws configure`.
You should now be able to provide your AWS credentials. Please ask Susan Minnemeyer if you haven't received those already.
Authenticate for Google Cloud SDK
Similar to AWS, you might need Google Cloud access.
gcloud auth login
Authenticate for Earth Engine (if needed, you can also do this from within Jupyter)
earthengine authenticate
The Docker image comes with git installed and is linked to the following Github remote: https://github.com/rutgerhofste/Aqueduct30Docker
In order to commit, please run a terminal from the Jupyter main page (top right corner).
you can bash into the instance using
docker exec -it <container ID> bash
share repo on hub.docker
docker login
docker tag image username/repository:tag
e.g.: docker tag friendlyhello rutgerhofste/get-started:part1
docker push username/repository:tag
Identify yourself on the server:
git config --global user.email "rutgerhofste@gmail.com"
git config --global user.name "Rutger Hofste"
cleanup
check containers
docker ps -a
docker stop <containerID>
docker rm <containerID>
check images
docker images
docker rmi <imageID>
Windows: remove dangling ("none") images:
FOR /f "tokens=*" %i IN ('docker images -q -f "dangling=true"') DO docker rmi %i
Safe way: run bash in the Docker container and use aws configure (set the region to us-east-1):
aws configure
Copy files to volume
aws s3 cp s3://wri-projects/Aqueduct30/rawData/Utrecht/yoshi20161219/waterdemand /volumes/data/ --recursive
Using PuTTY and want to edit a file in nano/vim:
export TERM=xterm
(works around a terminal bug)
The javascript files for Earth Engine can be added to your Earth Engine code editor (code.earthengine.google.com) by using the following URL:
https://code.earthengine.google.com/?accept_repo=aqueduct30
note to self: If you make changes in the online code editor and want to push to this github, use
git clone https://earthengine.googlesource.com/aqueduct30
git pull origin
Run on your instance (not in docker container)
openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mykey.key -out mycert.pem
Put the private and public key in a folder that matches the path in your Jupyter config file;
if needed, change the path in your Jupyter config file.
run your container
copy files to container
docker run -it -p 8888:8888 testjupyter:v01 bash
cd /usr/local/bin/
docker images -q --filter "dangling=true" | xargs -r docker rmi
docker run -it -p 8000:8000 rutgerhofste/jupyterhub:v02 bash
clone latest git repo
Create certificates
openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout /.keys/jupyterhub.key -out /.keys/jupyterhub.crt
Set environment variables
Create these values: https://github.com/settings/applications/new
export GITHUB_CLIENT_ID=from_github export GITHUB_CLIENT_SECRET=also_from_github export OAUTH_CALLBACK_URL=https://[YOURDOMAIN]/hub/oauth_callback
Run jupyterhub in folder with jupyterhub_config.py
jupyterhub
https://jdblischak.github.io/2014-09-18-chicago/novice/git/05-sshkeys.html
cd ~/.ssh ssh-keygen -t rsa -C "rutgerhofste@gmail.com"
Use no passphrase and the default folder.
cat ~/.ssh/id_rsa.pub
Add on github
git clone git@github.com:rutgerhofste/Aqueduct30Docker.git
This schema was created using draw.io. File Location: other/ERD.xml