A web application for exploring the relationship between cancer and nitrate in water.
2020-02-23
Cory Leigh Rahman, University of Wisconsin-Madison, Master's in GIS & Web Map Programming, GEOG 777: Capstone in GIS Development, Project 1 Report
Nitrate is a contaminant which can get into your drinking water supply because of fertilizers and other waste. Data from Wisconsin can be used to help figure out what the relationship is between nitrate in drinking water and cancer.
To see how nitrate and cancer are related, linear regression was run on state-wide data from Wisconsin. Results of the regression could reveal how much of the cancer rate can be explained by nitrate concentration in drinking water.
- Independent (Explanatory) Variable: Nitrate Concentration from water well samples (parts per million)
- Dependent Variable: Cancer Rate from census tracts (percentage of population)
The green-colored steps (Interpolation, Residuals, and Results) are the steps the end-user can interact with. The rest is done automatically by the back-end server in Python. End-users can set the Interpolation parameters, run the full analysis, and explore the results.
The flow chart visualizes the process of the analysis. The analysis starts with two inputs from 2016: cancer rates from census tracts, and nitrate concentrations from well water samples. In order to explore the relationship between these factors they must be in the same form. To get the data in the same form, the well samples were interpolated and zonal statistics were used to get the average nitrate concentration per census tract. Once both cancer rates and average nitrate concentration were averaged per census tract, finally linear regression could be run on the data.
A web application was built to automate the analysis and allow users to explore the results.
Back-end (Server-Side) Technologies Used:
Technology | Description | Used For |
---|---|---|
Flask | Python Web Framework | Server code & API |
Anaconda | Data Science Platform, Package Manager | Managing tools / dependencies |
GDAL | Geospatial Data Abstraction Library | IDW Interpolation of water wells |
rasterstats | Raster Statistics Library | Zonal statistics |
GeoPandas | Geospatial Data Python Library | Working with GeoJSON, organizing data |
statsmodels | Statistics Library | Ordinary least-squares (OLS) regression |
Front-end (Client-side) Technologies Used:
Technology | Description | Used For |
---|---|---|
Angular | JavaScript Front-End Web Framework | Code organization, TypeScript support, data binding, and built-in tools |
NPM | Package Manager | Managing tools / dependencies |
Leaflet | JavaScript Mapping Library | Displaying regression residuals on a map |
Bootstrap | Front-end UI Component Library | Page layout, input and button styling |
Other Technologies
Technology | Description | Used For |
---|---|---|
VS Code | Microsoft Visual Studio Code IDE / Text Editor | Writing all code |
colorbrewer2.org | Map Symbology Color Picker | Map color scheme |
Sketch | Vector Graphics Design Software | Creating the flow chart graphic and IDW parameters graphic |
The back-end is entirely written in Python and can be found in app.py
. Through Flask, a simple API was written which accepts an HTTP request to /analyze
with parameters.
...
@app.route('/', strict_slashes=False)
@app.route('/static', strict_slashes=False)
def main():
return render_template('index.html')
@app.route('/analyze', methods=['POST'])
def analyze():
...
Once the request is received, the following actions are taken:
- Get user inputs (Power and Smoothing for IDW)
- Perform the IDW interpolation on water well data
- Get the average nitrate concentration per census tract using Zonal Statistics
- Organize the appropriate attribute data for regression (cancer rate and nitrate concentration)
- Perform the regression analysis
- Merge the residuals back into the dataset
- Return two items: a) Regression results summary text, b) GeoJSON of residuals
The end-user is guided through the problem and shown how to explore it. The interface is split up into two main sections:
- Introduction
- Analysis
The Introduction section explains how nitrate gets in water and prompts the end-user with the question of how nitrate and cancer are related. The previously seen Analysis Flow Chart graphic is displayed for the end-user here, along with optional supplementary information.
The Analysis section provides interfaces for the inputs and results. Because the most subjective part of this analysis is the IDW (inverse distance weighted) interpolation, the end-user is allowed to input their own values for the Power (distance decay exponent) and Smoothing values. After the end-user selects their IDW inputs and runs the tool, they are presented with a map of residuals from the regression alongside a full regression summary.
The code for the front-end can be found in the folder front-end-exploring-cancer-nitrate/
After experimenting with multiple IDW inputs, the highest R-Squared value received was using the default values of "2
" for the distance decay exponent and "1
" for the Smoothing value. These inputs yielded the following results:
Using linear regression, the cancer rate can be estimated per census tract based only on the average nitrate concentration in water there. The map displays the difference (i.e. residual) between regression estimations and the actual cancer rates in each tract.
The residuals seem to be somewhat clustered around urban areas, but otherwise relatively random. With only one explanatory variable (nitrate concentration), it is not expected for the residuals to be completely random, as other explanations for cancer rates are certainly being missed.
The regression summary tells us two important things:
- What the R-Squared value is
- If we can trust the R-Squared value
The R-Squared value this analysis received was 0.32, suggesting that nitrate concentration in water can explain 32% of the cancer rate per census tract as long as the R-Squared value is trustworthy.
The other factors in the regression summary reveal that the R-Squared value is trustworthy. Because the P value (Prob F Statistic) is close to zero, and the F-statistic value is really large, we can reject the null hypothesis; this proves that there is a linear relationship between the independent and dependant variables. For the "mean" variable (mean nitrate concentration), given a large t-value (25.676) and a P-value close to zero, the null hypothesis can again be rejected, proving that nitrate concentration is significant in the prediction of cancer rate.
Given the statistical significance of the regression model, the results show that nitrate concentration in well water can explain 32% of the cancer rate in Wisconsin.
<style> img[src*='#w70'] { width: 70%; margin-left: 15%; margin-right: 15%; } </style>