This study proposes applying three machine learning algorithms: Random Forest (RF), Extratrees (ET), and Logistic Regression with regularization. Extremely Randomized Trees, or Extratrees, are a variant of the RF classifier (Geurts et al. 2006) that grow each tree on the entire training sample instead of a bootstrap sample and pick split cut-points at random instead of searching for the best one. Compared with RF, ET (1) have a lower computational cost, (2) produce smoother decision boundaries because of the extra randomization, and (3) tend to reduce overfitting.
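The split-selection difference between RF and ET can be illustrated with a minimal, standard-library sketch. It is a simplification made for illustration only: it looks at a single feature and one split, whereas real implementations draw several candidate features per node and grow full trees. The function names and toy data are hypothetical and are not part of this pipeline.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def weighted_impurity(values, labels, threshold):
    """Size-weighted Gini impurity after splitting the sample at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)


def best_threshold(values, labels):
    """RF-style: scan midpoints between sorted distinct values and keep the best."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda t: weighted_impurity(values, labels, t))


def random_threshold(values, rng):
    """ET-style: draw one cut-point uniformly between min and max, with no search."""
    return rng.uniform(min(values), max(values))


# Toy, perfectly separable data (hypothetical).
values = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

rng = random.Random(0)
rf_cut = best_threshold(values, labels)
et_cut = random_threshold(values, rng)

print("searched cut:", rf_cut, "impurity:", weighted_impurity(values, labels, rf_cut))
print("random cut:  ", et_cut, "impurity:", weighted_impurity(values, labels, et_cut))
```

The searched cut costs one impurity evaluation per candidate, while the random cut costs a single draw. That is where the lower computational cost comes from, at the price of individually weaker splits that are averaged out across many trees.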
Built Up Grid
The data contain an information layer on built-up presence derived from Sentinel-1 image collections.
- Source: Global Human Settlements
- Temporality: 1990, 2000 and 2014
- Format: raster with 250 m resolution
Population Grid
Generated by combining census data with the built-up index and areal weights. The resulting spatial distribution is expressed as the number of people per cell.
- Source: Global Human Settlements
- Temporality: 1990, 2000 and 2015
- Format: raster with 250 m resolution
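As a rough illustration of how a census count can be spread over grid cells, the sketch below allocates a unit's population in proportion to a per-cell weight. It is not the Global Human Settlements method. The weight definition (built-up fraction times the share of the cell inside the census unit), the even-split fallback, and all names and numbers are assumptions made for this example.

```python
def distribute_population(total_population, cell_weights):
    """Split a census unit's population across cells in proportion to their weights."""
    total_weight = sum(cell_weights.values())
    if total_weight == 0:
        # No built-up signal anywhere: fall back to an even split (an assumption).
        share = total_population / len(cell_weights)
        return {cell: share for cell in cell_weights}
    return {
        cell: total_population * weight / total_weight
        for cell, weight in cell_weights.items()
    }


# Hypothetical cells: weight = built-up fraction * share of the cell inside the unit.
cells = {
    "c1": 0.8 * 1.0,
    "c2": 0.2 * 1.0,
    "c3": 0.0 * 0.5,
}

people_per_cell = distribute_population(1000, cells)
for cell, people in sorted(people_per_cell.items()):
    print(cell, round(people, 1))
```

The allocation preserves the census total, so summing the cells of a unit returns the original count.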
Digital Elevation Model (DEM)
SRTM 90m Digital Elevation Database v4.1
- Source: NASA
- Format: raster with 90 m resolution
City Lights
- Source: NOAA
- Temporality: 1995, 2000 and 2013
- Format: raster with 250 m resolution
Highways
- Source: OpenStreetMap
- Temporality: starting from 2008
- Format: lines geometry
Geolocations: airports, schools, universities, worship places and hospitals
- Source: OpenStreetMap
- Temporality: starting from 2008
- Format: points geometry
Water Bodies
Provides a basemap for the lakes, seas, oceans, large rivers, and dry salt flats of the world.
- Source: Esri Data and Maps
- Format: polygons geometry
- Python 3.5.2
- luigi
- psql (PostgreSQL) 9.4
- PostGIS 2.1.4
- geos
- gdal
- geopandas
- ...and many Python packages (see requirements.txt)
To run the pipeline, update the configuration files below with your own values and then run the commands that follow.
Configuration Files:
- pipeline/luigi.cfg: needs to be configured to run luigi
- pipeline/experiment.yaml: needs to be configured with the models and features to run
- pipeline/.env: needs to be configured to connect to the databases (make a copy of pipeline/_env)
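For orientation, a dotenv-style file is a list of KEY=VALUE lines. The sketch below shows one way such a file could be parsed with the standard library only. The keys used in the demo (PGHOST, PGPORT, PGDATABASE) are hypothetical placeholders. The variable names this pipeline actually expects are the ones listed in pipeline/_env, and the pipeline may load them differently.

```python
import os
import tempfile


def load_env_file(path):
    """Parse KEY=VALUE lines from a dotenv-style file into a dict of strings."""
    settings = {}
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blank lines, comments, and anything that is not an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


# Demo with a throwaway file; the keys below are hypothetical examples.
with tempfile.TemporaryDirectory() as workdir:
    env_path = os.path.join(workdir, ".env")
    with open(env_path, "w") as handle:
        handle.write("# database connection\n")
        handle.write("PGHOST=localhost\n")
        handle.write("PGPORT=5432\n")
        handle.write('PGDATABASE="urban"\n')
    settings = load_env_file(env_path)

print(sorted(settings.items()))
```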
Run the following commands:
To run locally (choose the number of workers):
python -m luigi --local-scheduler --workers 10 --module UrbanExpansion RunUrbanExpansion
To run against a luigi server:
python3 -m luigi --workers 10 --module UrbanExpansion RunUrbanExpansion
Once you have set up the environment, you can start using the pipeline. The general process of the pipeline is:
- Download data
- Preprocess (generate slope and city center)
- Insert into the database
- Generate grids
- Generate feature grids
- Generate urban clusters
- Generate urban feature grids
- Generate features and labels
- Run models
- Store models in the results schema
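Luigi runs a task only after the tasks it declares in requires() are complete, and it skips tasks whose outputs already exist. The standard-library sketch below mimics that resolution for the stages listed above. The stage names and the strictly linear dependency chain are assumptions made for illustration. The real task classes in the UrbanExpansion module may be named and wired differently.

```python
# Hypothetical stage names mirroring the list above; each maps to its upstream stages.
DEPENDS_ON = {
    "download_data": [],
    "preprocess": ["download_data"],
    "insert_to_db": ["preprocess"],
    "generate_grids": ["insert_to_db"],
    "generate_feature_grids": ["generate_grids"],
    "generate_urban_clusters": ["generate_feature_grids"],
    "generate_urban_feature_grids": ["generate_urban_clusters"],
    "generate_features_and_labels": ["generate_urban_feature_grids"],
    "run_models": ["generate_features_and_labels"],
    "store_models": ["run_models"],
}


def execution_order(target, depends_on, done=None):
    """Return the stages needed to build `target`, upstream first, skipping finished ones."""
    finished = set() if done is None else set(done)
    order = []

    def visit(stage):
        if stage in finished or stage in order:
            return
        for upstream in depends_on[stage]:
            visit(upstream)
        order.append(stage)

    visit(target)
    return order


# Full run from scratch.
full_run = execution_order("store_models", DEPENDS_ON)
print(full_run)

# Re-run after the first two stages have already produced their outputs.
resumed_run = execution_order(
    "run_models", DEPENDS_ON, done={"download_data", "preprocess"}
)
print(resumed_run)
```

Requesting only the final stage is enough to trigger everything upstream, which matches the way the commands above name a single task on the command line.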
The results schema is populated in this last stage. The schema includes the following tables:
- evaluations: metrics and values for each model (e.g. precision@100)
- feature_importances: for each model, gives feature importance values as well as rank (abs and pct)
- models: stores all information pertinent to each model
- predictions: for each model, stores the value for each cell
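A metric such as precision@100 is the share of truly positive cells among the 100 cells a model scores highest. The sketch below shows that computation on toy data. The function name, the inputs, and the tie handling (ties keep their input order) are assumptions for illustration, not the code that fills the evaluations table.

```python
def precision_at_k(scores, labels, k):
    """Fraction of positive labels among the k highest-scored cells."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not scores:
        raise ValueError("scores must not be empty")
    # Sort by score, highest first; sorted() is stable, so ties keep input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Toy predictions for five cells (hypothetical values).
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]

p_at_1 = precision_at_k(scores, labels, 1)
p_at_3 = precision_at_k(scores, labels, 3)
print("precision@1:", p_at_1)
print("precision@3:", round(p_at_3, 3))
```

If k exceeds the number of cells, the sketch divides by the number of cells actually available rather than by k.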