Description
The toolset in this repository estimates the probability that a candidate settlement is a true positive (an actual settlement) or a false positive (a non-settlement area mistaken for a settlement during the generation stage), with the goal of improving the accuracy of settlement identification work.
The generation of candidate settlements is based on image analysis of high-resolution satellite imagery data, which will not be covered in this repo. The toolset here will take in potential settlements and help predict which ones are likely to be false positives. False negatives (actual settlements that are missed in the generation stage) will not be discovered, so it is preferable to be more lenient during the generation process and allow more candidates to be considered in this subsequent filtering stage. It is worth noting that this toolset is developed with small settlements (hamlet or village level) in mind, taking advantage of their relatively smaller scope and simpler geometric shape. However, components of the toolset can be transferred to the filtering of larger settlements, or other related geospatial analysis.
This toolset was developed by Guangyu (Tim) Wu during his research assistantship at the Columbia University CIESIN GRID3 project. Special thanks to Jolynn Schmidt and Hasim Engin for their feedback and guidance during the development process, as well as to colleagues at GRID3 for creating the initial data products.
Notebooks
- Data Collection (OSM)
- Feature Engineering (Land Cover, Land Use, Road Connectivity, Dispersion, Building Features)
- False Positive Classification with Machine Learning (XGBoost)
Results
After hyperparameter tuning, the trained XGBoost classifier achieves an F1 score of 0.8 and AUC of 0.854 on the test set.
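For readers less familiar with the two metrics, the following minimal sketch shows how an F1 score and an AUC can be computed from labels and predicted probabilities. The labels, scores, 0.5 threshold, and helper names are made-up illustrations, not the project's data or code, and treating "false positive" as the positive class is an assumption.

```python
def f1_score(y_true, y_pred):
    """F1 is the harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """AUC is the chance that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties between a positive and a negative count as half a win
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = false-positive settlement, 0 = actual settlement
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

toy_f1 = f1_score(y_true, y_pred)
toy_auc = auc(y_true, scores)
```

F1 depends on the chosen threshold, while AUC depends only on how the scores rank the candidates.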
The trained model is then applied to predict the false-positive probability of all 650,000+ hamlet settlements in Zambia. Visualizations on national, regional, and local scales show that the probability estimation is plausible and useful in identifying likely false positives for verification.
National distribution of possible false-positive hamlets (red means higher probability)
Regional visualization of possible false-positive hamlets (note the hamlets in swamp areas)
Example of false-positive hamlets found during visual inspection of hamlets with high probability
List of utility functions
Function read_csv_as_gpd
Parameters:
- df_or_filepath, GeoDataframe or str
- id_column, str (default = None, if not specified, a UUID column will be generated)
- lon_lat_columns, list of str (default = [])
- attribute_columns, list of str (default = [])
- drop_rows_where_duplicates_in_columns, list of str (default = [])
- keep_which_if_duplicates, str (default = 'first', options: 'first','last')
- drop_rows_where_nan_in_columns, list of str (default = [])
Returns:
- A GeoDataframe loaded from a CSV file. Use drop_rows_where_duplicates_in_columns in combination with id_column and keep_which_if_duplicates to control the behavior when duplicates are detected in certain columns. Use drop_rows_where_nan_in_columns to control the dropping of rows with missing values.
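The duplicate and missing-value options can be illustrated with a plain-Python sketch. This is not the toolset's implementation: the helper load_rows, its argument names, the generated "uuid" column name, and the order of operations (missing values dropped first, then duplicates) are all assumptions, and no geometry is built here.

```python
import csv
import io
import uuid


def load_rows(csv_text, id_column=None, dedupe_columns=(), keep="first", nan_columns=()):
    """Hypothetical stand-in for the row-filtering part of a CSV loader."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Drop rows with a missing value in any of the listed columns
    rows = [r for r in rows if all(r[c] not in ("", None) for c in nan_columns)]
    if dedupe_columns:
        # To keep the last duplicate, scan in reverse and restore the order afterwards
        ordered = rows[::-1] if keep == "last" else rows
        seen, unique = set(), []
        for r in ordered:
            key = tuple(r[c] for c in dedupe_columns)
            if key not in seen:
                seen.add(key)
                unique.append(r)
        rows = unique[::-1] if keep == "last" else unique
    if id_column is None:
        # No id column given: generate a UUID per row (column name is an assumption)
        for r in rows:
            r["uuid"] = str(uuid.uuid4())
    return rows


csv_text = (
    "name,lon,lat\n"
    "A,28.1,-15.4\n"
    "A,28.2,-15.5\n"
    "B,,-13.0\n"
    "C,30.0,-12.0\n"
)
rows = load_rows(csv_text, dedupe_columns=["name"], keep="last", nan_columns=["lon"])
```

Here row B is dropped for its missing longitude, and of the two A rows only the later one survives because keep is 'last'.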
Function explode_geometry
Parameters:
- gdf, GeoDataframe
- geom_colum, str (default = 'geometry')
- id_column, str (default = None, must specify if using drop_duplicates)
- drop_duplicates, boolean (default = False)
Returns:
- A GeoDataframe with the geometry column exploded into single-polygon/single-linestring objects. Use drop_duplicates in combination with id_column to keep only the largest shape among the shapes with the same original id.
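The "keep only the largest shape per original id" step can be sketched in plain Python as follows. The helper name and the toy areas are hypothetical, and measuring "largest" by area is an assumption (for linestrings, length would be the natural equivalent).

```python
def keep_largest_per_id(parts):
    """parts: list of (original_id, area) tuples, one per exploded shape."""
    largest = {}
    for original_id, area in parts:
        # Remember only the biggest area seen so far for each original id
        if original_id not in largest or area > largest[original_id]:
            largest[original_id] = area
    return largest


# Multipolygon "s1" explodes into three polygons, "s2" into one (toy areas)
parts = [("s1", 120.0), ("s1", 450.0), ("s1", 80.0), ("s2", 300.0)]
largest = keep_largest_per_id(parts)
```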
Function left_spatial_join
Parameters:
- gdf1, GeoDataframe
- gdf2, GeoDataframe
Returns:
- A GeoDataframe resulting from the left spatial join of the two input GeoDataframes. The spatial join uses the "intersection" operation, and the right index is dropped.
Function drop_bounds
Parameters:
- gdf, GeoDataframe
- geom_column, str
Returns:
- A GeoDataframe with the geometry bounds columns dropped. These are the columns that start with a geometry column name and end with minx, maxx, miny, or maxy. This is a utility function to help simplify the output of the get_raster_value_distribution function.
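The column-selection rule described above can be sketched on a plain list of column names. The helper name and the example column names (including the underscore separator) are hypothetical.

```python
BOUND_SUFFIXES = ("minx", "maxx", "miny", "maxy")


def non_bound_columns(columns, geom_column):
    """Keep every column except those named like '<geom_column>...<bound suffix>'."""
    return [
        c for c in columns
        if not (c.startswith(geom_column) and c.endswith(BOUND_SUFFIXES))
    ]


columns = [
    "id", "geometry",
    "geometry_minx", "geometry_maxx", "geometry_miny", "geometry_maxy",
    "lc_forest",
]
kept = non_bound_columns(columns, "geometry")
```

The geometry column itself is kept, since it does not end with one of the four suffixes.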
Function add_buffer_column
Parameters:
- gdf, GeoDataframe
- buffer_radius, int (in meters)
- geom_column, str (default = 'geometry')
- buffer_shape, str (default = 'round', options: 'round','square','flat')
- proj2, str (default = None, options: any valid crs)
- replace, boolean (default = False)
- return_new_column, boolean (default = False)
Returns:
- A GeoDataframe with a new buffer column. Set the geometry from which to buffer using geom_column. Set the buffer radius, in meters, with buffer_radius; the radius can be negative to shrink shapes, though multiple shapes may be created as a result. Control the shape of the buffer with buffer_shape. To base the buffer on a projection other than the current one, use proj2; it applies only to the new buffer column and does not affect the projection of the input dataframe. To replace the main geometry column with the newly created buffer column, set replace to True. To get the name of the newly created buffer column, set return_new_column to True.
Function add_centroid_column
Parameters:
- gdf, GeoDataframe
- geom_column, str (default = 'geometry')
- proj2, str (default = None, options: any valid crs)
- replace, boolean (default = False)
- return_new_column, boolean (default = False)
Returns:
- A GeoDataframe with a new centroid column. Set the geometry from which to calculate the centroid using geom_column. To base the centroid on a projection other than the current one, use proj2; it applies only to the new centroid column and does not affect the projection of the input dataframe. To replace the main geometry column with the newly created centroid column, set replace to True. To get the name of the newly created centroid column, set return_new_column to True.
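As background on what a polygon centroid is, here is a minimal plain-Python sketch of the standard area-weighted (shoelace) centroid formula for a simple polygon in planar coordinates. It is an illustration only and not necessarily how the toolset computes centroids; the helper name and the rectangle are hypothetical.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as [(x, y), ...]."""
    area2 = 0.0  # twice the signed area, accumulated edge by edge
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    # Centroid = sum / (6 * area) = sum / (3 * area2)
    return cx / (3 * area2), cy / (3 * area2)


# A 4 x 2 rectangle in projected (meter) coordinates
rect_centroid = polygon_centroid([(0, 0), (4, 0), (4, 2), (0, 2)])
```

Because the formula assumes a flat plane, the result depends on the coordinate system the vertices are expressed in, which is why the choice of projection matters.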
Function add_intersection_count_column
Parameters:
- gdf, GeoDataframe
- uuid_column, str
- buffer_column, str
- feature_layer, str
- new_column, str
- feature_geom_column, str (default = 'geometry')
Returns:
- A GeoDataframe with a new column that counts the feature geometries within the buffer of the main geometry. gdf is the main GeoDataframe that has a buffer column, specified by buffer_column. feature_layer is the other GeoDataframe with features; by default the 'geometry' column of the feature layer is used, but this can be changed with feature_geom_column. new_column controls the name of the newly created column.
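The counting idea can be sketched in plain Python for the simplest case: point features and a round buffer around a point, in projected (meter) coordinates. The real function works on GeoDataframes and arbitrary geometries; the helper name, the data, and the choice to count points lying exactly on the buffer edge are assumptions.

```python
import math


def count_within_buffer(settlements, features, radius_m):
    """Count feature points inside a round buffer around each settlement point."""
    counts = {}
    for sid, (sx, sy) in settlements.items():
        counts[sid] = sum(
            1 for fx, fy in features
            if math.hypot(fx - sx, fy - sy) <= radius_m
        )
    return counts


settlements = {"a": (0.0, 0.0), "b": (10_000.0, 0.0)}
features = [(100.0, 0.0), (0.0, 400.0), (3_000.0, 0.0)]
counts_500 = count_within_buffer(settlements, features, 500)
```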
Function add_distance_to_nearest_neighbor_column
Parameters:
- gdf, GeoDataframe
- geom_centroid_column, str
- new_column, str
- rounding, str (default = 0)
Returns:
- A GeoDataframe with a new column that stores the distance from each geometry to the nearest other geometry within the same GeoDataframe. Use geom_centroid_column to specify which geometry to use for the nearest-distance calculation. new_column controls the name of the newly created column. The distance is measured in meters and rounded by default; this can be changed with the rounding parameter.
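A brute-force version of the nearest-neighbor distance can be sketched in plain Python, assuming centroids are already in projected (meter) coordinates. The helper name and data are hypothetical, and interpreting rounding as a number of decimal places is an assumption. Comparing every pair is fine for a toy example, but a dataset with hundreds of thousands of settlements would likely need a spatial index instead.

```python
import math


def nearest_neighbor_distances(centroids, rounding=0):
    """For each centroid, distance in meters to the closest other centroid."""
    result = {}
    for key, (x, y) in centroids.items():
        others = [
            math.hypot(x - ox, y - oy)
            for other_key, (ox, oy) in centroids.items()
            if other_key != key  # a geometry is not its own neighbor
        ]
        result[key] = round(min(others), rounding)
    return result


centroids = {"a": (0.0, 0.0), "b": (300.0, 400.0), "c": (300.0, 1000.0)}
distances = nearest_neighbor_distances(centroids)
```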
Function add_covering_geotiff_column
Parameters:
- gdf, GeoDataframe
- geom_column, str
- geotiff_filepath_column, str
- geotiff_filepath_list, list of str
Returns:
- A GeoDataframe with a new column storing the filepath of the geotiff that covers the shape in each row. Set the geometry used to find the covering geotiff with geom_column. Set the name of the new filepath column with geotiff_filepath_column. Provide the filepaths of the candidate geotiffs in geotiff_filepath_list. All parameters need to be explicitly specified.
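The lookup can be pictured as a bounding-box containment test, sketched here in plain Python. The tile filenames and extents are hypothetical (in practice the extents would come from each geotiff's metadata), and both requiring full containment and returning the first match are assumptions about the matching rule.

```python
def find_covering_tile(bounds, tile_bounds):
    """Return the first tile whose extent fully contains the geometry bounds, else None."""
    minx, miny, maxx, maxy = bounds
    for path, (tminx, tminy, tmaxx, tmaxy) in tile_bounds.items():
        if tminx <= minx and tminy <= miny and maxx <= tmaxx and maxy <= tmaxy:
            return path
    return None


# Hypothetical tiles as (minx, miny, maxx, maxy) in degrees
tile_bounds = {
    "tile_west.tif": (20.0, -20.0, 25.0, -10.0),
    "tile_east.tif": (25.0, -20.0, 30.0, -10.0),
}
inside_east = find_covering_tile((26.0, -15.0, 26.1, -14.9), tile_bounds)
straddling = find_covering_tile((24.9, -15.0, 25.1, -14.9), tile_bounds)
```

A shape that straddles two tiles is covered by neither under this rule, so it gets no filepath.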
Function get_raster_value_distribution
Parameters:
- gdf, GeoDataframe
- id_column, str
- geom_column, str
- geotiff_filepath_column, str
- geotiff_filepath_list, list of str
- code_to_label_mapping, list of str
- label_marker, list of str
- normalize, boolean
Returns:
- A GeoDataframe with new columns corresponding to the distribution of the different codes in the covering raster image. Specify which raster geotiff covers each geometry with geotiff_filepath_column. Provide the filepaths of the candidate geotiffs in geotiff_filepath_list. Use code_to_label_mapping to specify the mapping from numerical codes to human-readable labels, which may vary from one standard to another. label_marker is a prefix added to all the newly created columns to mark which raster they are derived from. normalize controls whether the values in the columns are proportions or absolute pixel counts. All parameters need to be explicitly specified.
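The per-geometry computation can be sketched in plain Python on a list of pixel codes that are assumed to have already been extracted from the raster under one geometry. The helper name, the dict form of the code-to-label mapping, the underscore in the column names, and the land cover codes are all assumptions made for illustration.

```python
from collections import Counter


def value_distribution(pixel_codes, code_to_label, label_marker, normalize=True):
    """Turn raster codes under one geometry into one '<marker>_<label>' value per class."""
    counts = Counter(pixel_codes)
    total = len(pixel_codes)
    row = {}
    for code, label in code_to_label.items():
        value = counts.get(code, 0)
        # Proportion of pixels when normalizing, raw pixel count otherwise
        row[f"{label_marker}_{label}"] = value / total if (normalize and total) else value
    return row


# Hypothetical land cover codes for the pixels under one settlement
pixel_codes = [10, 10, 20, 30, 10, 20, 10, 10]
code_to_label = {10: "forest", 20: "cropland", 30: "water"}
proportions = value_distribution(pixel_codes, code_to_label, "lc")
pixel_counts = value_distribution(pixel_codes, code_to_label, "lc", normalize=False)
```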
Function get_groupby_stats_df
Parameters:
- data, Dataframe or GeoDataframe
- groupby_column, str
- stats_map, dict
Returns:
- A Dataframe with statistics of the provided features. This is a simple wrapper around the Pandas groupby function.
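Since the function wraps the Pandas groupby, the underlying call probably resembles the sketch below. The exact shape of stats_map is not documented here, so the column-to-statistics dict used in this example is an assumption, as are the column names and values.

```python
import pandas as pd

# Toy feature table (hypothetical columns and values)
df = pd.DataFrame({
    "label": ["true_positive", "false_positive", "true_positive", "false_positive"],
    "building_count": [12, 1, 8, 3],
    "road_count_500m": [4, 0, 2, 0],
})

# Assumed stats_map format: column name -> list of aggregation names
stats_map = {"building_count": ["mean", "max"], "road_count_500m": ["sum"]}

# One row per group, one (column, statistic) pair per output column
stats = df.groupby("label").agg(stats_map)
```

Comparing such per-class statistics is a quick way to see which features separate true positives from false positives.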
Function get_most_correlated_feature
Parameters:
- data, Dataframe or GeoDataframe
- target, str
- features, list of str
Returns:
- The name of the feature that is most correlated with the target, as measured by Pearson's r. This is a simple utility function for choosing one feature when several features are highly correlated with each other.
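The selection rule can be sketched in plain Python as follows. The helper names and the toy data are hypothetical, and ranking by the absolute value of the correlation (so that strong negative correlations also count) is an assumption about the intended behavior.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def most_correlated_feature(data, target, features):
    """Pick the feature with the largest |r| against the target (assumed rule)."""
    return max(features, key=lambda f: abs(pearson_r(data[f], data[target])))


# Toy columns: two related building features and a binary target
data = {
    "is_false_positive": [0, 0, 1, 1],
    "building_count": [10, 8, 2, 1],
    "building_area": [500, 300, 350, 100],
}
best = most_correlated_feature(
    data, "is_false_positive", ["building_count", "building_area"]
)
```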
Function within_value_to_range_value
Parameters:
- gdf, GeoDataframe
- buffer_radius_markers, str
Returns:
- A GeoDataframe with new columns tracking the count of features in the ring areas around the main geometries. For example, the number of features in the ring area that is at least 500 meters but at most 5000 meters away from a settlement. This is based on the observation that the number of features within 5000 meters necessarily includes the number of features within 500 meters, which creates collinearity that hurts prediction models. Thus, this function calculates the count in a specific range instead of the count within a radius.
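The conversion from "within a radius" counts to ring counts is a subtraction of consecutive cumulative counts, sketched below in plain Python for a single settlement. The helper name, the dict layout, and the radii are hypothetical.

```python
def within_to_range(within_counts):
    """within_counts: {radius_m: count within that radius}.

    Returns {(inner_m, outer_m): count of features in that ring}.
    """
    rings = {}
    inner, previous = 0, 0
    for radius in sorted(within_counts):
        # Features in the ring = features within the outer radius
        # minus features already counted within the inner radius
        rings[(inner, radius)] = within_counts[radius] - previous
        inner, previous = radius, within_counts[radius]
    return rings


# Cumulative counts around one settlement (toy values)
within = {500: 3, 1000: 5, 5000: 12}
rings = within_to_range(within)
```

The ring counts no longer contain one another, which removes the built-in overlap between the 500 m, 1000 m, and 5000 m columns.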