This software aims to delineate spatial regimes (geographically connected regions with varying coefficients across regions) in the context of linear regression models. Please refer to our paper for more details:
Hao Guo, Andre Python & Yu Liu (2023) Extending regionalization algorithms to explore spatial process heterogeneity, International Journal of Geographical Information Science, 37:11, 2319-2344, DOI: 10.1080/13658816.2023.2266493
- p.2336 "For multiple sales at the same location, we only retain the latest record." This involves ambiguity as several pairs of sales have the same location and date. One possible solution is to retain the sale with higher price. We have reproduced the King County experiment in this setting, verifying that this issue does not affect the main finding (lower SSR for K-Models than Skater-reg).
The software is built on Python 3.9 (or higher versions) as well as several Python packages.
-
Necessary
- numpy, matplotlib, networkx
- scikit-learn
- geopandas
- libpysal, mgwr, spopt, spreg
- xlrd, xlwt
-
Optional
- statsmodels (significance testing of linear equations)
src/Algorithm9.py
: Implementation of the regime delineation algorithms, including the two-stage K-Models (KM), AZP, Regional-K-Models (RKM). We also incorporate the interfaces of GWR-Skater (GSK, from mgwr/spopt) and Skater regression (SKR, from spreg) for comparison.src/Network9.py
: tool functions dealing with the adjacency network, region contiguity, etc.src/GridData9.py
: generating and handling simulated grid data.
synthetic/grid_*.txt
: three synthetic datasets on 25*25 grid, each with 50 simulations (dataid
0-49: Rectangular; 50-99: Voronoi; 100-149: Arbitrary). Each simulation contains three samples under different noise levels ('h' for high, 'm' for medium, 'l' for low).
We also include several simulations on 10*10 grid for debugging codes (synthetic/gridtest_*.txt
).
src/grid_gen.py
: data generationsrc/grid_reg.py
: spatial regime delineation with five considered algorithmssrc/grid_K.py
: experiment on the choice of parameter K (number of micro-clusters) in two-stage K-Modelssrc/grid_stab.py
: experiment on the algorithm stability of two-stage K-Models, AZP and Regional-K-Models
src/georgia.py
: spatial regime delineation on the Georgia Dataset, using five considered algorithmssrc/kinghouse.py
: spatial regime delineation on the King County house price Dataset, using two-stage K-Models and Skater-reg
kmodels(Xarr, Yarr, K, w, max_iter=10000, min_size=None, init_stoc_step=True, verbose=False)
The first stage (partition stage) of the two-stage K-Models algorithm for spatial regime delineation.
-
Parameters:
-
Xarr (
numpy.ndarray
): 2D array of explanatory variables with size$n\times m$ .$n$ is the number of spatial units,$m$ is the number of features (If the model includes an intercept, there should be a feature where all units has value 1). -
Yarr (
numpy.ndarray
): 1D array of the response variable with size$n$ . -
K (
int
): number of micro-clusters in the partition stage. -
w (
pysal.weights.W
): spatial contiguity matrix of the$n$ areal units. The order of units must be consistent withXarr
andYarr
. -
max_iter (
int
): maximum number of iterations. -
min_size (
int
): minimum number of spatial units in each micro-cluster. IfNone
, the default value of$m$ is used. -
init_stoc_step (
bool
): the way of region growing in initialization. IfFalse
, all neighboring units of current region are merged into it. Otherwise, a randomly chosen neighboring unit is assigned into it in each step. This increases the diversity of initial solutions, yet may be time-consuming for large datasets. -
verbose (
bool
): IfTrue
, there will be additional output in the algorithm process.
-
Xarr (
-
Returns:
-
label (
list
): the region label for each unit, with length$n$ . The$u$ th unit is assigned to the region with indexlabel[u]
. The region indices are integers from 0 to$K-1$ . -
iters (
int
): the number of iterations.
-
label (
The second stage (merge stage) of the two-stage K-Models algorithm is implemented in two functions. split_components
breaks disconnected regions into branches, and greedy_merge
merges the branches to achieve the required number of regions. They should be called sequentially.
split_components(w, clabel)
-
Parameters:
-
w: see
kmodels
. -
clabel (
list
):label
fromkmodels
.
-
w: see
-
Returns:
-
rlabel (
list
): the region label after split components, with length$n$ .
-
rlabel (
greedy_merge(Xarr, Yarr, n_regions, w, label, min_size=None, verbose=False)
-
Parameters:
-
Xarr, Yarr, w, verbose: see
kmodels
. -
n_regions (
int
): the number of output regions. -
label (
list
):rlabel
fromsplit_components
. -
min_size (
int
): minimum number of units in each output region. IfNone
, the default value of$m$ is used. It may be different frommin_size
inkmodels
.
-
Xarr, Yarr, w, verbose: see
-
Returns:
-
rlabel (
list
): the final region label for each unit, with length$n$ . The region indices are integers from 0 ton_regions-1
. -
coeffs (
list
): estimated coefficients for each region.coeff[r]
is a list of length$m$ , containing$m$ coefficients corresponding to the features inXarr
. -
merges (
int
): the number of merge operations.
-
rlabel (
azp(Xarr, Yarr, n_regions, w, max_iter=10000, min_size=None, init_stoc_step=True)
AZP algorithm for spatial regime delineation.
- Parameters:
- Xarr, Yarr, w, max_iter, init_stoc_step: see
kmodels
. - n_regions, min_size: see
greedy_merge
.
- Xarr, Yarr, w, max_iter, init_stoc_step: see
- Returns:
- label: same as
rlabel
ingreedy_merge
. - coeffs: see
greedy_merge
. - iters (
int
): the number of iterations.
- label: same as
region_k_models(Xarr, Yarr, n_regions, w, max_iter=10000, min_size=None, init_stoc_step=True):
Regional-K-Models algorithm for spatial regime delineation.
- Parameters:
The same withazp
. - Returns:
The same withazp
.
gwr_skater(Xarr, Yarr, n_regions, w, coord, min_size=None)
GWR-Skater for spatial regime delineation.
-
Parameters:
-
Xarr, Yarr, n_regions, w, min_size: see
azp
. -
coord (
list
): coordinates of the spatial units. Each of the$n$ elements is a coordinate pair (tuple of length 2) indicating the position of a unit.
-
Xarr, Yarr, n_regions, w, min_size: see
-
Returns:
-
label, coeffs: see
azp
.
-
label, coeffs: see
skater_reg(Xarr, Yarr, n_regions, w, min_size=None, verbose=0)
Skater regression for spatial regime delineation.
- Parameters:
- Xarr, Yarr, n_regions, w, min_size: see
azp
. - verbose (
int
): see document of spreg.skater_reg.
- Xarr, Yarr, n_regions, w, min_size: see
- Returns:
- label, coeffs: see
azp
.
- label, coeffs: see
Simulated data used in our experiments are provided in synthetic/grid_*.txt
. The file structure for each sample is: the array of grid_reg.py
automatically for spatial regime delineation.
If you use provided data, go straightly to step 2. Otherwise, use the script grid_gen.py
to generate new data samples. Open grid_gen.py
and change the following parameters if needed:
- Side : number of cells in each row and column of the grid.
- repeat : number of simulations in each dataset (Rectangular, Voronoi, Arbitrary).
- min_region : minimum number of units in each region.
Then run the script. Files named grid_*.txt
will be produced. Note that three samples with different noise levels are generated in each simulation. Move these files to the synthetic
directory for future use. The script will also show the latent regions of one simulation for each dataset, which produced Figure 2.
Use the script grid_reg.py
to perform spatial regime delineation with the five considered algorithms. Open grid_reg.py
and change the following parameters if needed:
- Side : number of cells in each side of the grid. Should be consistent with input data.
- ids, idt : index range of input data. The code will process data with id from
ids
toidt
-1. - prefix : 'edis_' or 'edistest_'. Should be consistent with input data.
- micro_clusters : the parameter K in the two-stage K-Models algorithm.
- min_region : minimum number of units in each region.
- recordnum : the number for this experiment, used to label output files.
The script will generate a log file named RG(recordnum).txt
. It records five lines for each sample, which is the summary for KM, AZP, RKM, GSK, SKR, respectively. Each line includes: total sum of squared residuals (SSR); number of true positives (TP), false negatives (FN), false positives (FP), true negatives (TN), Rand index; entropy of true regions, entropy of reconstructed regions, mutual information (MI), normalized mutual information (NMI); mean absolute error (MAE) of
For each dataid
, the program outputs:
-
A TXT file named
result_(recordnum)_(dataid).txt
. The result of low noise data comes first, then medium and high noise, each contains results from five algorithms. For each algorithm, the file first records map of the delineated regions (aSide*Side
matrix containing region index of each unit), then the estimated coefficients (the$r$ -th line is the intercept and linear coefficients for the$r$ -th region). Figure 4 is a visualization of the delineated regions recorded in these files. -
A PNG file named with
recordnum
,dataid
and time, visualizing the region delineation at each noise level.
To reproduce Figure A1a, run the two-stage K-Models, AZP, and Regional-K-Models algorithms repeatedly on the same sample of data, using the script grid_stab.py
. The usage is the same with grid_reg.py
except that dataid
and repeat
(number of repeated runs) need to be specified rather than ids
and idt
.
To reproduce Figure A1b, run the two-stage K-Models algorithm with different grid_K.py
. The usage is the same with grid_reg.py
except that idlist
(a list of dataid
) needs to be specified rather than ids
and idt
. You may also change mclist
, which specify the range of
The Georgia dataset is available as sample data of the MGWR software. We aggregated polygons belong to the same county into multi-polygons, making it easy to generate PySAL spatial weights.
The script src/georgia.py
run the five considered algorithms on the Georgia dataset. You may change the following parameters:
-
pmin, pmax : the range of number of regions. The code runs the algorithms for each
$p$ in [pmin
,pmax
]. -
Kfac : the
$K$ parameter in the two-stage K-Models is calculated asKfac
$\times p$ . - runs : number of repetitions. Increasing this parameter may find better solution, with the cost of more computation time.
- numid : number of the experiment, used to label output files.
- min_region : minimum number of units for each region. Should be at least 5 in this analysis for valid significance test.
After running the script, results are recorded in two output files:
-
Georgia_(numid).txt
: the log file. It records the results at each number of regions$p$ (frompmin
topmax
). For each$p$ , in the order of KM, AZP, RKM, GSK, and SKR, it first lists total SSR, execution time, number of iterations and merge operations (if applicable) for each run. Then for the best solution (with the lowest SSR), it records the coefficients and F-test results. Each line contains: region index, number of units, estimated coefficients, the F-value and p-value of the overall F-test. Figure 5 visualizes the total SSR from the five considered algorithm at different$p$ values. -
Georgia_(numid).xls
: an Excel workbook. For each county, it records the region it belongs to, as well as the estimated coefficients of that region. For each of the five algorithms, only the best solution is recorded.
To visualize the results, join the output table Georgia_(numid).xls
with the polygon Shapefile via the Area_key/AreaKey
field with any standard GIS software (e.g. QGIS). After that, Figure 6 and Figure 7 can be rendered in a GIS environment using the joined Shapefile.
The King County house price dataset is available from GeoDa center. The data processing procedure is described in our paper.
The script src/kinghouse.py
run the two-stage K-Models and Skater regression algorithms on the King County dataset. You may change the parameters including pmin, pmax, Kfac, runs, numid, and min_region. The meanings of these parameters are the same with those in src/georgia.py
. Note that min_region
should be at least 18 in this analysis for valid significance test.
After running the script, results are recorded in two output files:
Kinghouse_(numid).txt
: the log file. The file structure is the same with the log file for the Georgia dataset.Kinghouse_(numid).xls
: an Excel workbook. For each house location, it records the region it belongs to.
To visualize the results, join the output table Kinghouse_(numid).xls
with the point Shapefile via the id
field with any standard GIS software (e.g. QGIS). After that, Figure 8 can be rendered in a GIS environment using the joined Shapefile.