/SematicsOfStats

The varying semantics of statistical modelling: the example of regression


Semantics of Statistical models

This repository contains data and code to reproduce the analysis described in a paper published in Geographical Analysis in May 2019 (https://onlinelibrary.wiley.com/doi/full/10.1111/gean.12199).

The code is in the file Semantics_code_data_git.R, which loads its data from this repository. Please contact Lex Comber (a.comber@leeds.ac.uk) if you have any questions.

Paper title: The forgotten semantics of regression modelling in Geography

Alexis Comber (1), Paul Harris (2), Yihe Lü (3), Lianhai Wu (2) and Pete Atkinson (4)

(1) School of Geography, University of Leeds, Leeds, UK LS2 9JT
(2) Rothamsted Research, North Wyke, Okehampton, Devon, UK EX20 2SB
(3) Chinese Academy of Sciences, Beijing, 100085, China
(4) Faculty of Science and Technology, Lancaster University, UK

Abstract

This paper is concerned with the semantics associated with the statistical analysis of spatial data. It takes the simplest case of the prediction of y as a function of x, in which predicted y is always an approximation of y and is always, and can only ever be, a function of x, and illustrates a number of core issues using ‘synthetic’ remote sensing and ‘real’ soils case studies. Specifically, the outputs of regression models and therefore the meaning of predicted y, are shown to vary due to 1) choices about data: specification of x (which covariates to include), the support of x (measurement scales, granularity), the measurement of x and the error of x, and 2) choices about the model including its functional form and the method of model identification. Some of these issues are more widely recognised than others. The case studies illustrate the effects of data and model choices and their impacts on model outputs. The study provides definition to the multiple ways in which prediction from regression may be affected and shows how regression prediction and inference are affected by data and model choices. The paper invites researchers to pause and consider the implications of predicted y being nothing more than a scaled version of a single covariate, inheriting the same spatial correlation, and argues that it is naïve to ignore this.
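
The abstract's closing point, that predicted y from a single-covariate regression is only a scaled version of x and inherits its spatial correlation, can be illustrated with a minimal sketch in base R. The data here are simulated, not the paper's remote sensing or soils data, and the names (x, y, fit, y_hat, lag1) are illustrative assumptions that do not come from Semantics_code_data_git.R. A one-dimensional transect stands in for space, and lag-1 autocorrelation stands in for spatial correlation.

```r
# Minimal sketch with simulated data (an assumption; not the paper's data or code).
set.seed(1)
n <- 200
x <- cumsum(rnorm(n))          # autocorrelated covariate along a 1-D transect
y <- 2 + 0.5 * x + rnorm(n)    # response: linear signal plus independent noise

fit   <- lm(y ~ x)             # simple linear regression of y on x
y_hat <- fitted(fit)           # predicted y at the observed x values
b     <- unname(coef(fit))     # b[1] = intercept, b[2] = slope

# Predicted y is exactly intercept + slope * x, i.e. a rescaled copy of x.
max_diff <- max(abs(y_hat - (b[1] + b[2] * x)))
print(max_diff)                # numerically zero

# Hence predicted y is perfectly correlated with x.
print(cor(y_hat, x))

# Lag-1 autocorrelation: correlation of a series with itself shifted by one step.
lag1 <- function(v) cor(v[-1], v[-length(v)])

# Predicted y carries the same autocorrelation as x; observed y does not have to.
print(c(x = lag1(x), y_hat = lag1(y_hat), y = lag1(y)))
```

Because correlation is unchanged by a linear rescaling, the autocorrelation of predicted y matches that of x whatever the fitted slope and intercept turn out to be, while the observed y also contains noise that the prediction discards.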

Acknowledgements

This research was supported by the China-UK bilateral collaborative research on critical zone science (the Natural Environment Research Council Newton Fund NE/N007433/1, the National Natural Science Foundation of China NO. 41571130083) and the National Key Research and Development Program of China (No. 2016YFC0501601). All of the data preparation, analyses and mappings were undertaken in R 3.5.1, the open source software.