Modeling on sparse-dense covariates

For spatial/geographic process modeling


On interpolating sparse measurements with dense grid covariate(s)


0 Motivation

The attempt to recover a complete continuous random field from only a few point measurements/observations is fascinating! Humans take on such attempts at almost any scale in the pursuit of science and art. A geographer using a few boreholes to Krige the distribution of gold is no different from a sculptor using a pointing machine to sample out the shape of Bernini's masterpiece The Rape of Proserpina. For both science and art, accuracy is paramount. While accuracy is conveyed implicitly through the vividness of art masterpieces, scientists are no less hypercritical, obsessing over the nature of that accuracy under different conditions.

1 Background

In geoscience, using point data to make inferences about the underlying continuous field remains a popular topic, as we cannot afford to densely sample the field at any scale. Quite counterintuitively, after a few decades of applying sophisticated techniques, such as interpolation, to field reconstruction from points, much of the discussion still centers on basic issues of uncertainty, i.e. inference accuracy or error. For example, many practitioners have an insufficient understanding of the behavior of field reconstruction and of proper measures of accuracy, as discussed in the paper "Is the ordinary kriging variance a proper measure of interpolation error?" by Heuvelink & Pebesma (2002). In this context, I find it wouldn't hurt to continue this kind of basic discussion in order to ensure the continuation of the fascinating practice of field reconstruction from point observations. Staying within the domain of geoscience, I will touch on this topic for both scientific and educational purposes.

This snippet of scientific experiment is devoted to examining the rationale of field reconstruction from point observations in two major respects: (1) what is the accuracy pattern under different conditions, and (2) how can the accuracy be captured. You can think of accuracy as a function of several properties of the observations at hand: quality (e.g. in terms of noise intensity), density, distribution, variation, etc. There are then several scenarios of field-reconstruction accuracy, depending on the combination of observation properties, which are well worth understanding. Exhaustive exploration of these scenarios is fundamental to understanding the potential and the limitations of observations in reconstructing the underlying continuous field(s), or at least to gaining an impression of the reliability of using the limited observations at hand in real-life cases.

2 Recap of the basics

The default geostatistical method to reconstruct a continuous field from point observations is Kriging, and there are many variations of Kriging. In the simplest case, Kriging assumes that unobserved values along the continuous field are linear combinations of the observed values. The key step is to estimate the linear weights that combine the observations into predictions. The criterion for estimating the weights is that they should minimize the variance of the prediction error, which results in

\lambda = C^{-1} c

where \lambda is the vector containing the estimated weights, C is the covariance matrix among the observations, and c is the vector of covariances between the prediction location and the observations. Given observations Y, the expected value of the field at any unobserved location Y' is

E[Y' \mid Y] = m_Y + c^{T} C^{-1} (Y - m_Y)

along with the prediction variance

\mathrm{Var}(Y' \mid Y) = \mathrm{Var}(Y) - c^{T} C^{-1} c
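As a concrete illustration of these formulas, below is a minimal sketch of simple kriging in one dimension with NumPy. The squared-exponential covariance, its parameters, and the example data are assumptions for demonstration, not part of the experiment itself.

```python
import numpy as np

def sq_exp_cov(a, b, variance=1.0, lengthscale=3.0):
    """Squared-exponential covariance between 1-D location arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def simple_krige(x_obs, y_obs, x_new, mean=0.0, variance=1.0, lengthscale=3.0):
    """Simple kriging: mean m + c^T C^-1 (Y - m), variance Var(Y) - c^T C^-1 c."""
    C = sq_exp_cov(x_obs, x_obs, variance, lengthscale)
    C += 1e-10 * np.eye(len(x_obs))                 # jitter for numerical stability
    c = sq_exp_cov(x_obs, x_new, variance, lengthscale)
    lam = np.linalg.solve(C, c)                     # lambda = C^-1 c
    mu = mean + lam.T @ (y_obs - mean)              # predicted mean at x_new
    var = variance - np.einsum('ij,ij->j', c, lam)  # prediction variance
    return mu, var

# Example: reconstruct a 1-D field from four observations
x_obs = np.array([0.0, 2.5, 5.0, 9.0])
y_obs = np.sin(x_obs)
x_new = np.linspace(0.0, 10.0, 101)
mu, var = simple_krige(x_obs, y_obs, x_new)
```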

Fig.1 Sample process reconstruction from points in 1-dimension.

Fig.2 Point observation from coregionalized processes in 1-dimension.

Fig.3 Sample coregionalized processes reconstruction from points in 1-dimension.

3 Concepts

Enumerating every scenario can be tedious, so I will try to significantly narrow down the scenarios in order to have a clean and easy start. First, I frame the experiment within the field of geography. But this is still overwhelming, as there are already many versions of Kriging or Gaussian Process Regression for using point observations in various situations. So, second, I will start with a special scenario where inference of a continuous geospatial field from point observations is assisted by regularly gridded data, and I envision that the growing availability of airborne data will soon underpin this scenario as a major research theme. This snippet may appear bizarre at the very beginning, since it tries to be general about a very special case. But I believe that as we build more upon the theme, its value will grow in both research and education. In a nutshell, we will:

  • explore the reliability of using point observations along with covariate(s) for field reconstruction in various scenarios;
  • explore the performance of commonly used algorithms for field reconstruction with point observations and their grid covariate;
  • examine how the different scenarios are manifested in situations with real-life data.

Thus the experiment is deployed first with dummy datasets and then with real ones.

Using Earth Observation (EO) data as a complement to, or even a replacement of, ground-based sparse measurements is prevalent in many mapping activities; relevant targets include soil, humidity, pollutants, particles, energy, etc. Such application of EO data to understanding and interpreting geographic processes has become universal with the increasing availability of satellite imagery, which establishes the interpolation of point measurements assisted by gridded imagery data as a scientific problem in its own right. In reflection of the title, I coin the problem as Sparse point interpolation with dense grid covariate(s). To investigate this problem there are, again, many potential scenarios: the grid covariate (satellite imagery data) can be either weakly or strongly correlated with the point observations; the point observations can be dense or sparse; the point observations can be regularly or irregularly distributed; one or both of the grid covariate and the point observations can be perturbed by noise; the noises can be of different properties (noise colors) and intensities... But I will try not to reduce the scenarios any further, and will inspect them incrementally.

The concept is intuitive: I will explore how continuous field reconstruction is impacted by the scenarios of observations at hand, which can simply be inspected by watching how the accuracy varies across scenarios. Then I will use real datasets for comparison and see whether such variation is manifested.
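Throughout, "accuracy" can be read as a simple error metric between the ground-truth field and its reconstruction; a minimal helper (RMSE, the metric quoted later for the basic scenario) might look like:

```python
import numpy as np

def rmse(field_true, field_est):
    """Root-mean-square error between ground-truth and reconstructed fields."""
    diff = np.asarray(field_true) - np.asarray(field_est)
    return float(np.sqrt(np.mean(diff ** 2)))
```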

3.1 Dummy dataset

To realize the concept, a dummy dataset is created representing any field that can be modelled by a function f (Fig.4). In geostatistics, we would also like to know the most conventional parameters, so that fields reconstructed by geostatistical methods can be compared and validated. Thus I also approximate the dummy dataset using the most frequently used geostatistical method, Kriging, which is more widely known as the Gaussian Process; I will stick with the term Gaussian Process on the grounds that it is more general in the field of statistics and has a longer history. Holding this dummy field with ground-truth geostatistical parameters, the point observations and their grid covariate are created out of the ground-truth field to simulate the scenarios we may encounter. For instance, a simple scenario can be achieved by a linear transformation of the ground-truth field to obtain the grid covariate, while the point observations can be semi-random samples from the ground-truth field (Fig.4). More general scenarios can be obtained by simulating and adding possible noise on top of this simple, purely linear scenario, whereby the points can also be perturbed and made sparser. A minimal sketch of this data generation follows Fig.4 below.

Fig.4 Conceptual design for examining field reconstruction through dummy point and grid covariate datasets.
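As a minimal sketch of this data generation (the grid size, random seed, and kernel choice are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid for the ground-truth field (kept small so exact GP sampling stays cheap)
n = 40
xx, yy = np.meshgrid(np.arange(n), np.arange(n))
coords = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)

# Squared-exponential covariance with lengthscale 3.0, matching the text
d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2 / 3.0 ** 2) + 1e-6 * np.eye(n * n)  # jitter for Cholesky

# Ground-truth field f: a single draw from the Gaussian Process
field = (np.linalg.cholesky(K) @ rng.standard_normal(n * n)).reshape(n, n)

# Grid covariate g: a simple linear transformation of f (here, scaling by -1.0)
covariate = -1.0 * field

# Point observations Z: a sparse sample of grid cells
idx = rng.choice(n * n, size=30, replace=False)
z_coords, z_values = coords[idx], field.ravel()[idx]
```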

The reconstruction can be achieved through either conventional geostatistical methods or more advanced machine-learning techniques; in either case the problem remains in the domain of regression, so any regression technique can be used for comparison or for setting a benchmark (to be decided as the work progresses).
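For instance, a purely regression-based baseline could treat grid coordinates plus the covariate value as features. The feature layout and hyperparameters below are assumptions, not a committed benchmark design:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def benchmark_rf(z_coords, z_values, covariate, grid_coords):
    """Hypothetical regression baseline: features are (row, col, covariate
    value); z_coords/grid_coords are assumed to be integer (row, col) grid
    indices into the 2-D covariate array."""
    rows, cols = z_coords.astype(int).T
    feats = np.column_stack([z_coords, covariate[rows, cols]])
    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(feats, z_values)

    g_rows, g_cols = grid_coords.astype(int).T
    grid_feats = np.column_stack([grid_coords, covariate[g_rows, g_cols]])
    return rf.predict(grid_feats)
```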

3.2 Real datasets

The rationale for selecting real-life cases is that a continuous field is preferred. In applying interpolation to point observations, I would first like to avoid the tricky situations of encountering sharp edges and patchiness on the surface of the field. Thus I attempt to avoid any form of recognizable edges or boundaries, in the sense of the so-called fiat and bona fide boundaries. Airborne geographic phenomena that disperse over space and time are therefore handy choices. Here, I use near-surface air temperature and the pollutant NO2 (Fig.5). Examination of these phenomena is of practical significance as they are relevant to our environment, climate, and health. Below is a sample map of surface temperature and NO2 at a local scale around the city of Utrecht, the Netherlands.

Fig.5 Real datasets: air pollution (left) and surface temperature (right).

Both maps provide the spatial distribution of temperature and NO2 at a resolution finer than 30 m. The dispersion patterns on the two maps are distinctly different, yet intuitive and informative: high temperature is observed around densely built areas, where building outlines are visible, whereas high concentrations of NO2 appear around transportation arterials, highlighting the road network. If one needs to see this kind of map as frequently as daily or even hourly, no existing sensor would meet the demand. In fact, creating such maps daily or hourly largely relies on in-situ observations, which can be dense or sparse depending on the place. In recent decades, the development of EO infrastructure has provided a significant complement to in-situ observations. However, the power of EO in complementing ground-based in-situ observations has not been thoroughly inspected; for instance, whether different spatial and temporal resolutions of EO can capture the variations of the target phenomenon. More importantly, what EO actually "sees" can be quite different from the target phenomenon measured near the ground. For temperature, the gap between surface and near-surface air temperature is insufficiently understood. This potential weak correlation aligns well with the scenario of using a less desirable grid covariate for point-observation interpolation. In short, these real-life cases fit naturally into the problem of Sparse point interpolation with dense grid covariate(s).

4 Techniques

4.1 Incremental actions

A general framework for playing with dummy datasets for scenario analysis needs to encompass (refer back to Fig.4):

  • ground truth continuous random field generated by any arbitrary function f;
  • ground truth geostatistical parameter approximated by Gaussian Process;
  • scenario of sparse or dense point observations Z generated by sampling from the ground truth field f;
  • scenario of grid covariate g generated by transforming the ground truth field f;
  • reconstruction of the ground truth f as a simulated field f' using Z and g;
  • inspection of the results in terms of accuracy/uncertainty, conducted by applying various algorithms ranging from conventional geostatistical inference to advanced machine-learning techniques; a skeleton tying these steps together is sketched after this list.
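A hypothetical driver tying the framework steps together; the function names simply mirror the bullet points above and are not a fixed API:

```python
def run_scenario(make_field, sample_points, make_covariate, reconstruct, score):
    """One pass through the framework: generate, reconstruct, inspect."""
    f_true = make_field()          # ground-truth continuous random field
    Z = sample_points(f_true)      # sparse/dense point observations
    g = make_covariate(f_true)     # grid covariate derived from the field
    f_est = reconstruct(Z, g)      # simulated field f' from Z and g
    return score(f_true, f_est)    # accuracy/uncertainty of the reconstruction
```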

Critical actions can easily be identified at each point of data generation: the ground-truth field, the point observations, and the grid covariate. Selecting a particular algorithm for reconstructing the field can also be considered an action, which I will put aside for now. Let's focus on data generation. In order to examine accuracy as a function of the properties of the generated data, any action generating data with a certain property needs to be parameterized. For instance, the accuracy can be inspected against one property of the generated data, such as the intensity (variance) of the noise added to the point samples. Concretely, the accuracy is governed by some function of the actions:

\mathrm{acc} = f(\mathrm{actions})

where the actions are parameterized according to their nature. For instance, the action of generating point samples can be parameterized by the distribution of the sample locations. In this case, by "distribution" I mean quantities such as randomness and adjacency (the sparsity or density separating the samples), and I need sampling schemes, such as blind or adaptive, to create samples following the parameterized distribution. As another example, to make the point samples less ideal and more general, the action of adding noise to the point samples can be parameterized by how intense the noise is. Here, by "intensity" I mean two quantities: variance (whether the noise values are large or small) and lengthscale (how quickly the noise varies across space); a sketch of such parameterized noise is given after Table 1. However, apart from these particular cases, how many actions are expected, and what are their parameters? It seems the actions can be arranged incrementally to render the dummy datasets from ideal to general, as in the table below:

Table 1 Actions for parameterizing scenarios that can be conducted incrementally.

Here, I identify three major actions that can be conducted incrementally to generalize the scenarios; for example, perturbation can be added on top of point sampling to approximate a realistic condition. The actions break down hierarchically into quantities to be parameterized, which successively form scenarios. Given the number of quantities to be parameterized in Table 1, many scenarios can, again, be explored. But now that the steps are clear, the starting point can be the most basic one: blind sampling of point observations, with the covariate obtained through a simple linear transformation of the ground-truth continuous field.
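As a sketch of the noise action, spatially correlated noise with a given variance and lengthscale can be drawn from a zero-mean Gaussian Process; the sample locations and parameter values here are assumptions:

```python
import numpy as np

def correlated_noise(coords, variance, lengthscale, rng):
    """One draw of spatially correlated noise from a zero-mean GP with a
    squared-exponential covariance of the given variance and lengthscale."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    K = variance * np.exp(-0.5 * d2 / lengthscale ** 2)
    K += 1e-8 * np.eye(len(coords))              # jitter for Cholesky stability
    return np.linalg.cholesky(K) @ rng.standard_normal(len(coords))

rng = np.random.default_rng(1)
pts = rng.uniform(0.0, 30.0, size=(25, 2))       # hypothetical sample locations

# Short lengthscale -> noise close to white; long lengthscale -> smooth perturbation
white_ish = correlated_noise(pts, variance=0.1, lengthscale=0.5, rng=rng)
smooth = correlated_noise(pts, variance=0.1, lengthscale=10.0, rng=rng)
```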

4.2 Action parameterization

One option for a basic scenario is blind sampling of point observations and a linearly transformed covariate without any perturbation. This basic scenario can serve as a showcase of action parameterization. Here I fix the linearly transformed covariate simply as the original ground-truth field scaled by -1.0, so only the point sampling scheme is parameterized. For blind sampling, I create the artificial point observations of the original field by applying Poisson-disc sampling, which avoids the clusters and gaps among sampled points caused by the commonly used pure random sampling and leads to a more natural point pattern distributed over an area; a sketch follows Fig.6 below. With this specific blind sampling scheme, only one parameter is needed to control the sampling action: density (or sparsity). In brief, the basic scenario is a reduced version of Fig.4, as shown in Fig.6 below.

Fig.6 Basic scenario with sparsity of point observation as parameterized action.
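A minimal dart-throwing sketch of Poisson-disc sampling (the faster Bridson algorithm would do the same job); the domain size and radii are assumptions:

```python
import numpy as np

def poisson_disc_sample(width, height, radius, n_candidates=3000, rng=None):
    """Naive dart-throwing Poisson-disc sampling: propose uniform random
    points, keeping each only if it lies at least `radius` away from all
    previously kept points. The radius controls density/sparsity."""
    rng = rng or np.random.default_rng()
    points = []
    for _ in range(n_candidates):
        p = rng.uniform([0.0, 0.0], [width, height])
        if all(np.hypot(*(p - q)) >= radius for q in points):
            points.append(p)
    return np.array(points)

sparse_pts = poisson_disc_sample(40, 40, radius=8.0)   # sparser observations
dense_pts = poisson_disc_sample(40, 40, radius=3.0)    # denser observations
```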

Frankly, the noise-free situation is too ideal to leverage the strength of any type of Gaussian Process modeling, and too ideal to serve as a standard scenario for benchmarking different inference techniques. By hypothesizing the points and the covariate as correlated Gaussian Processes, the Coregionalized Gaussian Process model implemented in the Python package GPy easily finds that the correlation between the point observations of the field and the covariate is -1.0, as the point values vary exactly opposite to the corresponding values in the covariate. Sparsifying the point observations does not affect this result, so the accuracy stays at RMSE = 0. More specifically, since I created the ground-truth field with a variation lengthscale of 3.0, the Gaussian Process model also recovers this lengthscale of the ground-truth field quite accurately as the artificial point observations are increasingly sparsified (Fig.7); a GPy sketch follows the figure.

Fig.7 Inference of the lengthscale of the ground truth field along sparsified point observations.
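A sketch of how this coregionalized fit could be set up; GPy.util.multioutput.ICM and GPy.models.GPCoregionalizedRegression are GPy's entry points for intrinsic coregionalization, while the 1-D layout, sizes, and seed are assumptions for illustration:

```python
import numpy as np
import GPy

rng = np.random.default_rng(2)

# 1-D ground-truth field with lengthscale 3.0, and its -1.0-scaled covariate
x = np.linspace(0.0, 20.0, 200)[:, None]
k_true = GPy.kern.RBF(1, variance=1.0, lengthscale=3.0)
f = rng.multivariate_normal(np.zeros(200),
                            k_true.K(x) + 1e-8 * np.eye(200))[:, None]

idx = rng.choice(200, size=15, replace=False)   # sparse point observations
X_list = [x[idx], x]                            # point locations + dense grid
Y_list = [f[idx], -1.0 * f]                     # field samples + covariate

# Intrinsic coregionalization: a shared RBF kernel across the two outputs
icm = GPy.util.multioutput.ICM(input_dim=1, num_outputs=2,
                               kernel=GPy.kern.RBF(1))
model = GPy.models.GPCoregionalizedRegression(X_list, Y_list, kernel=icm)
model.optimize()

print(model)  # inspect the recovered lengthscale and coregionalization weights
```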

It is safe to conclude that making inferences about the underlying field from only a few point observations can be quite reliable under the noise-free scenario. Of course, the covariate here is also created in a very simple and restricted manner. Since the distribution of the point observations does not seem to matter much, we need more serious experiments in which the covariate and the noise are less ideal.

5 Experiment

Fig.8 The impact of actions with regard to changing noise variance and lengthscale.