soltislab/BotanyENMWorkshops

Bug in spatial correction of small datasets

Closed this issue · 5 comments

Hi
I noticed an issue with the spatial correction script (lines 123-128 in the data cleaning section) when applied to small datasets with relatively uniform spacing among points. I initially ran the script on a 34-observation dataset and a 2.5 arc-minute resolution GeoTIFF, and I got back 3 points after correction. I integrated data from other sources to raise my initial sample size to 48, but this time I got back only 2 points after correction. I realized that by filling in gaps in the distribution, the additional data made the observations more uniformly distributed. To illustrate, imagine 100 points in a line spaced 1 m apart. Attempting to satisfy a 2 m resolution by sequentially eliminating the point with the smallest distance to its neighbor will delete all but one data point, because every point is equidistant from its neighbors. I have a workaround which is highly inelegant and computationally inefficient but seems to reasonably mitigate the glitch:
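The collapse on the 100-points-in-a-line example can be reproduced with a quick sketch (Python for illustration; the 1-D coordinates and 2 m threshold are the hypothetical example from the paragraph above, not the workshop data):

```python
def nn_dists(pts):
    """Nearest-neighbor distance for each point in a sorted 1-D list."""
    out = []
    for i in range(len(pts)):
        left = pts[i] - pts[i - 1] if i > 0 else float("inf")
        right = pts[i + 1] - pts[i] if i < len(pts) - 1 else float("inf")
        out.append(min(left, right))
    return out

def thin_first_min(pts, threshold):
    """Mimic the original loop: always drop the FIRST point at the minimum
    nearest-neighbor distance, like which(min(nnD) == nnD)[1] in R."""
    pts = sorted(pts)
    while len(pts) > 1 and min(nn_dists(pts)) < threshold:
        d = nn_dists(pts)
        pts.pop(d.index(min(d)))  # first index achieving the minimum
    return pts

line = list(range(100))               # 100 points spaced 1 m apart
print(len(thin_first_min(line, 2)))   # -> 1
```

An ideal one-point-per-pixel selection on this line would keep 50 points (every other point), so the deterministic first-index rule is pathological exactly when spacing is uniform.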

Original:

Remove a point whose nearest-neighbor distance is smaller than the resolution size

i.e., remove one point of each pair that occurs within one pixel

while (min(nndist(df[, 6:7])) < rasterResolution) {
  nnD <- nndist(df[, 6:7])                 # nearest-neighbor distances (spatstat)
  df <- df[-(which(min(nnD) == nnD)[1]), ] # drop the first point at the minimum distance
}

My solution: 1) use a conditional to apply the fix only to datasets with <300 observations (where the issue is most likely to appear), 2) randomly choose which minimum-distance point to delete from the pool of equally spaced candidates in each iteration of the while loop, and 3) repeat the while loop 50 times and keep the iteration that retained the most observations:

if (nrow(df) < 300) {
  testdf <- df
  result_holder <- list()
  for (i in 1:50) {
    result_holder[[i]] <- testdf
    while (min(nndist(result_holder[[i]][, 6:7])) < rasterResolution) {
      nnD <- nndist(result_holder[[i]][, 6:7])
      cand <- which(nnD == min(nnD))
      # sample() on a length-1 vector samples from 1:n, so guard against that
      drop <- if (length(cand) == 1) cand else sample(cand, 1)
      result_holder[[i]] <- result_holder[[i]][-drop, ]
    }
  }
  df <- result_holder[[which.max(sapply(result_holder, nrow))]]
} else {
  while (min(nndist(df[, 6:7])) < rasterResolution) {
    nnD <- nndist(df[, 6:7])
    df <- df[-(which(min(nnD) == nnD)[1]), ]
  }
}
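The restart-and-keep-best idea can be sketched in Python (illustrative only; 1-D points and a 2 m threshold stand in for the lon/lat columns and raster resolution):

```python
import random

def nn_dists(pts):
    """Nearest-neighbor distance for each point in a sorted 1-D list."""
    out = []
    for i in range(len(pts)):
        left = pts[i] - pts[i - 1] if i > 0 else float("inf")
        right = pts[i + 1] - pts[i] if i < len(pts) - 1 else float("inf")
        out.append(min(left, right))
    return out

def thin_random(pts, threshold, rng):
    """One pass: randomly pick WHICH of the minimum-distance points to drop."""
    pts = sorted(pts)
    while len(pts) > 1 and min(nn_dists(pts)) < threshold:
        d = nn_dists(pts)
        m = min(d)
        candidates = [i for i, v in enumerate(d) if v == m]
        pts.pop(rng.choice(candidates))
    return pts

def thin_best_of(pts, threshold, reps=50, seed=0):
    """Repeat the random pass and keep the run that retains the most points."""
    rng = random.Random(seed)
    return max((thin_random(pts, threshold, rng) for _ in range(reps)), key=len)

line = list(range(100))        # 100 points spaced 1 m apart
best = thin_best_of(line, 2)
print(len(best))               # far more than the 1 point the deterministic rule keeps
```

Each random pass still guarantees no two retained points sit closer than the threshold; the restarts only change how many points survive, not the one-per-pixel property.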

Best,
Tito

The goal of this section is to retain one point per pixel; this is not a form of spatial thinning. Maxent is a pseudo-absence/presence model: it only counts one point per pixel.

  • 30 arc sec ≈ 1 km square
  • 2.5 min is 5x the size of 30 arc sec, ~5 km square
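As a back-of-envelope check on those cell sizes (Python; equatorial approximation where one degree spans about 111.32 km, so the exact size shrinks with latitude):

```python
DEG_KM = 111.32                  # km per degree (approximate, at the equator)

arc_sec_30 = 30 / 3600           # 30 arc seconds in degrees
arc_min_2p5 = 2.5 / 60           # 2.5 arc minutes in degrees

print(arc_sec_30 * DEG_KM)       # ~0.93 km, commonly rounded to "1 km"
print(arc_min_2p5 * DEG_KM)      # ~4.64 km, commonly rounded to "5 km"
print(arc_min_2p5 / arc_sec_30)  # 5.0 -- 2.5 min is exactly 5x 30 arc sec
```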

Original code

while (min(nndist(df[, 6:7])) < rasterResolution) { # while points are closer together than the raster resolution (i.e. 1 km square or 5 km square)
  nnD <- nndist(df[, 6:7])                 # calculate nearest-neighbor distances
  df <- df[-(which(min(nnD) == nnD)[1]), ] # remove one of the two points that are too close together
}

Your code

  • I think randomly sampling which point is retained is fine, but limiting this to a loop of 50 for datasets under 300 points is not justified. Limiting the loop will not create a dataset equivalent to the one Maxent uses - so it will give you a false sense of degrees of freedom, make all point-based comparisons incomparable, and is not a good choice.
  • At most, I would modify the code as follows:
while (min(nndist(df[, 6:7])) < rasterResolution) {
  nnD <- nndist(df[, 6:7])
  df <- df[-(sample(which(min(nnD) == nnD))[1]), ]
}

If you want to retain more points, throw out the 2.5 min soil layers and use only the BioClim 30 arc sec layers.

I also recommend you check out Andre's recent paper: https://doi.org/10.1111/csp2.621

I agree my solution is pretty terrible, more of a 'band-aid', but at 50 reps it seems to equilibrate on retaining the same points (at least in my dataset). I totally agree that higher resolution would be ideal, but I still think that if I am putting more data in and getting less out, that points to a bug. This is all likely due to a misunderstanding on my part. To make sure:
eg.csv

If you run it on this fake 1000-point repeating-line dataset, data retention is biased toward certain longitudes. Is it working as designed?

"I also recommend you check out Andre's recent paper:https://doi.org/10.1111/csp2.621"
Thanks I'll take a look.

First, you supplied biased points:
[image: Rplot01]

AND every single one of your points is within 5 km of another:

library(fields)
nnDm <- rdist.earth(as.matrix(data.frame(lon = df$long, lat = df$lat)), miles = FALSE, R = NULL) # creates a distance matrix in km
nnDmin <- do.call(rbind, lapply(1:1099, function(i) sort(nnDm[, i])[2])) # sorts to find the minimum for each point; the first value is always 0 (each point's distance to itself), so take the second
nnDmin >= 5 # checks whether each point's minimum distance to any other point is at least 5 km
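The same nearest-neighbor check can be sketched without the fields package (Python, haversine great-circle distance with R = 6371 km; the three example coordinates are made up):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lon1, lat1, lon2, lat2, r=6371.0):
    """Great-circle distance in km, the quantity rdist.earth(miles = FALSE) returns."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * r * asin(sqrt(a))

def min_nn_km(coords):
    """Nearest-neighbor distance (km) for each (lon, lat) point -- the
    second-smallest entry of each distance-matrix column in the R snippet."""
    out = []
    for i, (lon1, lat1) in enumerate(coords):
        out.append(min(haversine_km(lon1, lat1, lon2, lat2)
                       for j, (lon2, lat2) in enumerate(coords) if j != i))
    return out

pts = [(-82.0, 29.0), (-82.0, 29.01), (-81.5, 29.5)]  # hypothetical lon/lat pairs
nn = min_nn_km(pts)
print([d < 5 for d in nn])  # -> [True, True, False]: flags points with a neighbor inside a 5 km pixel
```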

Now if I randomly sample points, as defined above, I end up sampling the red:
[image]

With sample(), you should not retain the same points each run. With only 50 reps, you will not retain one point per pixel; multiple points per pixel will remain. Without sample(), the method loops through your dataset in order, so yes, it will seem to select similar lat/longs. BUT this is not a bug or a bias.

Since this is not an error in our code, I'm going to close the issue. Please see Andre's paper for how to deal with endemic species in small ranges.

Fair. Sorry I wasted your time.