morrowcj/remotePARTS

speeding up `distm_km`

Opened this issue · 4 comments

Dear Clay and Tony,

I wanted to speed up the calculation of maximum distance for fitCor.

Instead of running distm_km on the complete lat/lon matrix one could take only the coordinates of the points with xmin, xmax, ymin, and ymax combining them into a [4,2] matrix.
This can be easily achieved by selecting these points from a matrix or a sp object.

assuming GV.df3 is a data frame with all NA excluded, with lat & lon information in lat and lng columns we can extract the needed columns (and make reprojection from LAEA to WGS84, an additional step specific to my workflow)

# coordinates 
coords <- data.frame(GV.df3[,c('lng', 'lat')])
# recalculate LAEA(3035) to WGS84(4326)
coordinates(coords) <- c('lng', 'lat')
proj4string(coords) <- CRS("+init=epsg:3035")
coordsLL <- spTransform(coords, CRS("+init=epsg:4326")) 

and then calculate dist_km on the complete dataset

start_time <- Sys.time()
max.dist <- max(distm_km(coordsLL@coords ))
end_time <- Sys.time()
end_time - start_time
> Time difference of 34.43078 min
> max.dist = 39.05035

or use an alternative solution:

start_time <- Sys.time()
xmin <- coordsLL@coords[coordsLL@coords[,'lng']== min(coordsLL@coords[,'lng'])]
xmax <- coordsLL@coords[coordsLL@coords[,'lng']== max(coordsLL@coords[,'lng'])]
ymin <- coordsLL@coords[coordsLL@coords[,'lat']== min(coordsLL@coords[,'lat'])]
ymax <- coordsLL@coords[coordsLL@coords[,'lat']== max(coordsLL@coords[,'lat'])]

d.mat <- rbind(c(xmin),c(xmax),c(ymin),c(ymax))

max.dist <- max(distm_km(d.mat))
end_time <- Sys.time()
end_time - start_time 
> Time difference of 3.392486 secs
> max.dist = 38.28768

However, what troubles me is that the estimations are tad different: 39km vs. 38.3km. It is not a massive difference, yet both methods should, in principle, give exactly the same value, as the mare difference in the alternative approach is exclusion of 'redundant'. Since I am using 30-m dataset, pixel origin is also not really at play here.

I am wondering whether you deem the alternative solution valid, and do you have, by chance, any idea why the difference in maximum distance occurs?

Thank you and cheers,
Kasia

@Erfea, I think your idea of speeding up fitCor() is a very good one. I think the best way to go about doing so would be to allow users to specify the max_dist in their function call:

fitCor(resids = residuals, max_dist=38.3, coords = NA, dist_FUN = NA, ...)

This would bypass the internal calculation of max.distance and would instead use the specified value. This also means that the user could use any method of their choosing to obtain the maximum distance of their map, improving the utility and flexibility of the function (more on that in a separate comment).

How does this sound? I believe I can implement this today.

@Erfea, I think your solution is a good one for generic cases. But, please see the (closed) issue #17 where I discuss the issues with a function I had written called max_distance() (which is in the package). This function finds the maximum distance between any points on the a map of any shape by first fitting a convex hull to the coordinates and then calculating finding the max distance between the vertices (highly reduced number of coords). The logic behind that can be found here.

This, unfortunately, does not work well for many situations with global coordinate sets (or any non-cartesean coordinate system). I do not know of a better algorithm that works well for geographic data (although, I'm sure that one must exist). I am absolutely open to a solution to #17 and will reopen that issue if you think it is worth while to do so (and if you can point me to a proper algorithm/method).

@Erfea: Actually, my first comment is wrong. I just went back through the code and, unfortunately, the distances aren't just used to find the maximum distance. For each point in the sample (a subset of resids and coords determined by fit.n), the distances are correlated with the time series residuals. So, unfortunately, providing a max distance won't speed up this function (it doesn't take much time to find the max distance if you already have all the distances). Since the distances are only calculated amongst the subset of pixels, though, this is still much faster than calculating the max distance from the full map (which was done in the original version of the code).

It may still be useful to find a better max_distance() function, but I don't believe it will speed up fitCor() in any meaningful way.

I believe the time-consuming part of this function is actually the call to nls() (see lines 168-184 of the source code for fitCor). Finding the best spatial parameters is an iterative process (i.e., mathematical optimization). As far as we (Tony and I) know, there is not a more efficient way to solve this particular problem).

Dear Clay,
Thank you for looking into it so quickly!

You are right, if the D matrix, is available, then finding the max distance is a piece of cake. fitCor is typically executed on a subset but it should give a reasonable approximation. The fitCor has been the most tricky step in the processing for me, as looking for the appropriate start values for the range and shape is not trivial. But it is a different story.
Indeed, the potential time gain on this step is neglectable comparing to the other steps.

A convex hull sounds like an appropriate solution, and I am surprised to hear it does not work well in some cases. I will give it a thought. For now, the #17 can stay closed, but I will reach out if I will find a potential solution.