vegandevs/vegan

betadisper calculate distance between samples and the centroid of a different group.

joshgsmith opened this issue · 8 comments

The documentation is clear that betadisper() computes the distance between samples and their respective centroid or median while ensuring positive-definite eigenvalues. betadisper() also returns the principal coordinates of centroids, and these can be used to calculate the distances among centroids. However, I do not see any functionality to calculate the distance between samples and a centroid belonging to another group. For example, lets say we have 100 samples (we will call them 'sites'), 50 sites belonging to period == "Before" and 50 sites to period == "After." How can we determine the distances between each site belonging to period == "Before" and the centroid of period =="After"?

where m is a distance matrix, something like:
disper_mat <- betadisper_mod(m, type="centroid", group = group_vars2$period)
returns the distances between sites and their respective centroids, independently (in this case, one for Before and one for After)

If we want the principal coordinates of each centroid, we could use:

shift_dist <- reshape2::melt(as.matrix(sqrt(dist(m$centroids[,m$eig>0]^2)- dist(m$centroids[,m$eig<0]^2))))%>% tibble::rownames_to_column("distance")

However, shift_dist only finds the distance between the two centroids, not the distance between each samples and the centroid of a different group.

In both chunks above, only the within-group distances are calculated (distances from sites to their within group centroid). Is it possible to calculate the distance both within group and across groups? Specifically, the across group component is the distances of samples belonging to group Before to the centroid belonging to group After.

This would be a fantastic utility, particularly when dealing with time series and ecological data to examine multivariate 'shift distance' relative to a centroid defined by a certain time period. As an example, lets say we have ecological abundance data spanning 2000-2023. We could use the centroid of years 2000-2005 to describe the 'reference' period, then examine the annual shift distances for each year of the time series to estimate how much the community changes during the reference period vs. each year after that.

There is no such function. However, this is R and you can always write such a function!

Here is a function that calculates distances from each sampling unit to each centroid:

`betadistances` <-
    function(x)
 {
     cnt <- x$centroids
     coord <- x$vectors
     pos <- which(x$eig >= 0)
     neg <- which(x$eig < 0)
     d <- apply(cnt[,pos], 1,
                function(z) rowSums(sweep(coord[,pos], 2, z)^2))
     if (length(neg))
         d <- d - apply(cnt[, neg], 1,
                        function(z) rowSums(sweep(coord[,neg], 2, z)^2))
     d <- as.data.frame(sqrt(d))
     cbind("group" = x$group, d)
 }

This is a proof-of-concept implementation and may not cover all corner cases.

Is this the function you asked for? What do you think we should do with this? Comments @gavinsimpson

Note: vegan has a related function meandist, but it calculates mean distances among points and not distances to centroids.

@jarioksa this is very nice! I'm not sure I fully understand how the distances are calculated without calling dist() in that function, but I will apply it my my actual data today to test its functionality.

I was toying with something like:

shift_dist <- sqrt(dist(x$vectors[,x$eig>0]^2, x$centroids[,x$eig>0]^2)- dist(x$vectors[,x$eig<0]^2, x$centroids[,x$eig<0]^2))
Which doesn't seem to produce the same distances as the function you provided.

The betadistances function appears to work very well, and using usedist::dist_setNames() on the original distance matrix helps to keep track of the sample names through betadisper and betadistances.

@joshgsmith your way will not work. I saw you crossposted to StackOverflow. This is not a good habit to collect the answers. The usedist package suggested in StackOverflow won't work with semimetric dissimilarities (such as Jaccard, Bray-Curtis etc). This is documented in the usedist (but naturally, the developer may change that later). The method suggested above will also work with semimetric dissimilarities (non-semidefinite matrices).

@joshgsmith

I'm not sure I fully understand how the distances are calculated without calling dist() in that function

The key is this bit of code:

d <- apply(cnt[,pos], 1,
                function(z) rowSums(sweep(coord[,pos], 2, z)^2))

which for each row (group) in the matrix of centroids (for the axes with positive eigenvalues). cnt[,pos] we run an anonymous function. That function computes the Euclidean distance by doing:

  1. subtract the coordinates of the centroids from the coordinates of each sample. This uses sweep() to subtract a vector (a row of cnt[,pos] from the columns of a matrix coord[,pos],
  2. those differences are squared, as per the Euclidean distance definition, and
  3. we then sum those squared differences, with rowSums() doing this for the entire matrix of samples at once.

Note we don't then take the square root of these sums to complete the Euclidean distance because we may have to repeat the process on the PCO axes whose eigenvalues are negative. This happens in the if() clause.

The penultimate line of the function takes care of this square-rooting, once we've computed and subtracted off the distances in the imaginary coordinates, if there are any.

We can't call dist() because we are not computing the distances between samples in a matrix but between the samples of one matrix and the centroids stored in another matrix.

@jarioksa This would be nice to add to vegan; I've had people ask about this on courses before, and just pointed them at the code for betadisper() to see how we do it for the actual group.

I may add this function. It needs a bit of juggling with corner cases: distances to centroids can be imaginary (or squared distances negative), dims may drop, but I guess there will always be positive eigenvalues.

What about the name of this function? betadistances was just the first thing that came to my mind.

There may be other cases where we also want to have the distance to the centroid: envfit/factorfit, ordispider, ordihull etc...

I think I'll do this. I think this could be a generic function with a method for betadisper. However, for ordiellipse analogue we should consider Mahalanobis distances instead of Euclidean in the default case.

This is excellent @jarioksa and would be very useful. I've encountered a few other projects where knowing the distance between group centroids has very practical applications. Please let me know however I can help support. I don't have as much technical knowledge about betadisper but could certainly test the function on a number of different datasets.