StatExplore

A Python Package to Facilitate Statistical Research

Distance Correlation

       In statistics, the Pearson product-moment correlation coefficient (or simply "the correlation coefficient") is a standard measure of the strength and direction of the linear relationship between two variables. It takes values in [-1, 1], where 1 implies perfect positive correlation and -1 implies perfect inverse correlation. The statistic is the ratio of the two variables' covariance (the numerator) to the product of their standard deviations (the denominator).

$$ r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y} $$

Equation 1: Pearson Product-Moment Correlation Coefficient

       A key assumption of this statistic is that the underlying relationship between the two variables is linear. However, this assumption of linearity is often not borne out in reality. Imagine we are assessing the relationship between the amount of money spent on ads targeting visitors of a given website, and the rate of conversion from visitor to paying customer. We could easily imagine a scenario where, up to a certain point, more resources spent on ads tend to increase conversion. However, there may come a point where the prevalence of ads is so great that it is actually off-putting to the consumer, accomplishing the opposite of its intended purpose. This scenario is not theoretical, but has been validated by survey data. The implication is that while ad spend may relate intimately to conversion, the correlation coefficient between these two variables is likely to be small - to the point of approaching zero.
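To make the point concrete, here is a minimal Python sketch of the Pearson coefficient applied to a relationship that rises and then falls. The data are hypothetical and chosen only for illustration: `spend` stands in for ad spend, and `conversion_peaked` for a conversion rate that peaks and then declines.

```python
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation: covariance over the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors cancel between numerator and denominator, so sums suffice.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical illustration data (not taken from any survey)
spend = [0, 1, 2, 3, 4]
conversion_linear = [1, 3, 5, 7, 9]   # rises steadily with spend
conversion_peaked = [0, 1, 2, 1, 0]   # rises, then falls once ads become off-putting

r_linear = pearson(spend, conversion_linear)
r_peaked = pearson(spend, conversion_peaked)
print(r_linear, r_peaked)
```

The linear series gives a coefficient of 1 (up to floating-point rounding), while the peaked series gives a coefficient of essentially zero even though conversion is completely determined by spend.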

       The scatterplots below illustrate how, when the relationship between two variables involves a change in direction, the Pearson Product-Moment Correlation Coefficient fails to report the true degree of dependence between variables.

Image 1: Sets of Pearson Correlation Coefficients

SOURCE: https://commons.wikimedia.org/wiki/File:Correlation_examples2.svg

       In 2007, Gábor J. Székely called attention to this important limitation of the correlation coefficient and introduced the concept of 'distance correlation' as part of his conception of 'E-statistics' - statistics concerning the energy distance between probability distributions. Within the framework of E-statistics, Székely reformulated many classical statistical concepts: 'distance variance' in place of variance, 'distance standard deviation' in place of standard deviation, and 'distance covariance' in place of covariance. Using these, the definition of the correlation coefficient can be rewritten in such a way that a value of zero occurs if and only if the two variables are genuinely independent.

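$$ \operatorname{dCor}^2(X, Y) = \frac{\operatorname{dCov}^2(X, Y)}{\sqrt{\operatorname{dVar}^2(X)\,\operatorname{dVar}^2(Y)}} $$

Equation 2: Distance Correlation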

Image 2: Sets of Distance Correlation Coefficients

SOURCE: https://commons.wikimedia.org/wiki/File:Distance_Correlation_Examples.svg

Calculating the Distance Covariance

For example, let's create some data using R:

x = c(0, 1, 2, 3, 4) 
y = c(2, 1, 0, 1, 2) 

Next, we derive a matrix for each variable containing the pairwise distances for that variable. For the purposes of calculating the distance covariance, we use the Euclidean distance. If we were exploring two-dimensional observations (for example, points p = (p_1, p_2) and q = (q_1, q_2) on the Cartesian plane), the appropriate formulation of the Euclidean distance would be as follows:

$$ d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2} $$

However, in the example below X and Y are each univariate, and so the Euclidean distance reduces to the absolute value of the difference between observations:

$$ d(x_j, x_k) = |x_j - x_k| $$

This can be done in R by calling the 'dist' function and specifying "euclidean" as the method.

x_mat <- dist(x, method = "euclidean", diag = TRUE, upper = TRUE, p = 2)
y_mat <- dist(y, method = "euclidean", diag = TRUE, upper = TRUE, p = 2)
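For readers following along in Python, the same pairwise-distance step can be sketched with the standard library alone. The helper name `pairwise_distances` is introduced here for illustration and is not part of any package; it assumes univariate data, so the Euclidean distance is just an absolute difference.

```python
def pairwise_distances(values):
    """Return the n x n matrix of absolute differences |v_j - v_k|."""
    return [[abs(a - b) for b in values] for a in values]

# Same example data as the R snippet above
x = [0, 1, 2, 3, 4]
y = [2, 1, 0, 1, 2]

x_dist = pairwise_distances(x)
y_dist = pairwise_distances(y)

for row in x_dist:
    print(row)
```

Unlike R's 'dist', which stores only one triangle of the matrix, this sketch returns the full symmetric matrix with zeros on the diagonal.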

We will also need the column and row means from these distance matrices, as well as the grand mean of those means. If you were to derive these manually, you might use a function like the following:

take_doubly_centered_distances <- function(x_mat) {
    library(reshape2)
    # Reshape the distance matrix to long format: one (row, col, value) record per cell
    x_df               <- melt(as.matrix(x_mat), varnames = c("row", "col"))
    # Mean distance within each row
    x_row_means        <- aggregate(x_df, list(x_df$row), mean)
    x_row_means        <- subset(x_row_means, select = -c(Group.1, col))
    names(x_row_means) <- c("row", "row_mean")
    x_df               <- merge(x=x_df, y=x_row_means, by="row")
    # Mean distance within each column
    x_col_means        <- aggregate(x_df, list(x_df$col), mean)
    x_col_means        <- subset(x_col_means, select = -c(Group.1, row, row_mean))
    names(x_col_means) <- c("col", "col_mean")
    x_df               <- merge(x=x_df, y=x_col_means, by="col")
    # Grand mean of the row and column means
    x_df$grand_mean    <- mean(c(x_row_means$row_mean, x_col_means$col_mean))
    # Doubly centered value: distance - row mean - column mean + grand mean
    x_df$X             <- x_df$value - x_df$row_mean - x_df$col_mean + x_df$grand_mean
    # Reassemble the centered values into a square matrix, ordered by column then row
    x_df <- x_df[with(x_df, order(col, row)), ]
    myList <- list()
    for (i in unique(x_df[["col"]])){
      myList[[length(myList)+1]] <- x_df[x_df$col == i,]$X
    }
    output <- matrix(unlist(myList), ncol = length(unique(x_df[["col"]])), byrow = TRUE)
    return(output)
}

For the example data, the pairwise distances and their row, column, and grand means are as follows:

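X Pair-Wise Distances

| X | A | B | C | D | E | Row Mean |
|---|---|---|---|---|---|---|
| A | 0 | 1 | 2 | 3 | 4 | 2 |
| B | 1 | 0 | 1 | 2 | 3 | 1.4 |
| C | 2 | 1 | 0 | 1 | 2 | 1.2 |
| D | 3 | 2 | 1 | 0 | 1 | 1.4 |
| E | 4 | 3 | 2 | 1 | 0 | 2 |
| Column Mean | 2 | 1.4 | 1.2 | 1.4 | 2 | Grand Mean = 1.6 |

Y Pair-Wise Distances

| Y | A | B | C | D | E | Row Mean |
|---|---|---|---|---|---|---|
| A | 0 | 1 | 2 | 1 | 0 | 0.8 |
| B | 1 | 0 | 1 | 0 | 1 | 0.6 |
| C | 2 | 1 | 0 | 1 | 2 | 1.2 |
| D | 1 | 0 | 1 | 0 | 1 | 0.6 |
| E | 0 | 1 | 2 | 1 | 0 | 0.8 |
| Column Mean | 0.8 | 0.6 | 1.2 | 0.6 | 0.8 | Grand Mean = 0.8 |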

Tables 1 & 2: Pair-Wise Distances
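The marginal means in Tables 1 & 2 can be checked with a short Python sketch. The helper names (`row_means`, `col_means`, `grand_mean`) are introduced here purely for illustration, and the X distance matrix is typed in directly from Table 1.

```python
def row_means(matrix):
    """Mean of each row."""
    return [sum(row) / len(row) for row in matrix]

def col_means(matrix):
    """Mean of each column (zip(*matrix) iterates over columns)."""
    return [sum(col) / len(col) for col in zip(*matrix)]

def grand_mean(matrix):
    """Mean of every cell in the matrix."""
    cells = [value for row in matrix for value in row]
    return sum(cells) / len(cells)

# X pairwise distances, copied from Table 1
x_dist = [
    [0, 1, 2, 3, 4],
    [1, 0, 1, 2, 3],
    [2, 1, 0, 1, 2],
    [3, 2, 1, 0, 1],
    [4, 3, 2, 1, 0],
]

x_row_means = row_means(x_dist)
x_col_means = col_means(x_dist)
x_grand_mean = grand_mean(x_dist)
print(x_row_means, x_col_means, x_grand_mean)
```

Because a distance matrix is symmetric, the row means and column means coincide, and the grand mean of all cells equals the mean of the row means.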

We need to doubly center these distance matrices. "Doubly" in this context means that from each element we subtract both its row mean and its column mean, and then add back the grand mean.

The resulting matrices should have all rows and all columns sum to zero.
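A compact Python sketch of the double-centering step, including the zero-sum check just described, might look as follows. The function name `doubly_center` is an assumption made for this example, and the input is assumed to be a square matrix given as a list of lists.

```python
def doubly_center(dist):
    """Subtract row and column means from each cell, then add the grand mean."""
    n = len(dist)
    row_mean = [sum(row) / n for row in dist]
    col_mean = [sum(col) / n for col in zip(*dist)]
    grand = sum(row_mean) / n
    return [
        [dist[j][k] - row_mean[j] - col_mean[k] + grand for k in range(n)]
        for j in range(n)
    ]

x = [0, 1, 2, 3, 4]
x_dist = [[abs(a - b) for b in x] for a in x]  # univariate pairwise distances
x_centered = doubly_center(x_dist)

# Every row and every column should sum to zero, up to floating-point error
row_sums = [sum(row) for row in x_centered]
col_sums = [sum(col) for col in zip(*x_centered)]
print(row_sums, col_sums)
```

The sums will typically come out as tiny values on the order of 1e-16 rather than exact zeros, so any automated check should compare against a small tolerance.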

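X Doubly Centered Distances

| X | A | B | C | D | E | Row Sum |
|---|---|---|---|---|---|---|
| A | -2.4 | -0.8 | 0.4 | 1.2 | 1.6 | 0 |
| B | -0.8 | -1.2 | 0 | 0.8 | 1.2 | 0 |
| C | 0.4 | 0 | -0.8 | 0 | 0.4 | 0 |
| D | 1.2 | 0.8 | 0 | -1.2 | -0.8 | 0 |
| E | 1.6 | 1.2 | 0.4 | -0.8 | -2.4 | 0 |
| Column Sum | 0 | 0 | 0 | 0 | 0 | |

Y Doubly Centered Distances

| Y | A | B | C | D | E | Row Sum |
|---|---|---|---|---|---|---|
| A | -0.8 | 0.4 | 0.8 | 0.4 | -0.8 | 0 |
| B | 0.4 | -0.4 | 0 | -0.4 | 0.4 | 0 |
| C | 0.8 | 0 | -1.6 | 0 | 0.8 | 0 |
| D | 0.4 | -0.4 | 0 | -0.4 | 0.4 | 0 |
| E | -0.8 | 0.4 | 0.8 | 0.4 | -0.8 | 0 |
| Column Sum | 0 | 0 | 0 | 0 | 0 | |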

Tables 3 & 4: Distance Matrices After Doubly Centering

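Next, we take the arithmetic average of the element-wise products of the two doubly centered matrices. The sum of these products is also known as the Frobenius inner product; multiplying it by 1 over n squared yields the arithmetic average. Writing A and B for the doubly centered distance matrices of X and Y, and n for the number of observations:

$$ \operatorname{dCov}_n^2(X, Y) = \frac{1}{n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} A_{j,k} \, B_{j,k} $$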

Equation 3: Squared Sample Distance Covariance
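As a rough check of this step, the following Python sketch computes the Frobenius inner product and the squared sample distance covariance from the doubly centered matrices. The matrices are typed in from Tables 3 & 4, and the function name `frobenius_inner_product` is introduced here for illustration only.

```python
from math import sqrt

def frobenius_inner_product(a, b):
    """Sum of the element-wise products of two equally sized matrices."""
    return sum(
        a_val * b_val
        for a_row, b_row in zip(a, b)
        for a_val, b_val in zip(a_row, b_row)
    )

# Doubly centered distance matrices, copied from Tables 3 & 4
A = [
    [-2.4, -0.8, 0.4, 1.2, 1.6],
    [-0.8, -1.2, 0.0, 0.8, 1.2],
    [0.4, 0.0, -0.8, 0.0, 0.4],
    [1.2, 0.8, 0.0, -1.2, -0.8],
    [1.6, 1.2, 0.4, -0.8, -2.4],
]
B = [
    [-0.8, 0.4, 0.8, 0.4, -0.8],
    [0.4, -0.4, 0.0, -0.4, 0.4],
    [0.8, 0.0, -1.6, 0.0, 0.8],
    [0.4, -0.4, 0.0, -0.4, 0.4],
    [-0.8, 0.4, 0.8, 0.4, -0.8],
]

n = len(A)
dcov_squared = frobenius_inner_product(A, B) / n ** 2  # arithmetic average of products
dcov = sqrt(dcov_squared)
print(dcov_squared, dcov)
```

For these tables the inner product is 4.8, so the squared distance covariance is 4.8 / 25 = 0.192 and its square root is approximately 0.438.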

We can do this manually in R with the 'matrixcalc' package.

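arithmetic_average_of_products <- function(x_mat, y_mat) {
  library(matrixcalc)
  # Both doubly centered matrices must have identical dimensions
  if (!identical(dim(x_mat), dim(y_mat))) {
    stop("x_mat and y_mat must have the same dimensions")
  }
  # Frobenius inner product: sum of the element-wise products
  val <- frobenius.prod(x_mat, y_mat)
  # Divide by n^2 to get the arithmetic average
  return(val / nrow(x_mat)^2)
}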

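Finally, we take the square root of this result to get the sample distance covariance. If we compare the result with R's 'energy' package, we see that the two agree. (In the call below, x_mat and y_mat are assumed to hold the doubly centered matrices returned by take_doubly_centered_distances, not the raw 'dist' objects.)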

> arithmetic_average_of_products(x_mat, y_mat)^(1/2)
    0.438178
> 
> library(energy)
> dcov.test(x, y, index = 1.0, R = NULL)

	Specify the number of replicates R (R > 0) for an independence test

data:  index 1, replicates 0
nV^2 = 0.96, p-value = NA
sample estimates:
    dCov 
    0.438178 
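The steps above can also be chained end to end in Python. The sketch below is a minimal standard-library version, not the package's implementation, and every function name in it is hypothetical. It also goes one step further and computes the distance correlation, under the assumption (standard in Székely's framework, but not spelled out above) that the squared distance variance of a variable is its squared distance covariance with itself.

```python
from math import sqrt

def centered_distances(values):
    """Pairwise absolute differences, doubly centered."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_mean = [sum(row) / n for row in dist]
    grand = sum(row_mean) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[j][k] - row_mean[j] - row_mean[k] + grand for k in range(n)]
        for j in range(n)
    ]

def dcov_squared(a, b):
    """Average of the element-wise products of two centered matrices."""
    n = len(a)
    return sum(a[j][k] * b[j][k] for j in range(n) for k in range(n)) / n ** 2

def distance_covariance(xs, ys):
    a, b = centered_distances(xs), centered_distances(ys)
    # Clamp at zero to guard against tiny negative values from rounding
    return sqrt(max(dcov_squared(a, b), 0.0))

def distance_correlation(xs, ys):
    a, b = centered_distances(xs), centered_distances(ys)
    # Assumption: squared distance variance is dCov^2 of a variable with itself
    dvar_product = dcov_squared(a, a) * dcov_squared(b, b)
    if dvar_product == 0:
        return 0.0
    return sqrt(max(dcov_squared(a, b), 0.0) / sqrt(dvar_product))

x = [0, 1, 2, 3, 4]
y = [2, 1, 0, 1, 2]
dcov = distance_covariance(x, y)
dcor = distance_correlation(x, y)
print(dcov, dcor)
```

On the example data the distance covariance matches the 0.438178 reported above, and the distance correlation is clearly non-zero for this V-shaped relationship.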

Calculating the Distance Standard Deviations

References