KaveIO/PhiK

Unstable correlation values for uncorrelated variable pair

winston-zillow opened this issue · 3 comments

I found that Phi-k produces vastly different values for the uncorrelated variable pair when a sample is resampled with replacement (such as for bootstrapping.) But the correlation values for the correlated variable pair are stable. The distributions of the correlation also look very different when sampling with replacement is used, as seen in the follow plots:
Screen Shot 2022-04-04 at 11 39 37 AM
than that for the unresampled data:
Screen Shot 2022-04-04 at 11 39 05 AM

The test below shows the issue. As seen, even just replicating the sample twice would produce far away correlation value from the original.

import numpy as np
import pandas as pd
from phik import phik_matrix
from phik.phik import auto_bin_data

def gen_data(n=4000):
	x = np.random.normal(0.0, 5.0, size=n)
	y = x * 1.23 + np.random.uniform(-8.0, 8.0, size=len(x))
	data = {'x': x, 'y': y}
	z = np.random.uniform(-8.0, 8.0, size=len(x))
	z2 = np.random.uniform(-8.0, 8.0, size=len(x))
	cat = np.random.choice(['cat', 'dog', 'mouse'], size=len(x), replace=True)
	data.update({'uncorr_z': z, 'uncorr_z2': z2, 'uncorr_trinary': cat})
	return pd.DataFrame(data)

def phik_bins(X: pd.DataFrame, interval_cols):
    _, binning_dict = auto_bin_data(df=X, interval_cols=interval_cols, bins=20, quantile=True)
    return {k: [pairs[0][0]] + [edges[1] for edges in pairs]
            for k, pairs in binning_dict.items()}

def phik_corr_matrix(X: pd.DataFrame, interval_cols, bins, noise_correction=True):
	# have also tried noise_correction=True, same results
    return phik_matrix(X, interval_cols, bins, noise_correction=noise_correction)

X = gen_data(4000)
interval_cols = [col for col in X.columns if not col.endswith('ary')]

bins = phik_bins(X, interval_cols)
print(f"corr matrix for the full dataset. len={len(X)}")
print(phik_corr_matrix(X, interval_cols, bins))
print('--------------')

x = pd.concat([X, X])
x = x.sample(4000, replace=False)
print(f"corr matrix for the full dataset concatenated 2x and resample without replacement. len={len(x)}")
print(phik_corr_matrix(x, interval_cols, bins))
print('--------------')

x = pd.concat([X, X, X, X])
print(f"corr matrix for the full dataset concatenated 4x. len={len(x)}")
print(phik_corr_matrix(x, interval_cols, bins))
print('--------------')

x = X.sample(4000, replace=True)
print(f"corr matrix for sample with replacement (as for bootstrap). len={len(x)}")
print(phik_corr_matrix(x, interval_cols, bins))
print('--------------')

x = X.sample(4000, replace=False)
print(f"corr matrix for sample without replacement. len={len(x)}")
print(phik_corr_matrix(x, interval_cols, bins))

The output:

corr matrix for the full dataset. len=4000
                       x         y  uncorr_z  uncorr_z2  uncorr_trinary
x               1.000000  0.758600  0.082232   0.000000        0.089581
y               0.758600  1.000000  0.092561   0.000000        0.000000
uncorr_z        0.082232  0.092561  1.000000   0.000000        0.021255
uncorr_z2       0.000000  0.000000  0.000000   1.000000        0.025393
uncorr_trinary  0.089581  0.000000  0.021255   0.025393        1.000000
--------------
corr matrix for the full dataset concatenated 2x and resample without replacement. len=4000
                       x         y  uncorr_z  uncorr_z2  uncorr_trinary
x               1.000000  0.763103  0.231890   0.179819        0.156664
y               0.763103  1.000000  0.247571   0.217923        0.063653
uncorr_z        0.231890  0.247571  1.000000   0.180514        0.081663
uncorr_z2       0.179819  0.217923  0.180514   1.000000        0.100264
uncorr_trinary  0.156664  0.063653  0.081663   0.100264        1.000000
--------------
corr matrix for the full dataset concatenated 4x. len=16000
                       x         y  uncorr_z  uncorr_z2  uncorr_trinary
x               1.000000  0.767556  0.268428   0.245739        0.144339
y               0.767556  1.000000  0.271354   0.256555        0.080950
uncorr_z        0.268428  0.271354  1.000000   0.239031        0.116489
uncorr_z2       0.245739  0.256555  0.239031   1.000000        0.117280
uncorr_trinary  0.144339  0.080950  0.116489   0.117280        1.000000
--------------
corr matrix for sample with replacement (as for bootstrap). len=4000
                       x         y  uncorr_z  uncorr_z2  uncorr_trinary
x               1.000000  0.758930  0.306785   0.283761        0.171000
y               0.758930  1.000000  0.322646   0.308904        0.000000
uncorr_z        0.306785  0.322646  1.000000   0.285894        0.111977
uncorr_z2       0.283761  0.308904  0.285894   1.000000        0.170866
uncorr_trinary  0.171000  0.000000  0.111977   0.170866        1.000000
--------------
corr matrix for sample without replacement. len=4000
                       x         y  uncorr_z  uncorr_z2  uncorr_trinary
x               1.000000  0.758600  0.082232   0.000000        0.089581
y               0.758600  1.000000  0.092561   0.000000        0.000000
uncorr_z        0.082232  0.092561  1.000000   0.000000        0.021255
uncorr_z2       0.000000  0.000000  0.000000   1.000000        0.025393
uncorr_trinary  0.089581  0.000000  0.021255   0.025393        1.000000

Is that expected? Or there's some problems with the implementation?

mbaak commented

Hello, thanks for posting this.

This behaviour is expected, in particular for relatively small datasets (low stats per contingency table bin) with a lot of resampling. Resampling introduces duplicate data points, meaning that the fluctuations in each contingency table bin are no longer described by a normal chi-square distribution.
That's an important assumption behind PhiK: the chi2 distribution assumes sqrt(n_i) statistical fluctuations in each bin i (with n_i data points), which is only the case when all data points in a bin are independent. With resampling the fluctuations per bin are bigger, b/c there are fewer independent data points per bin - in essence the independent fluctuations per bin are scaled up by the duplicated data points. This leads to a bigger chi2 value, leading to a bigger value of phik, which is what you see.

What you have found is essentially issue 26:
#26
which is that PhiK does not work for weighted data.

I'll probably add a warning about this in the readme.

Thanks.

In addition, I found that I made a small copying mistake (already fixed) in the above code snippet: the noise_correction argument should be defaulted to True as is the case in phik_matrix to produce the results.

If I set noise_correction=False, then the correlation values for the uncorrelated variables are high regardless if replacement resampling is done or not, as shown:

X = gen_data(4000)
interval_cols = [col for col in X.columns if not col.endswith('ary')]

bins = phik_bins(X, interval_cols)
print(f"corr matrix for the full dataset. len={len(X)}")
print(phik_matrix(X, interval_cols, bins, noise_correction=False))
print('--------------')

x = pd.concat([X, X]) # .iloc[shuffle(np.arange(0, 2*len(X)))]
print(f"corr matrix for the full dataset concatenated 2x and resample without replacement. len={len(x)}")
print(phik_corr_matrix(x, interval_cols, bins, noise_correction=False))
print('--------------')

x = pd.concat([X, X, X, X]) # .iloc[shuffle(np.arange(0, 4*len(X)).astype(np.int32))]
print(f"corr matrix for the full dataset concatenated 4x. len={len(x)}")
print(phik_corr_matrix(x, interval_cols, bins, noise_correction=False))
print('--------------')

x = X.sample(4000, replace=True)
print(f"corr matrix for sample with replacement (as for bootstrap). len={len(x)}")
print(phik_corr_matrix(x, interval_cols, bins, noise_correction=False))
print('--------------')

x = X.sample(4000, replace=False)
print(f"corr matrix for sample without replacement. len={len(x)}")
print(phik_corr_matrix(x, interval_cols, bins, noise_correction=False))

produces

corr matrix for the full dataset. len=4000
                       x         y  uncorr_z  uncorr_z2  uncorr_trinary
x               1.000000  0.763830  0.290781   0.295869        0.147467
y               0.763830  1.000000  0.295795   0.291159        0.127338
uncorr_z        0.290781  0.295795  1.000000   0.292964        0.111849
uncorr_z2       0.295869  0.291159  0.292964   1.000000        0.118586
uncorr_trinary  0.147467  0.127338  0.111849   0.118586        1.000000
--------------
corr matrix for the full dataset concatenated 2x and resample without replacement. len=8000
                       x         y  uncorr_z  uncorr_z2  uncorr_trinary
x               1.000000  0.763830  0.290781   0.295869        0.147467
y               0.763830  1.000000  0.295795   0.291159        0.127338
uncorr_z        0.290781  0.295795  1.000000   0.292964        0.111849
uncorr_z2       0.295869  0.291159  0.292964   1.000000        0.118586
uncorr_trinary  0.147467  0.127338  0.111849   0.118586        1.000000
--------------
corr matrix for the full dataset concatenated 4x. len=16000
                       x         y  uncorr_z  uncorr_z2  uncorr_trinary
x               1.000000  0.763830  0.290781   0.295869        0.147467
y               0.763830  1.000000  0.295795   0.291159        0.127338
uncorr_z        0.290781  0.295795  1.000000   0.292964        0.111849
uncorr_z2       0.295869  0.291159  0.292964   1.000000        0.118586
uncorr_trinary  0.147467  0.127338  0.111849   0.118586        1.000000
--------------
corr matrix for sample with replacement (as for bootstrap). len=4000
                       x         y  uncorr_z  uncorr_z2  uncorr_trinary
x               1.000000  0.779822  0.363354   0.388268        0.202001
y               0.779822  1.000000  0.389631   0.400580        0.181017
uncorr_z        0.363354  0.389631  1.000000   0.391259        0.162702
uncorr_z2       0.388268  0.400580  0.391259   1.000000        0.172511
uncorr_trinary  0.202001  0.181017  0.162702   0.172511        1.000000
--------------
corr matrix for sample without replacement. len=4000
                       x         y  uncorr_z  uncorr_z2  uncorr_trinary
x               1.000000  0.763830  0.290781   0.295869        0.147467
y               0.763830  1.000000  0.295795   0.291159        0.127338
uncorr_z        0.290781  0.295795  1.000000   0.292964        0.111849
uncorr_z2       0.295869  0.291159  0.292964   1.000000        0.118586
uncorr_trinary  0.147467  0.127338  0.111849   0.118586        1.000000

What does this noise_correction do to cause the big discrepancy?

mbaak commented

The impact of the noise_threshold is as intended. The chi2 value of the contingency table only gets interpreted as a tilted bivariate normal distribution (with correlation phik) when it passes a minimum threshold value. The reason behind this is that zero-correlation still produces a (small) non-zero chi2 value. If below threshold, phik is set to zero. In your studies, turning on the noise_threshold causes a/the peak at phik=0. When turned off phik is always evaluated. I recommend setting it to true.

For a detailed description see section 4.4 of our publication:
https://arxiv.org/pdf/1811.11440.pdf

Be sure to check the statistical significances of the correlations found. I suspect they are low in this case. For low statistics samples phik is affected by statistical fluctuations (like any correlation constant). In case of low statistics the spread in the reported values goes up, but the median is unbiased (see publication). So in case of a low statistics sample, be sure to always check the significance of the correlations found (again this holds for any correlation constant).

For phik you can call:
df.significance_matrix()
to get the Z-values.

In practice, due to the threshold, phik is only evaluated when (roughly) Z>0.5. For uncorrelated distributions the Z scores lie around zero, and also none of the Z scores are truly significant, say with Z>5. Btw I'm assuming independent data points here.