mateuszbuda/ml-stat-util

Possibly Incorrect Calculation of p-Value

Closed this issue · 3 comments

Hi Mateusz. Thanks for the super helpful repo! But if I'm not mistaken, I think there's a theoretical error in the way you're calculating your p-value.

Right now, you're computing the p-value for a difference in the performance of two models, measured with AUROC, using the following code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score

import stat_util

# Bootstrap p-value and list of bootstrapped AUROC differences between the two models
p, z = stat_util.pvalue(y_true, y_pred1, y_pred2, score_fun=roc_auc_score)

# Histogram of the bootstrapped differences, with a vertical line at 0
bins = plt.hist(z)
plt.plot([0, 0], [0, np.max(bins[0])], color="black")
print('p =', p)
plt.xlabel('Difference in AUROC')
plt.ylabel('Frequency')
plt.show()

[Figure: histogram of the bootstrapped differences in AUROC, with a vertical line at 0]
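
(For reference, the snippet above assumes y_true, y_pred1, and y_pred2 already exist; a purely hypothetical setup, just to make the example self-contained, could be:)

import numpy as np

# Illustrative data only: binary ground truth and scores from two hypothetical models
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_pred1 = 0.6 * y_true + 0.4 * rng.random(200)  # model 1: scores fairly correlated with the labels
y_pred2 = 0.4 * y_true + 0.6 * rng.random(200)  # model 2: noisier scores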

Here, your null hypothesis is that the two models perform equally well, i.e. the difference between their AUROCs is equal to 0, and your alternative hypothesis is that model 1 performs better than model 2 (based on the internal code of the function). z contains the differences in AUROC for randomly resampled subsets of predictions from both models. p is the probability of the difference in AUROC being 0. In other words, your p-value is the probability of the null hypothesis (difference in AUROC = 0) being true. However, theoretically, the p-value is the probability of observing a sample statistic at least as extreme as the one actually observed, given that the null hypothesis is true. These two quantities aren't the same.
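
Just so we're talking about the same thing, here is a rough sketch of the kind of bootstrap I understand z to come from (the helper name and details here are my own approximation, not the actual stat_util implementation):

import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_differences(y_true, y_pred1, y_pred2, n_bootstraps=2000, seed=0):
    # Resample cases with replacement and record the difference in AUROC on each resample
    rng = np.random.default_rng(seed)
    y_true, y_pred1, y_pred2 = map(np.asarray, (y_true, y_pred1, y_pred2))
    diffs = []
    while len(diffs) < n_bootstraps:
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # AUROC is undefined if a resample has only one class
            continue
        diffs.append(roc_auc_score(y_true[idx], y_pred1[idx])
                     - roc_auc_score(y_true[idx], y_pred2[idx]))
    return np.array(diffs)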

Hence, the revision I propose, using your same function, is this:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import percentileofscore
from sklearn.metrics import roc_auc_score

import stat_util

# Only the bootstrapped differences are needed here, not the original p-value
_, z = stat_util.pvalue(y_true, y_pred1, y_pred2, score_fun=roc_auc_score)

# Observed sample statistic (difference in AUROC of model 1 and model 2)
sample_diff = roc_auc_score(y_true, y_pred1) - roc_auc_score(y_true, y_pred2)

# Simulate the null distribution using the spread of the bootstrapped values
null_vals = np.random.normal(loc=0.0, scale=np.std(np.abs(z)), size=2000)
null_dist = plt.hist(null_vals, color='red', alpha=1.0)

# Display the observed sample statistic in the same plot
plt.axvline(sample_diff, color="black")

# Compute the p-value: proportion of null values greater than the observed statistic
print('p =', 1 - percentileofscore(null_vals, sample_diff, kind="weak") / 100.0)
plt.xlabel('Difference in AUROC')
plt.ylabel('Frequency')
plt.show()

[Figure: simulated null distribution (red histogram) with the observed difference in AUROC marked by a vertical line]

Here, the p-value is the probability of observing a difference in AUROC between the two models equal to the observed sample statistic [roc_auc_score(y_true, y_pred1) - roc_auc_score(y_true, y_pred2)] or greater, given that the null hypothesis is true.
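
To make the percentileofscore step concrete, here is a toy example with made-up numbers:

from scipy.stats import percentileofscore

# With kind="weak", percentileofscore returns the percentage of values <= the given score
null_vals = [-0.02, -0.01, 0.0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07]
sample_diff = 0.05
p = 1 - percentileofscore(null_vals, sample_diff, kind="weak") / 100.0
print(p)  # ~0.2, i.e. the fraction of null values strictly greater than 0.05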

If we only compare whether the p-value falls below a certain significance level (e.g. 0.05) or not, then your method and mine agree for all 10 cases of different model predictions that I've tried. However, the actual p-values themselves always differ.
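
A sketch of the kind of side-by-side check I mean, reusing y_true, y_pred1, and y_pred2 from above:

import numpy as np
from scipy.stats import percentileofscore
from sklearn.metrics import roc_auc_score

import stat_util

p_repo, z = stat_util.pvalue(y_true, y_pred1, y_pred2, score_fun=roc_auc_score)
sample_diff = roc_auc_score(y_true, y_pred1) - roc_auc_score(y_true, y_pred2)
null_vals = np.random.normal(loc=0.0, scale=np.std(np.abs(z)), size=2000)
p_proposed = 1 - percentileofscore(null_vals, sample_diff, kind="weak") / 100.0
print(p_repo, p_proposed)  # same side of 0.05 in the cases I tried, but different values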

What do you think?

References:
[1] https://www.khanacademy.org/math/ap-statistics/tests-significance-ap/idea-significance-tests/v/p-values-and-significance-tests
[2] https://towardsdatascience.com/bootstrapping-for-inferential-statistics-9b613a7653b2

Thanks for opening the issue. I'll look into it as soon as I have some time.

Hi Anindo. First, the method for computing the p-value and CIs implemented in this repo was not my idea; it was worked out in consultation with a statistician.

In z, all differences that are smaller than or equal to 0 are at least as extreme as the result actually observed, assuming the null hypothesis that the difference between AUCs is equal to 0. This corresponds to a one-sided p-value.
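
Roughly, in code (a sketch of what I mean, not necessarily line-for-line what the repo does):

import numpy as np

# One-sided p-value: proportion of bootstrapped differences that are <= 0,
# i.e. at least as extreme as "no improvement" under the null of equal AUCs
p_one_sided = np.mean(np.asarray(z) <= 0)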

I think your interpretation is incorrect.
The computed p-value is not the probability of the null hypothesis (difference in AUC = 0) being true.

Not sure if my explanation is convincing.

It seems to me that your suggested method is an approximation of bootstrapping, which doesn't make sense to me: you add one more step of sampling from a normal distribution whose variance is computed from the bootstrapped values. Also, your computed p-value is one-sided, and you should multiply it by 2.
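
For example (a sketch, reusing null_vals and sample_diff from your snippet):

from scipy.stats import percentileofscore

# Your one-sided p-value, doubled (and capped at 1) to make it comparable to a two-sided test
p_one_sided = 1 - percentileofscore(null_vals, sample_diff, kind="weak") / 100.0
p_two_sided = min(1.0, 2 * p_one_sided)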

Hey Mateusz. Thanks for the prompt response. Yes, that's convincing; I see it differently now.
Thanks for the clarification!