possible issue with data splitting in all notebook examples
Closed this issue · 3 comments
Hi, I have a question regarding how the data splitting is done in the example notebooks. It may be obvious, so I don't know whether it is a real problem or intentional. Take `galaxies.ipynb` as an example (I think this snippet is reused everywhere, though):
```python
# Run prediction-powered inference and classical inference for many values of n
results = []
for i in tqdm(range(ns.shape[0])):
    for j in range(num_trials):
        # Prediction-Powered Inference
        n = ns[i]
        rand_idx = np.random.permutation(n_total)
        _Yhat = Yhat_total[rand_idx[:n]]
        _Y = Y_total[rand_idx[:n]]
        # _Yhat_unlabeled = Yhat_total[rand_idx[n:]]  # --> possible correction
        _Yhat_unlabeled = Yhat_total[n:]  # --> this is not using the permuted indices?
        ppi_ci = ppi_mean_ci(_Y, _Yhat, _Yhat_unlabeled, alpha=alpha)
```
It seems that the `_Yhat_unlabeled` vector is not constructed from the randomly permuted indices on each trial, while the gold-standard `_Yhat` is. Essentially, no matter how many repetitions are made, the former vector stays the same.
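For concreteness, here is a minimal, self-contained sketch of the corrected trial loop. The variable names and the `ppi_mean_ci` call mirror the snippet above; the synthetic arrays are only stand-ins for the notebook's real galaxy data:

```python
import numpy as np
from tqdm import tqdm
from ppi_py import ppi_mean_ci

# Synthetic stand-ins for the notebook's arrays (illustration only)
rng = np.random.default_rng(0)
n_total = 10_000
Y_total = rng.binomial(1, 0.3, size=n_total).astype(float)
Yhat_total = np.clip(Y_total + rng.normal(0, 0.2, size=n_total), 0, 1)
ns = np.array([100, 500, 1000])
num_trials = 10
alpha = 0.1

results = []
for i in tqdm(range(ns.shape[0])):
    for j in range(num_trials):
        n = ns[i]
        # Draw a fresh random split of labeled vs. unlabeled points each trial
        rand_idx = np.random.permutation(n_total)
        _Yhat = Yhat_total[rand_idx[:n]]
        _Y = Y_total[rand_idx[:n]]
        # Use the same permutation for the unlabeled portion as well
        _Yhat_unlabeled = Yhat_total[rand_idx[n:]]
        ppi_ci = ppi_mean_ci(_Y, _Yhat, _Yhat_unlabeled, alpha=alpha)
        results.append((n, ppi_ci))
```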
You're totally right.
My guess as to why this didn't cause an error is that `Yhat_total[n:].mean()` is very close to `Yhat_total[rand_idx[n:]].mean()`: both are means over a large subset of the same data, so the resulting intervals are nearly unchanged.
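As a quick, hypothetical sanity check (not from the notebooks) of why the two means are nearly identical when `n_total` is large:

```python
import numpy as np

rng = np.random.default_rng(0)
n_total, n = 10_000, 500
Yhat_total = rng.normal(size=n_total)

rand_idx = rng.permutation(n_total)
fixed_mean = Yhat_total[n:].mean()               # buggy: ignores the permutation
permuted_mean = Yhat_total[rand_idx[n:]].mean()  # corrected: uses permuted indices

# Both are means over (n_total - n) points drawn from the same data,
# so they differ only slightly; the bug mainly removes trial-to-trial
# variation in the unlabeled set rather than visibly breaking the intervals.
print(fixed_mean, permuted_mean)
```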
This is a great find. Would you be willing to open a PR that:
- Changes this in all notebooks
- Re-runs all notebooks, from top to bottom, with the new code
Thank you so much for your help, @listar2000!
Thanks for the quick response @aangelopoulos and I'm very willing to do that. Will send a PR soon.
Addressed by #19; thank you @listar2000!