aangelopoulos/ppi_py

possible issue with data splitting in all notebook examples

Closed this issue · 3 comments

Hi, I have a question about how data splitting is done in the example notebooks. The pattern is easy to spot, so I'm not sure whether it's a bug or intentional. Take galaxies.ipynb as an example (I believe this snippet is reused across the other notebooks):

# Run prediction-powered inference and classical inference for many values of n
results = []
for i in tqdm(range(ns.shape[0])):
    for j in range(num_trials):
        # Prediction-Powered Inference
        n = ns[i]
        rand_idx = np.random.permutation(n_total)
        _Yhat = Yhat_total[rand_idx[:n]]
        _Y = Y_total[rand_idx[:n]]
        #  _Yhat_unlabeled = Yhat_total[rand_idx[n:]] --> possible correction
        _Yhat_unlabeled = Yhat_total[n:] # --> this is not using the permuted indices?

        ppi_ci = ppi_mean_ci(_Y, _Yhat, _Yhat_unlabeled, alpha=alpha)

It seems that the _Yhat_unlabeled vector is not constructed from the randomly permuted indices on each trial, while the gold-standard _Y and _Yhat are. So no matter how many repetitions are run, the unlabeled vector stays exactly the same.
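For reference, here is a minimal self-contained sketch of the corrected split (synthetic data, not from the notebooks): both the labeled and unlabeled slices are taken from the same permutation, so each trial draws a fresh, disjoint partition of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_total, n = 1000, 100
Y_total = rng.normal(size=n_total)                          # stand-in gold labels
Yhat_total = Y_total + rng.normal(scale=0.1, size=n_total)  # stand-in predictions

rand_idx = rng.permutation(n_total)
_Y = Y_total[rand_idx[:n]]                   # gold labels on the labeled split
_Yhat = Yhat_total[rand_idx[:n]]             # predictions on the labeled split
_Yhat_unlabeled = Yhat_total[rand_idx[n:]]   # predictions on the unlabeled split

# The two splits are disjoint and together cover the whole dataset.
assert len(_Yhat) + len(_Yhat_unlabeled) == n_total
assert set(rand_idx[:n]).isdisjoint(rand_idx[n:])
```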

You're totally right.
My guess as to why this didn't cause an error is that $N$ is very large, so

Yhat_total[n:].mean() $\approx$ Yhat_total[rand_idx[n:]].mean()
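A quick numerical check (a sketch with synthetic data, not the notebooks' arrays) illustrates why the bug went unnoticed: when the total sample is much larger than $n$, the un-permuted tail and the permuted tail each drop only $n$ of $N$ points, so their means nearly coincide.

```python
import numpy as np

rng = np.random.default_rng(42)
n_total, n = 100_000, 100
Yhat_total = rng.normal(loc=1.0, size=n_total)  # stand-in predictions

rand_idx = rng.permutation(n_total)
fixed_tail_mean = Yhat_total[n:].mean()               # buggy: un-permuted tail
permuted_tail_mean = Yhat_total[rand_idx[n:]].mean()  # corrected tail

# Each tail excludes only n of n_total points, so the gap is tiny.
print(abs(fixed_tail_mean - permuted_tail_mean))
```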

This is a great find. Would you be willing to open a PR that

  1. Changes this in all notebooks
  2. Re-runs all notebooks, from top to bottom, with the new code

Thank you so much for your help, @listar2000!

Thanks for the quick response @aangelopoulos, and I'd be happy to do that. Will send a PR soon.

Addressed by #19. Thank you @listar2000!