aangelopoulos/ppi_py

possible issue with data splitting in all notebook examples

Closed this issue · 3 comments

Hi, I have a question about how data splitting is done in the example notebooks. The pattern is easy to spot, so I'm not sure whether it's a bug or intentional. Take galaxies.ipynb as an example (I believe this snippet is reused across the other notebooks):

# Run prediction-powered inference and classical inference for many values of n
results = []
for i in tqdm(range(ns.shape[0])):
    for j in range(num_trials):
        # Prediction-Powered Inference
        n = ns[i]
        rand_idx = np.random.permutation(n_total)
        _Yhat = Yhat_total[rand_idx[:n]]
        _Y = Y_total[rand_idx[:n]]
        #  _Yhat_unlabeled = Yhat_total[rand_idx[n:]] --> possible correction
        _Yhat_unlabeled = Yhat_total[n:] # --> this is not using the permuted indices?

        ppi_ci = ppi_mean_ci(_Y, _Yhat, _Yhat_unlabeled, alpha=alpha)

It seems that the _Yhat_unlabeled vector is not constructed from the randomly permuted indices on each trial, while the gold-standard _Y and _Yhat are. So no matter how many repetitions are run, the unlabeled vector stays exactly the same.
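For reference, here is a minimal self-contained sketch of the corrected split (synthetic data, not from the notebooks): both the labeled and unlabeled slices are taken from the same permutation, so each trial draws a fresh, disjoint partition of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_total, n = 1000, 100
Y_total = rng.normal(size=n_total)                          # stand-in gold labels
Yhat_total = Y_total + rng.normal(scale=0.1, size=n_total)  # stand-in predictions

rand_idx = rng.permutation(n_total)
_Y = Y_total[rand_idx[:n]]                   # gold labels on the labeled split
_Yhat = Yhat_total[rand_idx[:n]]             # predictions on the labeled split
_Yhat_unlabeled = Yhat_total[rand_idx[n:]]   # predictions on the unlabeled split

# The two splits are disjoint and together cover the whole dataset.
assert len(_Yhat) + len(_Yhat_unlabeled) == n_total
assert set(rand_idx[:n]).isdisjoint(rand_idx[n:])
```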

You're totally right.
My guess as to why this didn't cause an error is that $N$ is very large, so

Yhat_total[n:].mean() $\approx$ Yhat_total[rand_idx[n:]].mean()
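A quick numerical check (a sketch with synthetic data, not the notebooks' arrays) illustrates why the bug went unnoticed: when the total sample is much larger than $n$, the un-permuted tail and the permuted tail each drop only $n$ of $N$ points, so their means nearly coincide.

```python
import numpy as np

rng = np.random.default_rng(42)
n_total, n = 100_000, 100
Yhat_total = rng.normal(loc=1.0, size=n_total)  # stand-in predictions

rand_idx = rng.permutation(n_total)
fixed_tail_mean = Yhat_total[n:].mean()               # buggy: un-permuted tail
permuted_tail_mean = Yhat_total[rand_idx[n:]].mean()  # corrected tail

# Each tail excludes only n of n_total points, so the gap is tiny.
print(abs(fixed_tail_mean - permuted_tail_mean))
```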

This is a great find. Would you be willing to open a PR that

  1. Changes this in all notebooks
  2. Re-runs all notebooks, from top to bottom, with the new code

Thank you so much for your help, @listar2000!

Thanks for the quick response @aangelopoulos, and I'd be happy to do that. Will send a PR soon.

Addressed by #19. Thank you @listar2000!