Can we have an example on how to use PPBoot to calculate mean ci?

Question

Closed this issue 8 months ago · 3 comments

Hi, I don't really understand how to use the PPBoot function when I want to calculate a confidence interval for the mean.

First, I don't understand why I would need an estimator function if I'm already passing the Y values.
Is Yhat just supposed to be the same length as Yhat_unlabeled?

import numpy as np

Y = np.random.normal(0, 1, 100)       # gold-standard labels
Yhat = Y + 2                          # predictions on the labeled data
Yhat_unlabeled = np.ones(10000) * 2   # predictions on the unlabeled data

num_trials = 2
results = []
for j in range(num_trials):
    # Prediction-Powered Inference
    ppi_ci = ppboot(lambda y: y, Y, Yhat, Yhat_unlabeled)
    # The call above raises:
    # ValueError: operands could not be broadcast together with shapes (10000,) (100,)

P.S. I used lambda y: y as the estimator because I would rather not run my machine learning/LLM model again to get the labels. Or must I pass the model so that it can do inference?

Answer 1 · 2024-07-25T07:16:08.000Z

The estimator argument is the estimator you would like to use: a function that maps a data array to the statistic of interest.
If you're hoping to estimate the population mean, the standard estimator would be the sample mean.

The estimator should be lambda y: y.mean(), since the sample mean is the estimator of the population mean.
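
To make the point about the estimator concrete, here is a minimal NumPy sketch. It assumes the estimator's outputs on the three arrays are added and subtracted in a prediction-powered way; the combine helper is hypothetical and written only for illustration, it is not taken from the library. Under that assumption, an identity estimator returns arrays of length 10000 and 100 that cannot be combined, while the sample mean returns scalars.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(0, 1, 100)             # gold-standard labels
Yhat = Y + 2                          # predictions on the labeled data
Yhat_unlabeled = np.ones(10000) * 2   # predictions on the unlabeled data


def combine(estimator):
    # Hypothetical helper: assumed form of a prediction-powered estimate,
    # not the library's actual code.
    return estimator(Yhat_unlabeled) + estimator(Y) - estimator(Yhat)


# The identity "estimator" returns whole arrays, so shapes (10000,) and (100,) clash.
try:
    combine(lambda y: y)
    identity_failed = False
except ValueError as err:
    identity_failed = True
    print("identity estimator fails:", err)

# The sample mean returns one number per array, so the combination is well defined.
point_estimate = combine(lambda y: y.mean())
print("point estimate:", point_estimate)
```

Note that no model is called here: the estimator only summarizes the arrays that were already passed in.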

Answer 2 · 2024-07-25T14:36:49.000Z

That works! Thanks!

However, looking at the output, there are some things I don't really understand...
The confidence interval of the mean is (-0.0032521653800733795, 0.0038037445287558255) which does not include the mean of my Yhat_unlabeled which is 2.

What's the intuition behind this? I always expect the mean of my sample to be within the confidence interval, so I thought something had gone wrong when I saw the output. I have a feeling it's related to the poor predictions on the labelled data (i.e., the difference between Y and Yhat)?
As an extension, I guess: should I ever expect to see a confidence interval that does not include the mean of the labelled data?

Answer 3 · 2024-07-31T20:43:14.000Z

It should contain the mean of Y, not the mean of Yhat.
The Yhat variable is the synthetic data (ML generated), and it doesn't have the "right" mean. The Y data represents the small gold-standard dataset. So:

It won't always contain the mean of Yhat. It should contain the mean of Y with high probability.
With probability alpha this will fail to happen, but alpha is usually set to be small (e.g. 0.1).
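
For intuition, here is a rough percentile-bootstrap sketch in plain NumPy using the data from the question. It is a simplified illustration, not the library's implementation: bootstrap_mean_ci is a hypothetical helper, the prediction-powered mean is assumed to be mean(Yhat_unlabeled) + mean(Y) - mean(Yhat), and any weighting or tuning the library may apply is omitted. Because Yhat is always off by exactly 2, the correction term removes the bias, so the interval sits around 0 rather than around 2.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(0, 1, 100)             # gold-standard labels, true mean 0
Yhat = Y + 2                          # biased predictions on the labeled data
Yhat_unlabeled = np.ones(10000) * 2   # predictions on the unlabeled data


def bootstrap_mean_ci(Y, Yhat, Yhat_unlabeled, alpha=0.1, n_resamples=1000):
    # Hypothetical helper: percentile bootstrap of the assumed estimate
    # mean(Yhat_unlabeled) + mean(Y) - mean(Yhat).
    n, N = len(Y), len(Yhat_unlabeled)
    estimates = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n, n)      # resample labeled pairs (Y, Yhat) together
        idx_u = rng.integers(0, N, N)    # resample unlabeled predictions
        estimates[b] = (
            Yhat_unlabeled[idx_u].mean() + Y[idx].mean() - Yhat[idx].mean()
        )
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lo, hi


lower, upper = bootstrap_mean_ci(Y, Yhat, Yhat_unlabeled)
print("interval:", (lower, upper))
print("contains mean of Yhat_unlabeled:", lower <= Yhat_unlabeled.mean() <= upper)
```

In this toy setup the prediction error is constant, so every bootstrap replicate is essentially 0 and the interval collapses; with noisier predictions the interval would be wider but would still be centered on the bias-corrected estimate, not on the mean of Yhat.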