openai/human-eval

pass@k on filtered samples

henryhungle opened this issue · 0 comments

Hi,

Thank you for the great work!

I have 2 questions about the computation of the pass@k metric after applying filtering on the APPS benchmark.

  1. Does the total array in the code snippet below hold, for each problem, the number of filtered samples, i.e. the samples that passed the example test cases from the problem statement, so that each entry is <= N_original_samples (= 1000)?

    total = np.array(total)

  2. When the number of filtered samples for a problem is less than k (k = 1 or 5), how do you compute the pass@k metric? For example, when N_filtered_samples = 1 and k = 5, can we treat the execution results as 4 failures plus 1 pass or failure, depending on whether that one filtered sample passes the final unit tests?
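
To make the second question concrete, here is a minimal sketch of the padding interpretation described above. It assumes the standard unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number that pass the unit tests. The function names `pass_at_k` and `pass_at_k_padded` are hypothetical and are not taken from this repository; the padding rule is only the assumption I am asking about, not a confirmed method.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator 1 - C(n-c, k) / C(n, k); only defined for n >= k."""
    if n < k:
        raise ValueError("estimator is undefined when n < k")
    # math.comb returns 0 when k > n - c, so the result is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_padded(n: int, c: int, k: int) -> float:
    """Hypothetical rule: if fewer than k samples survive filtering,
    pad with failing samples until there are k of them."""
    n_eff = max(n, k)  # padding adds failures only, so c is unchanged
    return pass_at_k(n_eff, c, k)


# One filtered sample, k = 5: the result depends only on that sample.
print(pass_at_k_padded(1, 1, 5))  # the single sample passes
print(pass_at_k_padded(1, 0, 5))  # the single sample fails
```

Under this assumption, a problem with one filtered sample contributes either 1 or 0 to pass@5, depending on whether that sample passes the final unit tests.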