pass@k on filtered samples
henryhungle opened this issue · 0 comments
Hi,
Thank you for the great work!
I have two questions about how the pass@k metric is computed after filtering samples on the APPS benchmark.
- Will the `total` array in the code snippet below contain the numbers of filtered samples that passed the example test cases (from the problem statement), i.e. is each number <= N_original_samples (= 1000)?
  human-eval/human_eval/evaluation.py, line 85 in 312c5e5
- When the number of filtered samples is less than k (k in [1, 5]), how do you compute the pass@k metric? For example, when N_filtered_samples = 1 and k = 5, can we assume execution results of 4 failures plus 1 pass or failure (depending on the final unit test result of this filtered sample)?
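To make the second question concrete, here is a minimal sketch of what I have in mind. It uses the standard unbiased estimator pass@k = 1 - C(n - c, k) / C(n, k), where n is the number of samples and c the number of correct ones. The function names `pass_at_k` and `pass_at_k_padded` are my own, and the padding rule (treating missing samples as failures) is only the assumption I am asking about, not something I know the evaluation code does.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of samples, c: number of correct samples.
    The estimate is undefined when n < k, because C(n, k) is 0.
    """
    if n < k:
        raise ValueError("pass@k is undefined when n < k")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_padded(n: int, c: int, k: int) -> float:
    """Hypothetical variant (assumption, not confirmed by the repo):
    if fewer than k filtered samples remain, pad with failures up to k."""
    n_eff = max(n, k)  # the padded samples all count as failures
    return pass_at_k(n_eff, c, k)


# N_filtered_samples = 1, k = 5, under the padding assumption:
print(pass_at_k_padded(1, 1, 5))  # the single filtered sample passed
print(pass_at_k_padded(1, 0, 5))  # the single filtered sample failed
```

Under this assumption, pass@5 for a problem with one filtered sample reduces to whether that one sample passes the final unit tests. Is that the intended behavior?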