bigcode-project/bigcode-dataset

Decontaminate pretraining dataset from evaluation benchmarks

lvwerra opened this issue · 0 comments

For evaluation results to reflect the true performance of the model, the evaluation benchmarks must not be part of the training data. To that end, we want to use #15 to search the code dataset for evaluation benchmarks and remove any matches.
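To make the intended filtering concrete, here is a minimal sketch of one possible approach: drop every document that contains a benchmark snippet after whitespace and case normalization. It does not use the tool from #15; the function names (`normalize`, `decontaminate`) and the plain substring match are assumptions chosen for illustration, and a real pipeline would likely need an indexed search to scale to the full dataset.

```python
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def decontaminate(documents: Iterable[str], benchmark_snippets: Iterable[str]) -> List[str]:
    """Return only the documents that contain none of the benchmark snippets.

    Hypothetical helper: a naive substring filter standing in for a proper search tool.
    """
    # Normalize the snippets once and skip empty ones, which would match everything.
    snippets = [normalize(s) for s in benchmark_snippets]
    snippets = [s for s in snippets if s]

    clean = []
    for doc in documents:
        normalized_doc = normalize(doc)
        if not any(snippet in normalized_doc for snippet in snippets):
            clean.append(doc)
    return clean


if __name__ == "__main__":
    docs = [
        "def add(a, b):\n    return a + b",
        "def has_close_elements(numbers, threshold):\n    pass",
    ]
    benchmarks = ["def has_close_elements(numbers,  threshold):"]
    print(len(decontaminate(docs, benchmarks)))
```

Exact substring matching only catches verbatim copies (up to whitespace and case); reformatted or lightly edited benchmark solutions would need a fuzzier comparison such as n-gram overlap.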