amazon-science/cceval

Regarding relevant data to benchmark the performance of the retriever

anmolagarwal999 opened this issue · 1 comment

The paper claims that "CROSSCODEEVAL can also be used to measure the capability of code retrievers." However, none of the files in cross_coder_eval_repo/data/crosscodeeval_data/python/ (i.e., line_completion.jsonl, line_completion_oracle_bm25.jsonl, and line_completion_rg1_bm25.jsonl) seem to contain information about the repository and commit ID they were taken from. I want to benchmark a retrieval algorithm R by testing its performance over the entire repository, but there does not seem to be data for this.
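
For reference, here is a minimal sketch of how I am checking the available fields (the path is from my local checkout of the released data; I am not assuming anything about which field names are present):

```python
import json

# Inspect the top-level keys of the first record in one of the released JSONL
# files to confirm whether repository / commit metadata is present.
path = "cross_coder_eval_repo/data/crosscodeeval_data/python/line_completion.jsonl"

with open(path) as f:
    first_record = json.loads(f.readline())

print(sorted(first_record.keys()))
```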

Also, Section 2.2 of the paper mentions: "First, we find all intra-project imports in the original file. Next, an empty class is created for each imported name to replace the import statement. Since the imported name now refers to an empty class, any subsequent call to its member function or attribute will raise an undefined name error. We leverage static analysis to catch such errors in the modified file, which precisely correspond to the names in the original file that can only be resolved by cross-file context". Would it be correct to say that the file whose import was replaced with an empty class is the one containing the ground-truth context relevant for the generation task? Has the data regarding this (i.e., which file imports were removed) been provided somewhere?
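
For context, here is a rough sketch of how I understood that procedure. This is my own illustration, not the authors' code; the `project_modules` argument and the restriction to `from ... import ...` statements are my assumptions:

```python
import ast

def stub_intra_project_imports(source: str, project_modules: set[str]) -> str:
    """Replace intra-project `from ... import ...` statements with empty class
    stubs, roughly mirroring the procedure described in Section 2.2.
    `project_modules` is assumed to be the set of module names local to the
    repository; plain `import pkg.mod` statements would need separate handling."""
    tree = ast.parse(source)
    new_body = []
    for node in tree.body:
        is_intra_project = (
            isinstance(node, ast.ImportFrom)
            and (node.level > 0  # relative import => within the project
                 or (node.module or "").split(".")[0] in project_modules)
        )
        if is_intra_project:
            # Each imported name now resolves to an empty class, so any later
            # access to its attributes should be flagged by static analysis.
            for alias in node.names:
                name = alias.asname or alias.name
                new_body.append(ast.parse(f"class {name}: pass").body[0])
        else:
            new_body.append(node)
    return ast.unparse(ast.Module(body=new_body, type_ignores=[]))
```

One could then run a static analyzer such as pyflakes over the returned source and collect the reported errors, which (as I read the paper) correspond to the names that can only be resolved with cross-file context.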

cc @wasiahmad @zijwang

Please email us for the raw data.