amazon-science/cceval

Raw code projects of the curated data

ganler opened this issue · 1 comment

ganler commented

Hi, thanks for the amazing new work and congratulations on the NeurIPS acceptance!

From my current understanding, the prompts for the cceval datasets (e.g., data/crosscodeeval_data/python/line_completion{*}.jsonl) are pre-processed and thus fixed. I would like to see the actual full contexts (as if a developer could access all code within the project under development), which would broaden the use of cceval to evaluating retrieval-based techniques (e.g., self-RAG).

I found there is a metadata field for each of the line completions such as:

{'task_id': 'project_cc_python/1584',
 'repository': 'obahamonde-aiofauna-67993d2',
 'file': 'aiofauna/llm/schemas.py',
 'context_start_lineno': 0,
 'groundtruth_start_lineno': 85,
 'right_context_start_lineno': 86}
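
To illustrate how these line-number fields might be used once the raw file is available, here is a minimal sketch. It assumes the three `*_lineno` values are 0-based line indices that mark half-open ranges, which the metadata itself does not confirm. The helper name `split_by_metadata` is hypothetical, and the sketch only recovers whole lines, so it would not reproduce a prompt that ends mid-line.

```python
from typing import Dict, Tuple


def split_by_metadata(source: str, meta: Dict[str, int]) -> Tuple[str, str, str]:
    """Split a file into (left context, ground truth, right context).

    Assumption: the *_lineno fields are 0-based line indices and each
    range is half-open, i.e. [start, next_start).
    """
    lines = source.splitlines(keepends=True)
    ctx = meta["context_start_lineno"]
    gt = meta["groundtruth_start_lineno"]
    rc = meta["right_context_start_lineno"]
    left = "".join(lines[ctx:gt])          # lines before the target line(s)
    groundtruth = "".join(lines[gt:rc])    # the line(s) to be completed
    right = "".join(lines[rc:])            # everything after the target
    return left, groundtruth, right
```

With the example metadata above, this would treat lines 0 to 84 of aiofauna/llm/schemas.py as left context and line 85 as the ground truth, if the assumption holds.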

I am assuming obahamonde-aiofauna-67993d2 encodes the owner, project, and commit ID, and so identifies the exact raw project. Nonetheless, I am curious whether there are any plans to support a lower-level dataset format for cceval in which each item includes the whole structure of the project (e.g., a Docker image with all projects pre-installed, plus a pointer to each project's root path). Thanks!
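
If that assumption about the `repository` field holds, the string could be parsed roughly as follows. This is only a sketch: the `owner-project-commit` layout is my guess, not a documented format, and the names `RepoRef` and `parse_repository` are hypothetical. Because owner and project names can themselves contain hyphens, the split is ambiguous in general; here the last hyphen-separated token is taken as the commit and the first as the owner.

```python
from typing import NamedTuple


class RepoRef(NamedTuple):
    owner: str
    project: str
    commit: str


def parse_repository(repository: str) -> RepoRef:
    """Parse an assumed '<owner>-<project>-<commit>' string.

    Assumption: the commit is the final hyphen-separated token and the
    owner is the first; any hyphens in between belong to the project.
    Raises ValueError if the string has fewer than three parts.
    """
    head, commit = repository.rsplit("-", 1)
    owner, project = head.split("-", 1)
    return RepoRef(owner, project, commit)


ref = parse_repository("obahamonde-aiofauna-67993d2")
print(ref.owner, ref.project, ref.commit)
```

An owner name containing a hyphen would be misparsed by this heuristic, which is one reason an explicit raw-project format would help.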

Please email us for this. Thanks.