Raw code projects of the curated data
ganler opened this issue · 1 comment
Hi, thanks for the amazing new work and congratulations on the NeurIPS acceptance!
From my current understanding, the prompts in the cceval datasets (e.g., data/crosscodeeval_data/python/line_completion{*}.jsonl) are pre-processed and thus fixed. I am interested in seeing the actual full contexts (as if a developer could access all code within the project under development), which would broaden the use of cceval to evaluating retrieval-based techniques (e.g., self-RAG).
I found that each line-completion item has a metadata field, such as:
{'task_id': 'project_cc_python/1584',
'repository': 'obahamonde-aiofauna-67993d2',
'file': 'aiofauna/llm/schemas.py',
'context_start_lineno': 0,
'groundtruth_start_lineno': 85,
'right_context_start_lineno': 86}
I am assuming obahamonde-aiofauna-67993d2 encodes the owner, project, and commit ID, and thus identifies the exact raw project. Nonetheless, I am curious whether there are any plans to support a lower-level dataset format for cceval in which each item includes the whole structure of the project (e.g., a Docker image could pre-install all projects and point to each project's root path). Thanks!
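For illustration, here is a minimal sketch of how the repository field might be split under the assumption above. The "<owner>-<project>-<short commit hash>" layout is only my guess and is not confirmed by the dataset documentation. RepoId and parse_repository are hypothetical names, not part of cceval. Owners and projects can themselves contain hyphens, so the split is ambiguous in general.

```python
from typing import NamedTuple


class RepoId(NamedTuple):
    """Hypothetical container for the pieces of the 'repository' field."""
    owner: str
    project: str
    commit: str


def parse_repository(repo_field: str) -> RepoId:
    # ASSUMPTION: the field looks like "<owner>-<project>-<short commit hash>".
    # The last hyphen-separated token is taken as the commit prefix.
    head, _, commit = repo_field.rpartition("-")
    # ASSUMPTION: the owner contains no hyphen, so the first hyphen splits
    # owner from project. This is a heuristic and can misparse real names.
    owner, _, project = head.partition("-")
    if not (owner and project and commit):
        raise ValueError(f"unexpected repository field: {repo_field!r}")
    return RepoId(owner, project, commit)


# Metadata copied from the example item in this issue.
metadata = {
    "task_id": "project_cc_python/1584",
    "repository": "obahamonde-aiofauna-67993d2",
    "file": "aiofauna/llm/schemas.py",
    "context_start_lineno": 0,
    "groundtruth_start_lineno": 85,
    "right_context_start_lineno": 86,
}

repo = parse_repository(metadata["repository"])
print(repo.owner, repo.project, repo.commit)
```

If the guess holds, these three parts would be enough to locate the original project, but an official mapping from the maintainers would avoid the ambiguity.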
Please email us for this. Thanks.