Recent advances in machine learning have benefited a number of code-related tasks, such as code translation, code summarization, and code synthesis. Open-source repository websites like GitHub provide enormous amounts of source code, which has enabled the training of large-scale code language models such as CodeBERT (Feng et al., 2020), PLBART (Ahmad et al., 2021a), TransCoder (Roziere et al., 2020), and CodeT5 (Wang et al., 2021). Although open-source code data is abundant, it has several disadvantages as training data for code-related models. First, most of the available code is unlabeled, whereas tasks like code translation, code summarization, and code synthesis require high-quality parallel data for model training.
We introduce XLCoST, a machine learning benchmark dataset that contains fine-grained parallel data in 7 commonly used programming languages (C++, Java, Python, C#, Javascript, PHP, C) and natural language (English). The data is parallel across the 7 languages at both the code snippet level and the program level: given a program in one language, the dataset contains the same program in up to 6 other programming languages. Each program is divided into several code snippets, and the programs in all languages are aligned at the snippet level. Moreover, each snippet is accompanied by a comment, and the comment for a particular snippet is the same across all languages. Please find the full paper here.
The figure below shows a schematic diagram of how the dataset is organised and the possible tasks that can be performed with it.
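Concretely, one problem in the snippet-aligned corpus can be pictured as the record below. This is an illustrative in-memory view only; the field names and snippet text are hypothetical (see the data documentation for the actual file format):

```python
# Illustrative only: a hypothetical in-memory view of one XLCoST problem.
# Each problem has the same program in several languages, split into snippets
# that are aligned by index, plus one shared comment per snippet position.
record = {
    "description": "Check whether a number is even",
    "comments": ["Function to check parity", "Driver code"],
    "snippets": {
        "Python": [
            "def is_even(n):\n    return n % 2 == 0",
            "print(is_even(4))",
        ],
        "Java": [
            "static boolean isEven(int n) { return n % 2 == 0; }",
            "public static void main(String[] a) { System.out.println(isEven(4)); }",
        ],
    },
}

# Snippet i in one language is parallel to snippet i in every other language,
# and comments[i] describes all of them.
for snippets in record["snippets"].values():
    assert len(snippets) == len(record["comments"])
```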
We introduce the following 10 cross-lingual tasks. All tasks have pairwise data at both the snippet level and the program level in 7 programming languages: C++, Java, Python, C#, Javascript, PHP, and C. The tasks fall into two categories, generation and retrieval. The generation tasks are Code Translation, Code Summarization, and Code Synthesis; the retrieval tasks are NL (natural language) Code Search and XL (cross-lingual) Code Search. We use 3 state-of-the-art baselines for the generation tasks and 2 for the retrieval tasks.
| Category | Task | Subtask | Data (train/val/test) | Description | Baselines |
|---|---|---|---|---|---|
| Generation | Code-to-Code | Snippet Translation | 872K/47K/83K | Translate code snippets across programming languages | CodeBERT (enc-dec), PLBART, CodeT5 |
| | | Program Translation | 106K/6K/11K | Translate programs across programming languages | |
| | Code-to-Text | Snippet Summarization | 446K/22K/41K | Generate a comment for a given code snippet | |
| | | Program Summarization | 50K/3K/5K | Generate a problem description for a given program | |
| | Text-to-Code | Snippet Synthesis | 446K/22K/41K | Generate a code snippet from a given comment | |
| | | Program Synthesis | 50K/3K/5K | Generate a program from a given problem description and comments | |
| Retrieval | NL Code Search | Comment-to-Snippet Search | 446K/22K/41K | Retrieve the code snippet for a given comment | RoBERTa, CodeBERT |
| | | Problem-to-Program Search | 50K/3K/5K | Retrieve the program for a given problem description | |
| | XL Code Search | Snippet-to-Snippet Search | 872K/47K/83K | Retrieve code snippets in other languages for a given snippet | |
| | | Program-to-Program Search | 106K/6K/11K | Retrieve programs in other languages for a given program | |
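
Because snippets are index-aligned across languages and each index carries one shared comment, pairwise examples for the generation tasks can be derived mechanically from a single aligned problem. A minimal sketch with made-up snippet text (not actual dataset content):

```python
# Hypothetical aligned problem: snippet i is parallel across languages,
# and comments[i] describes snippet i in every language.
comments = ["Function to check parity", "Driver code"]
snippets = {
    "Python": ["def is_even(n): return n % 2 == 0", "print(is_even(4))"],
    "C++": ["bool isEven(int n) { return n % 2 == 0; }",
            "int main() { cout << isEven(4); }"],
}

# Snippet Translation: (source snippet, target snippet) for a language pair.
translation_pairs = list(zip(snippets["Python"], snippets["C++"]))

# Snippet Summarization: (snippet, comment); Snippet Synthesis is the reverse.
summarization_pairs = list(zip(snippets["Python"], comments))
synthesis_pairs = [(comment, code) for code, comment in summarization_pairs]
```

Program-level pairs are formed the same way, with snippets concatenated into whole programs and comments concatenated alongside the problem description.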
Use the requirements.txt file to set up your environment.
Code for this repository has been adapted from CodeXGLUE and PLBART.
Instructions to run the generation tasks can be found here.
Instructions to run the code search tasks can be found here.
The data can be downloaded here.
Details about the data files and metadata can be found here.
Some basic averaged statistics of the dataset are presented below. "#" means number. #comments/program is the same as #snippets/program. (Py is short for Python; JS for Javascript; TOK for tokens; SN for snippets; PR for programs; com for comments;)
Statistic | C++ | Java | C# | Python | JS | PHP | C | Avg
---|---|---|---|---|---|---|---|---
# tokens/snippet | 21.52 | 24.1 | 21.63 | 23.06 | 22.52 | 28.14 | 25.37 | 22.83 |
# tokens/program | 204.97 | 227.09 | 188.54 | 215.29 | 184.63 | 163.51 | 197.95 | 201.96 |
# tokens/comment | 8.25 | 8.14 | 7.97 | 8.23 | 7.96 | 8.45 | 9.67 | 8.15 |
# tokens/desc | 10.68 | 10.67 | 10.75 | 10.7 | 10.87 | 9.91 | 8.19 | 10.66 |
# snippet/program | 9.52 | 9.42 | 8.51 | 9.33 | 8.2 | 5.81 | 7.77 | 8.81 |
# lines/snippet | 3.41 | 3.71 | 2.41 | 3.82 | 3.23 | 4 | 4.05 | 3.37 |
# lines/program | 32.45 | 34.93 | 20.54 | 35.64 | 26.47 | 23.23 | 31.5 | 29.71 |
total snippets | 106,397 | 103,703 | 92,446 | 100,032 | 81,511 | 20,639 | 4,363 | - |
total programs | 11,198 | 11,028 | 10,622 | 10,735 | 9,951 | 3,553 | 574 | - |
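
The Avg column is consistent with averaging over all items pooled across languages (i.e., weighting each language by its total count), rather than a simple mean of the row. A quick check for tokens per snippet, using the numbers from the table:

```python
# Per-language mean tokens/snippet and total snippet counts, in table order
# (C++, Java, C#, Python, JS, PHP, C).
tokens_per_snippet = [21.52, 24.1, 21.63, 23.06, 22.52, 28.14, 25.37]
total_snippets = [106_397, 103_703, 92_446, 100_032, 81_511, 20_639, 4_363]

# Weighted average: total tokens across all snippets / total snippet count.
weighted_avg = sum(t * n for t, n in zip(tokens_per_snippet, total_snippets))
weighted_avg /= sum(total_snippets)
print(round(weighted_avg, 2))  # 22.83, matching the Avg column
```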
The number of pairwise code-code examples in the training, validation, and test splits for each language pair is presented in the following table. The upper triangle shows the number of parallel code snippets, and the lower triangle shows the number of parallel programs. This data is used for the Code Translation and XL Code Search tasks. (JS is short for Javascript.)
Code-Code Pairs | Split | C++ | Java | Python | C# | JS | PHP | C
---|---|---|---|---|---|---|---|---
C++ | train | | 89,040 | 80,100 | 85,662 | 69,507 | 17,811 | 3,386
C++ | val | | 4,419 | 3,913 | 4,408 | 3,808 | 923 | 352
C++ | test | | 8,059 | 7,228 | 7,922 | 6,965 | 1,647 | 222
Java | train | 9,450 | | 77,759 | 87,065 | 69,341 | 17,853 | 2,996
Java | val | 490 | | 3,938 | 4,437 | 3,826 | 929 | 353
Java | test | 901 | | 7,259 | 8,011 | 7,005 | 1,672 | 238
Python | train | 9,139 | 8,991 | | 75,843 | 67,219 | 17,616 | 2,478
Python | val | 468 | 471 | | 3,922 | 3,750 | 923 | 311
Python | test | 878 | 882 | | 7,215 | 6,861 | 1,655 | 203
C# | train | 9,187 | 9,301 | 8,826 | | 68,093 | 17,873 | 2,958
C# | val | 488 | 491 | 470 | | 3,826 | 928 | 352
C# | test | 890 | 898 | 877 | | 6,961 | 1,668 | 238
JS | train | 8,482 | 8,470 | 8,182 | 8,367 | | 17,117 | 1,875
JS | val | 472 | 475 | 459 | 475 | | 921 | 309
JS | test | 878 | 881 | 864 | 877 | | 1,617 | 200
PHP | train | 3,056 | 3,68 | 3,003 | 3,071 | 2,971 | | 856
PHP | val | 157 | 158 | 153 | 158 | 157 | | 271
PHP | test | 303 | 307 | 304 | 307 | 302 | | 183
C | train | 402 | 409 | 380 | 394 | 308 | 170 |
C | val | 59 | 59 | 59 | 59 | 59 | 55 |
C | test | 45 | 49 | 48 | 49 | 49 | 43 |
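
To read the table: for any language pair, the upper-triangle entry counts parallel snippets and the lower-triangle entry counts parallel programs. A small lookup sketch using only the C++/Java training numbers from the table:

```python
# Training-split entries for the C++/Java pair, taken from the table above.
# Keys follow the table layout: (row language, column language).
train = {
    ("C++", "Java"): 89_040,  # upper triangle: parallel snippets
    ("Java", "C++"): 9_450,   # lower triangle: parallel programs
}

TABLE_ORDER = ("C++", "Java", "Python", "C#", "JS", "PHP", "C")

def pair_counts(a, b):
    """Return (snippet_pairs, program_pairs) for an unordered language pair."""
    a, b = sorted((a, b), key=TABLE_ORDER.index)  # a is the earlier column
    return train[(a, b)], train[(b, a)]

print(pair_counts("Java", "C++"))  # (89040, 9450)
```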
The number of pairwise code-text examples in each language is presented in the table below. "Snippet" refers to snippet-comment pairs, and "Program" refers to program-description (problem description) pairs. This data is used for the Code Summarization (Code-to-Text), Code Synthesis (Text-to-Code), and NL Code Search tasks.
NL-Code Pairs | Split | C++ | Java | Python | C# | JS | PHP | C | Total
---|---|---|---|---|---|---|---|---|---
Snippet | train | 93,847 | 91,089 | 81,207 | 87,583 | 70,649 | 18,027 | 3,763 | 446,165
Snippet | valid | 4,432 | 4,460 | 3,946 | 4,436 | 3,829 | 930 | 350 | 22,383
Snippet | test | 8,118 | 8,154 | 7,293 | 8,013 | 7,033 | 1,682 | 250 | 40,543
Program | train | 9,797 | 9,623 | 9,263 | 9,345 | 8,590 | 3,087 | 463 | 50,168
Program | valid | 492 | 494 | 472 | 491 | 475 | 158 | 60 | 2,642
Program | test | 909 | 911 | 887 | 899 | 886 | 308 | 51 | 4,851
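
The Total column is simply the sum of the per-language counts; for example, for the snippet-level training pairs:

```python
# Per-language snippet-comment training pairs from the table above
# (C++, Java, Python, C#, JS, PHP, C).
snippet_train = [93_847, 91_089, 81_207, 87_583, 70_649, 18_027, 3_763]
print(sum(snippet_train))  # 446165, the Total column
```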
With the release of this dataset, we hope to enable more research in the domain of deep learning for software engineering tasks. We believe this dataset is a valuable asset for the research community and can benefit a number of code-related research problems.
If you use this dataset in your work, please consider citing us. The arXiv version of the paper can be found here.
@misc{zhu2022xlcost,
  title = {XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence},
  url = {https://arxiv.org/abs/2206.08474},
  author = {Zhu, Ming and Jain, Aneesh and Suresh, Karthik and Ravindran, Roshan and Tipirneni, Sindhu and Reddy, Chandan K.},
  year = {2022},
  eprint = {2206.08474},
  archivePrefix = {arXiv}
}