Algorithms Dataset

Dataset Description

The Algorithms dataset is a collection of human-written code for 4,680 competitive programming problems spanning 28 algorithm categories. It is unique in being the first code dataset collected specifically to evaluate the ability of language models to generate code across different algorithmic tasks, making it a valuable resource for code intelligence research. The dataset includes 75,259 correct solutions in the Python programming language and can be used to test and improve the accuracy of language models at generating code for a wide range of algorithmic problems.

Download the Algorithms dataset here. (~X.X GB)

Languages

The dataset contains questions in English and code solutions in Python.

How to use it

One way to load the dataset is with the Hugging Face datasets library, assuming it is published on the Hub; the path below is a placeholder until the official identifier is released.
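
```python
from datasets import load_dataset

# Placeholder Hub path -- substitute the actual dataset identifier once released.
ds = load_dataset("<org>/Algorithms")

print(ds)              # available splits and row counts
print(ds["train"][0])  # inspect a single record
```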

Data Fields

The exact schema will be documented with the paper. An illustrative record layout, inferred from the dataset description above, is sketched below; the field names are assumptions, not the confirmed schema.
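
```python
# Illustrative record layout; field names are assumptions inferred
# from the dataset description, not the confirmed schema.
example = {
    "problem": "...",       # English problem statement
    "category": "...",      # one of the 28 algorithm categories
    "solutions": ["..."],   # correct human-written Python solutions
    "test_cases": [         # input/output pairs, where available
        {"input": "...", "output": "..."},
    ],
}
```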

Data Splits

The distribution of data across the different algorithm categories in both the training and test sets:

Algorithm Category     Training Set Count    Test Set Count
Dichotomy              pass                  pass
Shortest Path          pass                  pass
Greedy                 pass                  pass
Dynamic Programming    pass                  pass
...                    ...                   ...
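
As a sanity check, the per-category counts can be recomputed from the loaded splits. This sketch assumes the hypothetical "category" field from the Data Fields section:

```python
from collections import Counter

# Recompute per-category counts for each split, assuming the
# hypothetical "category" field sketched under Data Fields.
for split in ("train", "test"):
    counts = Counter(example["category"] for example in ds[split])
    for category, count in counts.most_common():
        print(f"{split:5s}  {category:25s}  {count}")
```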

Dataset Statistics

  • 4,680 coding problems
  • 28 algorithm categories
  • 75,259 correct solutions
  • XXX test cases
  • all problems have at least one test case, except XXX samples in the train split
  • in the test split, the average number of test cases per problem is XXX
  • the average problem statement is 504.7 words long
  • the average correct human-written solution is 282.9 words long
  • all problems have ground-truth solutions, except XXX samples in the test split
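
The word-count statistics above can be reproduced with a short script, again assuming the hypothetical field names from the Data Fields section:

```python
# Recompute average word counts, assuming the hypothetical
# "problem" and "solutions" fields sketched under Data Fields.
def avg_words(texts):
    texts = list(texts)
    return sum(len(t.split()) for t in texts) / len(texts)

train = ds["train"]
print("avg problem length (words):",
      avg_words(ex["problem"] for ex in train))
print("avg solution length (words):",
      avg_words(code for ex in train for code in ex["solutions"]))
```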

Dataset Creation

To create the Algorithms dataset, we collected problems with algorithm tags, together with their corresponding solutions and test cases, from open online programming platforms such as Codeforces, HackerEarth, and Nowcoder. The collected problems were then consolidated and categorized into 28 algorithm categories, such as DFS and binary search, and every problem was reclassified according to this taxonomy. For more details, please refer to our forthcoming paper.