Algorithms Dataset

Dataset Description

The Algorithms dataset is a collection of human-written code for 4,680 competitive programming problems spanning 28 algorithm categories. It is unique in being the first code dataset collected specifically to evaluate the ability of language models to generate code across different algorithmic tasks, making it a valuable resource for code intelligence research. The dataset includes 75,259 correct solutions in the Python programming language and can be used to test and improve the accuracy of language models at generating code for a wide range of algorithmic problems.

Download the Algorithms dataset here. (~X.X GB)

Languages

The dataset contains questions in English and code solutions in Python.

How to use it

One way to load the dataset is with the Hugging Face datasets library, assuming it is published on the Hub; the path below is a placeholder until the official identifier is released.
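
```python
from datasets import load_dataset

# Placeholder Hub path -- substitute the actual dataset identifier once released.
ds = load_dataset("<org>/Algorithms")

print(ds)              # available splits and row counts
print(ds["train"][0])  # inspect a single record
```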

Data Fields

The exact schema will be documented with the paper. An illustrative record layout, inferred from the dataset description above, is sketched below; the field names are assumptions, not the confirmed schema.
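
```python
# Illustrative record layout; field names are assumptions inferred
# from the dataset description, not the confirmed schema.
example = {
    "problem": "...",       # English problem statement
    "category": "...",      # one of the 28 algorithm categories
    "solutions": ["..."],   # correct human-written Python solutions
    "test_cases": [         # input/output pairs, where available
        {"input": "...", "output": "..."},
    ],
}
```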

Data Splits

The distribution of data across the different algorithm categories in both the training and test sets:

Algorithm Category     Training Set Count    Test Set Count
Dichotomy              pass                  pass
Shortest Path          pass                  pass
Greedy                 pass                  pass
Dynamic Programming    pass                  pass
...                    ...                   ...
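
As a sanity check, the per-category counts can be recomputed from the loaded splits. This sketch assumes the hypothetical "category" field from the Data Fields section:

```python
from collections import Counter

# Recompute per-category counts for each split, assuming the
# hypothetical "category" field sketched under Data Fields.
for split in ("train", "test"):
    counts = Counter(example["category"] for example in ds[split])
    for category, count in counts.most_common():
        print(f"{split:5s}  {category:25s}  {count}")
```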

Dataset Statistics

  • 4,680 coding problems
  • 28 algorithm categories
  • 75,259 correct solutions
  • XXX test cases
  • all problems have at least one test case, except XXX samples in the train split
  • in the test split, the average number of test cases per problem is XXX
  • the average problem statement is 504.7 words long
  • the average correct human-written solution is 282.9 words long
  • all problems have ground-truth solutions, except XXX samples in the test split
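
The word-count statistics above can be reproduced with a short script, again assuming the hypothetical field names from the Data Fields section:

```python
# Recompute average word counts, assuming the hypothetical
# "problem" and "solutions" fields sketched under Data Fields.
def avg_words(texts):
    texts = list(texts)
    return sum(len(t.split()) for t in texts) / len(texts)

train = ds["train"]
print("avg problem length (words):",
      avg_words(ex["problem"] for ex in train))
print("avg solution length (words):",
      avg_words(code for ex in train for code in ex["solutions"]))
```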

Dataset Creation

To create the Algorithms dataset, we collected problems with algorithm tags, together with their corresponding solutions and test cases, from open online programming platforms such as Codeforces, HackerEarth, and Nowcoder. The collected problems were then consolidated and categorized into 28 algorithm categories, such as DFS and binary search, and every problem was reclassified according to this taxonomy. For more details, please refer to our forthcoming paper.