Citation for the LeetCode Dataset
I'm curious what the citation for the LeetCode dataset is, and how the dataset was built.
Hello Jose!
The current LeetCode dataset is a bit problematic. I sourced the solutions from a Hugging Face repo that no longer exists. After careful analysis, I found that some solutions were incorrect, which makes evaluation quite flaky.
Anyway, the process I used to convert the solutions into a MultiPL-E eval was the following:
- I identified the root function for each solution by generating a control graph and picking the function with no dependents. If there was more than one function with no dependents, I discarded the whole item.
- I transplanted all helper functions and made them local functions to the root function.
- I generated unit tests for the root function using GPT-4.
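For readers who want to try something similar, here is a minimal sketch of the first two steps. It is not the actual script used for the dataset: it assumes the solutions are plain top-level Python functions (no `Solution` class), it approximates the graph with a name-based call graph built with the standard `ast` module, and `find_root_and_nest` is a hypothetical name chosen for illustration.

```python
import ast


def find_root_and_nest(source: str):
    """Return (root_name, new_source) with helpers nested inside the root
    function, or None if there is not exactly one root function.

    Hypothetical helper for illustration; assumes top-level functions only.
    """
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}

    # Collect every top-level function that is called by another one.
    called = set()
    for name, node in funcs.items():
        for sub in ast.walk(node):
            if (isinstance(sub, ast.Call)
                    and isinstance(sub.func, ast.Name)
                    and sub.func.id in funcs
                    and sub.func.id != name):  # ignore direct recursion
                called.add(sub.func.id)

    # A root is a function that no other function depends on.
    roots = [name for name in funcs if name not in called]
    if len(roots) != 1:
        return None  # ambiguous item: discard it

    root = funcs[roots[0]]
    helpers = [f for name, f in funcs.items() if name != root.name]

    # Keep the root's docstring (if any) as the first statement.
    insert_at = 1 if ast.get_docstring(root) is not None else 0
    root.body[insert_at:insert_at] = helpers

    # Drop the helpers from module level; they now live inside the root.
    tree.body = [
        n for n in tree.body
        if not (isinstance(n, ast.FunctionDef) and n is not root)
    ]
    return root.name, ast.unparse(tree)  # ast.unparse needs Python 3.9+


example = '''
def helper(x):
    return x + 1

def solve(xs):
    return [helper(x) for x in xs]
'''
result = find_root_and_nest(example)
```

Here `result` holds the root name `solve` and a rewritten source in which `helper` is defined inside `solve`. A source with two independent top-level functions returns `None`, which mirrors the "discard the whole item" rule above.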
Another problem with the dataset is that most of its problems appear in the training data of models.
I'm working on a better LeetCode dataset on this branch: https://github.com/nuprl/MultiPL-E/tree/new-leetcode
This is based on verified LeetCode contest solutions, sourced from: https://github.com/deepseek-ai/DeepSeek-Coder/blob/main/Evaluation/LeetCode/data/20240121-Jul.jsonl
These are all leetcode problems released after Jan 2024.
You can use it now if you'd like. I haven't merged it yet because I want to hand-verify the solutions myself.
Let me know if you have other questions.