Artifacts of our paper published at the NeurIPS 2023 Workshop on Generative AI for Education (GAIED)
For detailed insights and methodologies, you can access our paper titled "Improving the Coverage of GPT for Automated Feedback on High School Programming Assignments" through the following link.
If you find our research or dataset useful, please consider citing it using the BibTeX entry below:
@inproceedings{sahai2023improving,
  title={Improving the Coverage of GPT for Automated Feedback on High School Programming Assignments},
  author={Sahai, Shubham and Ahmed, Umair Z. and Leong, Ben},
  booktitle={Proceedings of the NeurIPS'23 Workshop on Generative AI for Education (GAIED)},
  year={2023}
}
This dataset contains 69 programming assignments and 366 buggy student submissions from a large public school, along with GPT-generated feedback that has been manually verified for correctness.
It is structured into Excel sheets as follows:
- Summary
- Problems
- Testcases
- Buggy Submissions
- Correct Submissions
- GPT Feedback
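A minimal sketch of loading the workbook with pandas (the file name dataset.xlsx is a placeholder, not the actual artifact name; substitute the file shipped with this repository):

```python
import pandas as pd

# Read every sheet into a dict of DataFrames keyed by sheet name.
# NOTE: "dataset.xlsx" is a placeholder file name (assumption).
sheets = pd.read_excel("dataset.xlsx", sheet_name=None)

problems = sheets["Problems"]
buggy = sheets["Buggy Submissions"]
print(problems.columns.tolist())  # expect: pid, description, prefix, suffix
```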
Below is a detailed explanation of each sheet, along with the meanings of their respective column names.
The Summary sheet contains the following metrics:
- Dataset size: Total number of problems and submissions in the dataset.
- Repair coverage: The proportion of submissions for which repairs were successful across multiple iterations.
- Feedback confusion matrix @1: A confusion matrix providing insights into the quality of the first feedback provided by the GPT model.
- Feedback quality: Measures the quality of the feedback, both for the first iteration (@1) and over multiple iterations (@N); a derivation sketch follows this list.
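The Summary sheet reports these metrics precomputed. As a hedged sketch of how feedback quality might be re-derived from the per-submission confusion-matrix counts in the GPT Feedback sheets (the paper's exact definition may differ, so treat this as illustrative):

```python
def feedback_precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    # One plausible derivation from the TP/FP/FN columns described below;
    # the paper's exact "Feedback quality" formula is not restated here.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts: 8 valid items, 2 false positives, 1 missed error.
print(feedback_precision_recall(tp=8, fp=2, fn=1))
```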
The Problems sheet contains information about the programming problems assigned to high-school students.
- pid: Problem ID - a unique identifier for each problem.
- description: Description of the problem.
- prefix: Code prefix that is common across all solutions for this problem; it should be added to the front of the student submission before evaluation.
- suffix: Code suffix that is common across all solutions for this problem; it should be appended to the student submission before evaluation (see the assembly sketch after this list).
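Since the prefix and suffix are common scaffolding, a submission must be wrapped with them before it is evaluated. A minimal sketch, assuming the sheets have been loaded as above:

```python
def assemble_program(problem_row, student_code: str) -> str:
    # Prepend the common prefix and append the common suffix, as described
    # above. Empty prefix/suffix cells may load as NaN, so guard for strings.
    prefix = problem_row["prefix"] if isinstance(problem_row["prefix"], str) else ""
    suffix = problem_row["suffix"] if isinstance(problem_row["suffix"], str) else ""
    return f"{prefix}\n{student_code}\n{suffix}"
```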
The Testcases sheet holds the test cases used to evaluate the student submissions.
- pid: Problem ID - links the test case to its corresponding problem.
- tid: Testcase ID - a unique identifier for each test case.
- input: Left-hand side (LHS) of the expression used in the test case.
- output: Right-hand side (RHS) of the expression that the LHS is compared against (see the evaluation sketch after this list).
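Given that input holds the LHS of an expression and output holds the RHS it is compared against, one plausible way to check a test case is to evaluate both sides in the namespace of the assembled program. This sketch assumes the submissions and test expressions are Python; running untrusted student code through exec/eval should only ever be done in a sandbox:

```python
def run_testcase(program: str, lhs: str, rhs: str) -> bool:
    # Execute the assembled program (defining the student's functions),
    # then compare the LHS expression's value against the RHS expression's.
    namespace: dict = {}
    exec(program, namespace)  # CAUTION: sandbox untrusted code in practice
    return eval(lhs, namespace) == eval(rhs, namespace)
```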
The Buggy Submissions sheet contains student submissions with bugs or errors.
- pid: Problem ID - indicates the problem to which the submission is related.
- sid: Submission ID - a unique identifier for each student submission.
- code: The actual code submitted by the student.
- failing_testcases: Test cases that the submitted code fails to pass.
- passing_testcases: Test cases that the submitted code passes.
The Correct Submissions sheet contains correct student submissions that pass all test cases.
- pid: Problem ID - indicates the problem to which the submission is related.
- sid: Submission ID - a unique identifier for each student submission.
- code: The actual code submitted by the student.
The GPT Feedback sheets detail the repairs and feedback generated by a GPT model. These sheets are named gpt<X>_<N>, where <X> is 3 or 4, indicating the GPT-3.5 Turbo or GPT-4 model respectively, and <N> is 1 or N, indicating a single repair iteration or multiple iterations respectively.
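For example, continuing the loading sketch above (the literal sheet names are an assumption; verify them against the workbook):

```python
# Build a sheet name following the gpt<X>_<N> scheme.
model, iterations = 4, "N"               # GPT-4, multiple iterations
sheet_name = f"gpt{model}_{iterations}"  # -> "gpt4_N"
gpt_feedback = pd.read_excel("dataset.xlsx", sheet_name=sheet_name)
```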
Each of these sheets contains the following columns:
- pid: Problem ID - a unique identifier for each problem.
- sid: Submission ID - a unique identifier for each student submission.
- student_code: The original code submitted by the student.
- repaired_code: The code after the GPT model's attempted repair.
- feedback: Specific feedback or comments provided by the GPT model for specific line numbers, in JSON format (a parsing sketch follows this list).
- num_iteration: Number of iterations taken by the GPT model for a successful repair, with a MAX of 5.
- isRepairSuccess: Indicates whether the GPT model successfully repaired the code.
- "# feedback": Number of feedback items generated.
- "# TP Valid": Number of valid true-positive feedback items.
- "# FP Extra": Number of false positives that are extraneous suggestions or optimizations.
- "# FP Invalid": Number of false positives that are invalid and do not contribute to a successful repair.
- "# FP Hallucinate": Number of false positives that are hallucinations, i.e., feedback fabricated or unrelated to the student's code.
- "# FN Missed": Number of false negatives, i.e., missed error detections.
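Because the feedback column stores line-keyed comments as JSON, here is a hedged parsing sketch (the exact JSON schema is an assumption; inspect a few rows before relying on it):

```python
import json

def parse_feedback(cell: str) -> dict:
    # ASSUMPTION: the JSON maps line numbers to comment strings,
    # e.g. '{"3": "off-by-one in loop bound"}'. Empty or malformed
    # GPT output (see the note below) parses to an empty dict.
    try:
        return json.loads(cell)
    except (TypeError, json.JSONDecodeError):
        return {}

for _, row in gpt_feedback.iterrows():  # gpt_feedback loaded as sketched above
    for line_no, comment in parse_feedback(row["feedback"]).items():
        print(row["pid"], row["sid"], line_no, comment)
```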
- Empty GPT output: Some of the GPT-generated output is empty due to (a) incorrect JSON formatting of the output, or (b) exceeding the maximum token limit or time limit. While both cases have been accounted for in calculating the 'Repair Coverage' metric, case (b) was deliberately excluded from the manual 'Feedback Quality' assessment. This ensures an accurate assessment of the GPT models' performance by not penalizing them unnecessarily with inflated 'False Negative' counts.
For any clarifications about the dataset, please contact the authors, Shubham Sahai and Umair Z. Ahmed, via email. Your feedback would be much appreciated.