free-response-scoring
This repository shares code used to implement the methods described in Unsupervised Machine Scoring of Free Response Answers—Validated Against Law School Final Exams, presented at the Computational Legal Studies Conference, March 2022, hosted by the Center for Computational Law at Singapore Management University.
All relevant content is either included in, or linked to from, the notebook titled Score Exams.
Some of the data presented here differ slightly from those in the version presented at the CLS Conference and in the slide deck above. This is due to additional work done in response to feedback received after the paper was presented. The results now show the difference in performance between pseudo-random and machine ordering after both machine and human markings are converted into z-scores. Converting both sets of scores into z-scores allowed the machine score to be compared with the human score using the intraclass correlation coefficient (ICC) and Cohen's kappa. The CLS presentation compared the machine scores only to pseudo-random scores, without transforming the human scores and after converting the machine score into a numerical grade based on a standard grading scale (e.g., 90, 80, etc.). Older versions of this notebook with prior results can be found here.
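As a rough illustration of that normalization step (not the notebook's exact code, and using made-up numbers), converting both sets of marks to z-scores before computing agreement statistics might look like this:

```python
# Illustrative sketch only: both human marks and machine scores are converted
# to z-scores so they sit on a common scale before agreement statistics such
# as ICC and Cohen's kappa are computed. Values below are invented.
import numpy as np
from scipy.stats import zscore

human_marks    = np.array([88, 92, 75, 81, 95], dtype=float)
machine_scores = np.array([0.71, 0.83, 0.42, 0.55, 0.90])

human_z   = zscore(human_marks, ddof=1)    # (x - mean) / sample std
machine_z = zscore(machine_scores, ddof=1)
print(np.round(human_z, 2), np.round(machine_z, 2))
```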
Paper Summary
This paper presents a novel method for unsupervised machine scoring of short answer and essay question responses. It relies solely on a sufficiently large set of responses to a common prompt, with no need for pre-labeled sample answers, provided the prompt is of a particular character: one where “good” answers look similar to each other while “wrong” answers tend to be “wrong” in different ways. Consequently, when a collection of text embeddings for responses to a common prompt is placed in an appropriate feature space, the centroid of those embeddings can stand in for a model answer, providing a lodestar against which to measure individual responses. This paper examines the efficacy of this method and discusses potential applications.
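The core idea lends itself to a short illustration. The sketch below is not the notebook's implementation; it uses TF-IDF vectors from scikit-learn purely as a stand-in for whatever embedding the paper actually uses, and the function and example answers are invented for the demonstration.

```python
# Minimal sketch of centroid-based scoring: embed all responses to a common
# prompt, treat the centroid of the embeddings as a stand-in model answer,
# and rank each response by its similarity to that centroid.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_responses(responses):
    """Rank responses by cosine similarity to the centroid of all responses."""
    vectors = TfidfVectorizer().fit_transform(responses).toarray()
    centroid = vectors.mean(axis=0, keepdims=True)          # stand-in "model answer"
    scores = cosine_similarity(vectors, centroid).ravel()    # closeness to centroid
    order = np.argsort(-scores)                              # best first
    return [(responses[i], scores[i]) for i in order]

answers = [
    "Adverse possession requires open, notorious, and continuous use.",
    "The claimant must occupy the land openly and continuously.",
    "The answer is the rule against perpetuities.",  # off-topic response
]
for text, score in rank_responses(answers):
    print(f"{score:.3f}  {text}")
```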
Current methods for the automated scoring of short answer and essay questions are poorly suited to spontaneous and idiosyncratic assessments: the time saved in grading must be balanced against the time required to train a model, including tasks such as the creation of pre-labeled sample answers. This limits the utility of machine grading for a single class working with novel assessments. The method described here eliminates the need to prepare pre-labeled sample answers. It is the author’s hope that such a method can be leveraged to reduce the time needed to grade free response questions, promoting the increased adoption of formative assessment, especially in contexts like law school instruction, which has traditionally relied almost exclusively on summative assessments.
The algorithm's ordering is found to differ from a pseudo-random shuffle to a statistically significant degree. To determine how similar a list’s order was to that produced by a human grader, the minimum number of neighbor swaps needed to transform the list's ordering into the human ordering was calculated. The dataset included more than one thousand student answers to a set of thirteen free response questions, drawn from six Suffolk University Law School final exams taught by five instructors. A paired t-test of the two populations’ swap counts, with the pseudo-random group acting as the untreated group and the machine grader acting as the treatment, produced a p-value of 0.000000334, allowing us to reject the null hypothesis that the machine's ordering is equivalent to a random shuffle. Additionally, Cohen’s d for the difference in swap counts between the pseudo-random ordering and the machine ordering was large (i.e., 1.03).
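For readers who want to see the evaluation mechanics, the sketch below illustrates the idea: the minimum number of adjacent (neighbor) swaps between two orderings equals the number of inversions between them, and the resulting swap counts for machine and pseudo-random orderings can be compared with a paired t-test and Cohen's d. The swap counts here are hypothetical; see the Score Exams notebook for the actual analysis.

```python
# Illustrative sketch, not the notebook's code. Counts the minimum adjacent
# swaps needed to turn a candidate ordering into the human ordering, then
# compares machine vs. pseudo-random swap counts with a paired t-test and a
# paired Cohen's d.
import numpy as np
from scipy.stats import ttest_rel

def adjacent_swaps_to(candidate, human):
    """Minimum adjacent swaps to transform `candidate` into `human`
    (equal to the number of inversions between the two orderings)."""
    position = {item: i for i, item in enumerate(human)}
    seq = [position[item] for item in candidate]
    swaps = 0
    for i in range(len(seq)):              # O(n^2) count; fine for class-sized lists
        for j in range(i + 1, len(seq)):
            swaps += seq[i] > seq[j]
    return swaps

def cohens_d_paired(a, b):
    """Cohen's d for paired samples: mean difference over std of differences."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return diff.mean() / diff.std(ddof=1)

# Hypothetical swap counts, one pair per exam question.
random_swaps  = [140, 152, 133, 160, 149]
machine_swaps = [ 98, 110,  91, 120, 104]
t_stat, p_value = ttest_rel(random_swaps, machine_swaps)
print(p_value, cohens_d_paired(random_swaps, machine_swaps))
```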