The Hewlett Foundation has provided a set of high school student essays along with scores assigned by expert human graders. The data was released in 2012 [1] as part of a Kaggle competition to produce an automated student assessment algorithm whose scores closely match the human ones. Submissions are evaluated with the quadratic weighted kappa metric, which measures the agreement between two raters.
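As a quick illustration of the metric, here is a minimal sketch using scikit-learn's `cohen_kappa_score`, with toy scores standing in for real grades:

```python
from sklearn.metrics import cohen_kappa_score

# Human scores vs. model predictions for a handful of essays (toy data)
human = [2, 3, 4, 4, 1, 3]
model = [2, 3, 3, 4, 2, 3]

# weights='quadratic' penalizes disagreement by the squared distance
# between ratings, so being off by two counts four times as much as
# being off by one; 1.0 is perfect agreement, 0.0 is chance level.
print(cohen_kappa_score(human, model, weights='quadratic'))
```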
Since then, a few teams have published their attempts to match or improve on the original challenge. Initial results were somewhat disappointing [2], with kappa scores around 0.5, but improved significantly once modern NLP with neural networks was incorporated [3].
Sadly, it is no longer possible to submit kernels, and the human-graded scores for the validation and test sets have never been made public. It is therefore questionable to compare kappa scores obtained on the training data with the Kaggle leaderboard, where models were trained on the full training set, validated on the validation set, and evaluated on the test set.
Further confounding the issue, some published kappa scores are based on a subset of essay topics [4], or are computed by pooling all scores together rather than evaluating each topic individually.
My goal is to see if current NLP algorithms can improve upon the 2012 attempts. Additionally, can the essay and score combinations be used to automatically generate essays?
Assessment can be performed in a variety of ways. In an unsupervised approach, for example, topic modeling can assign scores based on the word probabilities it derives, as in the sketch below.
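A minimal sketch of this idea with scikit-learn's LDA, using made-up essay snippets; the number of topics and the mapping from topic probabilities to scores are assumptions to be tuned per essay set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

essays = [
    "Computers let students practice writing and get instant feedback.",
    "The patient librarian helped me find sources in the quiet library.",
    "Schools that invest in computers see students writing more often.",
]

# Bag-of-words counts; in practice rare and overly common terms
# would be filtered out first.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(essays)

# A small LDA model; the number of topics would be tuned per essay set.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is an essay's topic distribution; these probabilities (or
# the overlap with a topic's top words) can be mapped to a score.
print(doc_topics.round(2))
```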
A supervised approach is possible by applying machine learning to extracted features such as named entities, syntax, or labelled dependencies.
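A sketch of what that feature extraction might look like with spaCy feeding a simple regressor; the specific features, model choice, and toy data here are illustrative assumptions, not the project's exact pipeline:

```python
import spacy
from sklearn.linear_model import Ridge

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def essay_features(text):
    """Simple hand-crafted features from spaCy's linguistic annotations."""
    doc = nlp(text)
    n_tokens = max(len(doc), 1)
    return [
        n_tokens,                                 # essay length
        len(list(doc.sents)),                     # sentence count
        len(doc.ents),                            # named entities
        sum(t.pos_ == "ADJ" for t in doc),        # adjectives
        sum(t.dep_ == "nsubj" for t in doc),      # subject dependencies
        len({t.lemma_ for t in doc}) / n_tokens,  # lexical diversity
    ]

# Toy training pair: feature vectors against human grades.
essays = ["Dear local newspaper, computers benefit society ...",
          "The mood of the story is calm and patient ..."]
scores = [4.0, 2.0]
model = Ridge().fit([essay_features(e) for e in essays], scores)
```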
Neural networks have the advantage of working very well with word embeddings and their typically large number of features (dimensions).
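A minimal Keras sketch of such a model, treating scoring as regression over embedded word sequences; the vocabulary size, sequence length, and layer sizes are assumed hyperparameters:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

VOCAB_SIZE = 20000  # tokenizer vocabulary size (assumed preprocessing)
MAX_LEN = 500       # essays padded/truncated to this many tokens
EMBED_DIM = 300     # e.g. to match pretrained GloVe/word2vec vectors

model = Sequential([
    Input(shape=(MAX_LEN,)),
    # Each word index becomes a dense 300-dimensional vector; the
    # weights can be initialized from pretrained embeddings.
    Embedding(VOCAB_SIZE, EMBED_DIM),
    LSTM(128),  # sequence -> fixed-size essay representation
    Dense(1),   # regress the essay score
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```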
Essay generation can be performed with recurrent neural network algorithms.
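For instance, a classic character-level RNN can be trained on the essay text to predict the next character, then sampled to produce new text. A minimal sketch, with the context window and vocabulary size as assumptions:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

SEQ_LEN = 40   # characters of context fed to the network
N_CHARS = 60   # character vocabulary size (assumed)

# Train to predict the next character from the previous SEQ_LEN characters.
model = Sequential([
    Input(shape=(SEQ_LEN, N_CHARS)),  # one-hot encoded character windows
    LSTM(256),
    Dense(N_CHARS, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

def sample(preds, temperature=1.0):
    """Sample the next character index; higher temperature = more varied text."""
    logits = np.log(np.asarray(preds, dtype="float64") + 1e-9) / temperature
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return int(np.argmax(np.random.multinomial(1, probs)))
```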
The data is provided as separate training, test, and validation sets. The training data contains about 1,700 essays for each of seven topics and about 500 essays for an eighth topic. Essays are either source-dependent responses or persuasive/narrative/expository pieces on a given topic. The code is executed across four notebooks (a data-loading sketch follows the list):
1. EDA and Topic Modeling with LDA
2. Automatic Scoring with Machine Learning
3. Automatic Scoring with Neural Networks
4. Automatic Student Essay Generation
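A minimal loading sketch for the training set, assuming the file and column names from the Kaggle release (`training_set_rel3.tsv`, `essay_set`, `domain1_score`) and its non-UTF-8 encoding:

```python
import pandas as pd

# File and column names as in the Kaggle release (assumed here):
# essay_set identifies the topic (1-8) and domain1_score holds
# the resolved human score.
train = pd.read_csv("training_set_rel3.tsv", sep="\t", encoding="latin-1")

# Essay counts per topic: roughly 1,700 each for sets 1-7, ~500 for set 8.
print(train["essay_set"].value_counts().sort_index())
print(train.groupby("essay_set")["domain1_score"].describe())
```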
This problem set has commercial impact far beyond student assessment, and many applications can be tackled with nearly the same approach. For example:
- Given a set of financial documents, which one should a manager read first?
- Which products can be effectively marketed to users based on their social media postings?
- Detecting fake news vs. real news.
- Sentiment analysis on a graded scale, e.g. very upset - upset - satisfied - happy - very happy.
The original Kaggle competition can be found here:
https://www.kaggle.com/c/asap-aes/data
A selection of published work on the Kaggle ASAP data is given below:
[1] https://www.kaggle.com/c/asap-aes
[2] https://nlp.stanford.edu/courses/cs224n/2013/reports/song.pdf
[3] http://aclweb.org/anthology/D/D16/D16-1193.pdf
[5] https://github.com/m-chanakya/AutoEssayGrading/blob/master/papers/paper1.pdf
[6] http://dspace.bracu.ac.bd/xmlui/bitstream/handle/10361/5399/12101114.pdf?sequence=1&isAllowed=y