Automatic_Essay_Grader

Trained a model to automatically grade the essays from the Kaggle ASAP-AES dataset: https://www.kaggle.com/c/asap-aes

Instructions to test the results:

  1. Train the model by running the train.py file
  2. Test the model by running the test.py file (you can add the test set results in this file, as mentioned in the comments of test.py, and evaluate the test set accuracy).

Features considered (a small extraction sketch follows this list):
i) Tf-idf matrix (tf-idf)
ii) Relevance to the source essays (relevance)
iii) Relevance to the prompt (relevance_quesn)
iv) Number of words used (word_count)
v) Number of distinct words used (distinct_word_count)
vi) Number of sentences (sentence_count)
vii) Average word size (avg_word_length)
viii) Essay set (essay_set)
(Note: Grammatical and spelling mistakes are ignored, since they were not considered while grading, as mentioned in the scoring docs.)
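
A minimal sketch of how the count-based features (iv-vii) could be computed with NLTK; the function name and returned keys below are illustrative, not the exact code in train.py:

```python
import nltk  # requires: nltk.download("punkt")

def count_features(essay_text):
    """Compute the simple count-based features for a single essay."""
    words = nltk.word_tokenize(essay_text)
    sentences = nltk.sent_tokenize(essay_text)
    word_count = len(words)
    distinct_word_count = len(set(w.lower() for w in words))
    sentence_count = len(sentences)
    avg_word_length = sum(len(w) for w in words) / word_count if word_count else 0.0
    return {
        "word_count": word_count,
        "distinct_word_count": distinct_word_count,
        "sentence_count": sentence_count,
        "avg_word_length": avg_word_length,
    }
```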

The following graph shows the kappa score on essay set 1 for different feature sets:
[figure: kappa scores on essay set 1 by feature set]

featureset1: tfidf with unigrams, relevance, relevance_quesn, essay_set
featureset2: relevance, relevance_quesn, word_count, distinct_word_count, essay_set, sentence_count, avg_word_length
featureset3: relevance, relevance_quesn, word_count, distinct_word_count, essay_set, sentence_count, avg_word_length (considering 1, 2, 3 and 4-grams)

Thus featureset3 looks most promising and is used for all the models.
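
A sketch of how the relevance and relevance_quesn features in featureset3 could be computed as tf-idf cosine similarity with 1-4 grams; the reference texts and names here are assumptions, not the exact implementation in train.py:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance_scores(essays, reference_text, ngram_range=(1, 4)):
    """Cosine similarity between each essay and a reference text (source essay or prompt)."""
    vectorizer = TfidfVectorizer(ngram_range=ngram_range, stop_words="english")
    matrix = vectorizer.fit_transform(essays + [reference_text])
    essay_vectors, reference_vector = matrix[:-1], matrix[-1]
    return cosine_similarity(essay_vectors, reference_vector).ravel()

# relevance       = relevance_scores(essays, source_essay_text)   # vs. the source essays
# relevance_quesn = relevance_scores(essays, prompt_text)         # vs. the prompt
```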

The data is divided into two sets, viz. set A and set B.
Set A contains essay sets 1, 3, 4, 5, 6, 7 and 8, and set B contains essay set 2.
This division is made because the essays in set 2 require two predictions, i.e. the domain1 and domain2 scores.
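
A sketch of this split, assuming the ASAP training file is loaded into a pandas DataFrame with its standard essay_set column (the file name and encoding are assumptions based on the Kaggle data):

```python
import pandas as pd

data = pd.read_csv("training_set_rel3.tsv", sep="\t", encoding="latin-1")

set_a = data[data["essay_set"] != 2]  # essay sets 1, 3, 4, 5, 6, 7, 8: one score (domain1)
set_b = data[data["essay_set"] == 2]  # essay set 2: two scores (domain1 and domain2)
```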

The following graph shows the feature importances as evaluated on set A:

[figure: feature importances on set A]
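
The README does not say which estimator produced the importance plot; one common way to get such a ranking is from a tree ensemble fit on the same features. A sketch, assuming X_a and y_a hold the featureset3 matrix and domain1 scores for set A and feature_names lists the column names:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_a, y_a)  # X_a, y_a: featureset3 matrix and scores for set A (assumed names)

# Print features from most to least important
for name, importance in sorted(zip(feature_names, forest.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {importance:.3f}")
```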

Model Selection:
(a) For set A:
The following graph shows the kappa scores of the various models trained and tested on essay set 1:

[figure: model kappa scores on essay set 1]

And this is the score variation on set A (i.e. essay sets 1, 3, 4, 5, 6, 7 and 8):
[figure: model kappa scores on set A]

(Note: The increase in score is due to the essay_set feature. Its importance was null when training on a single essay set, since essay_set takes the same value for every data point in that case, but it received a high importance when training on all essay sets, where data points have different essay_set values.)

Owing to the scores in the above graph, SVC is used as the final model for set A.
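
A sketch of the final set A model: an SVC scored with Cohen's kappa on a held-out validation split (the hyperparameters and split ratio are assumptions; the actual settings are in train.py):

```python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score

# X_a, y_a: featureset3 matrix and domain1 scores for set A (assumed names)
X_train, X_val, y_train, y_val = train_test_split(X_a, y_a, test_size=0.2, random_state=0)

svc = SVC(kernel="rbf", C=1.0, gamma="scale")
svc.fit(X_train, y_train)

pred_a = svc.predict(X_val)
print("Set A validation kappa:", cohen_kappa_score(y_val, pred_a))
```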

(b) For set B:

Models evaluated: linear regression, SVR.
The following graph shows the kappa scores for domain 1 of the various models trained and tested on set B:
[figure: model kappa scores on set B, domain 1]

The following graph shows the kappa scores for domain 2 of the various models trained and tested on set B:
[figure: model kappa scores on set B, domain 2]

Owing to the results above, SVR was used as the final model for set B.
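
Since set B needs two scores, one SVR can be fit per domain and the continuous predictions rounded back to integer grades; a sketch with assumed variable names and hyperparameters:

```python
import numpy as np
from sklearn.svm import SVR

# X_b_train, X_b_val and the y_b_* arrays are assumed names for the set B split
svr_domain1 = SVR(kernel="rbf", C=1.0).fit(X_b_train, y_b_domain1_train)
svr_domain2 = SVR(kernel="rbf", C=1.0).fit(X_b_train, y_b_domain2_train)

# Round the continuous regression outputs to the nearest integer grade
domain1_pred = np.rint(svr_domain1.predict(X_b_val)).astype(int)
domain2_pred = np.rint(svr_domain2.predict(X_b_val)).astype(int)
```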

Final scores on individual sets:
[figure: final kappa scores per essay set]

Final scores on overall validation data:
[figure: final scores on the overall validation data]

Final kappa score on validation data: 0.9825. (Note: all scores are calculated on the validation data, since the true labels of the test data were not available.)
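
One way to compute such an overall score is to concatenate the validation predictions from both sets and score them against the true labels. The ASAP competition used quadratic weighted kappa, which scikit-learn exposes via the weights argument; whether the 0.9825 above is the weighted variant is not stated here, so treat this as an assumption:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# y_val, pred_a (set A) and y_b_domain1_val, domain1_pred (set B) are the assumed
# validation labels and predictions from the sketches above
y_true_all = np.concatenate([y_val, y_b_domain1_val])
y_pred_all = np.concatenate([pred_a, domain1_pred])

print("Overall validation kappa:",
      cohen_kappa_score(y_true_all, y_pred_all, weights="quadratic"))
```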

Notes (for extending the model to grade essays written by more mature writers):
i) Features like visual nature (which can be computed using the British National Corpus), beautiful words (using Cornell Math Cryptography) and emotive effectiveness (using MPQA) can be used to score essays written by more mature writers (e.g. Pulitzer prize essays). Since the essays here are written by school kids of grades 8-10 on very short notice, these features are not that significant.
Reference: https://nlp.stanford.edu/courses/cs224n/2013/reports/song.pdf
ii) Spelling errors can be counted using the English word list from Python's NLTK.
iii) Lexical diversity can be accounted for using the ratio of the count of all lexically important POS tags (such as nouns, adjectives and adverbs) to the count of all tags.
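
A sketch of how notes (ii) and (iii) could be implemented with NLTK; the choice of word list and of the "lexically important" tag prefixes is an assumption:

```python
import nltk
from nltk.corpus import words  # requires: nltk.download("words"), "punkt", "averaged_perceptron_tagger"

english_vocab = set(w.lower() for w in words.words())

def spelling_error_count(essay_text):
    """Count alphabetic tokens that do not appear in NLTK's English word list."""
    tokens = nltk.word_tokenize(essay_text)
    return sum(1 for t in tokens if t.isalpha() and t.lower() not in english_vocab)

def lexical_diversity(essay_text):
    """Ratio of lexically important POS tags (nouns, adjectives, adverbs) to all tags."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(essay_text))]
    important = [t for t in tags if t.startswith(("NN", "JJ", "RB"))]
    return len(important) / len(tags) if tags else 0.0
```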