CommonLit Readability Prize

CommonLit, Inc., is a nonprofit education technology organization serving over 20 million teachers and students with free digital reading and writing lessons for grades 3-12. With Georgia State University, an R1 public research university launched this Kaggle competition on May 3, 2021 to improve redability rating methods.

We will be predicting the reading ease of excerpts from literature.

Dataset

Files

train.csv - the training set
test.csv - the test set
sample_submission.csv - a sample submission file in the correct format

Columns

id - unique ID for excerpt
url_legal - URL of source - this is blank in the test set.
license - license of source material - this is blank in the test set.
excerpt - text to predict reading ease of
target - reading ease
standard_error - measure of spread of scores among multiple raters for each excerpt. Not included for test data.

BERT

In this project, BERT model is used to predict the readability of text samples BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art language model developed by Google. It is a deep learning model that is able to capture the context of words in a sentence in both forward and backward directions. This allows the model to understand the relationships between words in a sentence and produce highly accurate predictions.

In this project, we use BER-Base-Uncased to fine-tune on the CommonLit Readability Prize dataset. BERT Base Uncased is a pre-trained BERT model that was trained on lower-cased English text. It has:

Model	Layers	Hidden Units	Parameters
BERT Base Uncased	12	768	110 million

The "uncased" part of the name means that the model was trained on text that was converted to lower case. This means that the model treats "hello" and "Hello" as the same word, and reduces the size of the vocabulary.

Accuracy Metric

The accuracy metric used in this project is Mean Squared Error (MSE). The MSE is a measure of how well a regression model is able to predict the outcome variable. It is calculated by taking the average of the squared differences between the predicted values and the actual values. A lower MSE indicates better performance of the model.

Loss Functions

The loss function is a measure of how well the model is able to predict the outcome variable during the training process. In this project, the loss function used is Mean Squared Error (MSE) as well. During training, the model is optimized to minimize the MSE loss.

brijes-h/CommonLit-Readability-Prize--using-BERT