automated_essay_scoring


Here's a detailed analysis of the provided code:

Imports and Dependencies

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from keras.utils import to_categorical
from keras.layers import *
from keras.models import Model, Sequential
import matplotlib.pyplot as plt
import seaborn as sns
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from sklearn.model_selection import train_test_split
from keras.callbacks import ModelCheckpoint

The code uses several libraries:

  • Pandas and NumPy for data manipulation.
  • Scikit-learn for preprocessing and machine learning models.
  • Keras (part of TensorFlow) for building and training neural networks.
  • Matplotlib and Seaborn for data visualization.

Data Loading

dataset = pd.read_csv("../dataset/learning-agency-lab-automated-essay-scoring-2/train.csv")

The training split is loaded from a CSV file; the path points at the Kaggle Learning Agency Lab Automated Essay Scoring 2.0 competition data.
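A quick sanity check right after loading (the essay_id column is an assumption based on the Kaggle AES 2.0 layout; full_text and score are the columns the rest of the notebook relies on):

print(dataset.shape)
print(dataset.columns.tolist())  # expected: ['essay_id', 'full_text', 'score']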

Data Exploration and Preprocessing

  1. Basic Information

    dataset.head()
    dataset.dtypes
    dataset.isna().sum()

    Displaying the first few rows, inspecting the column data types, and counting missing values per column.

  2. Text Cleaning

    dataset.full_text = dataset.full_text.replace("[^a-zA-Z0-9 ]", "",regex=True)

    Removing every character that is not a letter, digit, or space, so punctuation and newlines are stripped while word boundaries are preserved.

  3. Tokenization

    tokenizer = Tokenizer(num_words=10000)

    Initializing a tokenizer that keeps only the 10,000 most frequent words (a toy illustration follows this list).

  4. Word Count Feature

    def apply_func(x):
        splitted = x.split()
        return len(splitted)
    
    dataset["word_num"] = dataset.full_text.apply(apply_func)

    Creating a new column with the number of whitespace-separated words in each essay (a vectorized pandas equivalent is shown just below).
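As a toy illustration of steps 3 and 4 (the sentences are hypothetical, not from the dataset): the tokenizer assigns integer indices by descending word frequency, and the word-count feature can be computed equivalently with pandas string methods:

# Toy corpus showing how the tokenizer maps words to indices
toy = Tokenizer(num_words=10000)
toy.fit_on_texts(["the cat sat", "the dog sat down"])
print(toy.texts_to_sequences(["the cat sat down"]))  # [[1, 3, 2, 5]]

# Vectorized equivalent of apply_func, using pandas string methods
df = pd.DataFrame({"full_text": ["the cat sat", "the dog sat down"]})
df["word_num"] = df.full_text.str.split().str.len()
print(df.word_num.tolist())  # [3, 4]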

Exploratory Data Analysis (EDA)

dataset.describe()
sns.countplot(data=dataset, x="score")
sns.boxenplot(data=dataset, x="score", y="word_num")
sns.regplot(data=dataset, x="score", y="word_num")

Basic statistical descriptions and visualizations:

  • Count plot of scores.
  • Boxen plot and regression plot to explore the relationship between word count and score.

Data Filtering

data = dataset.copy()
data = data.sort_values(by="word_num", ascending=False)
ninety_nine = round(len(data) * 0.01)  # number of rows in the top 1%
data = data.iloc[ninety_nine:, :]      # drop them, keep the remaining 99%
data = data.sort_index().reset_index()
data.drop("index", axis=1, inplace=True)

Dropping the 1% of essays with the highest word counts to remove length outliers.
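A roughly equivalent, more direct filter (not in the original notebook) keeps everything at or below the 99th percentile of word counts:

cutoff = dataset.word_num.quantile(0.99)
data = dataset[dataset.word_num <= cutoff].reset_index(drop=True)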

Text Preprocessing

texts = data.full_text.values
tokenizer.fit_on_texts(texts)
texts = tokenizer.texts_to_sequences(data.full_text.values)
max_len = max([len(i) for i in texts])
padded = pad_sequences(texts, maxlen=max_len, padding="post")

Converting texts to sequences of integers and padding them to the same length.
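A toy example (hypothetical sequences) of how post-padding works: shorter sequences are filled with zeros on the right up to maxlen:

print(pad_sequences([[1, 2, 3], [4, 5]], maxlen=4, padding="post"))
# [[1 2 3 0]
#  [4 5 0 0]]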

Preparing Features and Labels

y = to_categorical(data.score.values, num_classes=7)
scaler = MinMaxScaler()
X = scaler.fit_transform(data.word_num.values.reshape(-1, 1))

  • Labels (y): One-hot encoding the scores; with num_classes=7 the raw score values serve directly as class indices.
  • Features (X): Min-max scaling the word counts, reshaped into a single column of shape (n_samples, 1) as MinMaxScaler expects (a short demonstration follows this list).
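The reshape matters because MinMaxScaler scales each column independently, so the word counts must arrive as one column per feature rather than one row. A minimal demonstration with made-up counts:

counts = np.array([100, 250, 400]).reshape(-1, 1)
print(MinMaxScaler().fit_transform(counts).ravel())  # [0.  0.5 1. ]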

Model Definition

def create_model():
    input_lstm = Input(shape=(max_len,))
    text_ai_input = Embedding(input_dim=10000, input_length=max_len, output_dim=128)(input_lstm)
    text_ai_lstm_1 = LSTM(128, return_sequences=True)(text_ai_input)
    text_ai_dr1 = Dropout(0.3)(text_ai_lstm_1)
    text_ai_lstm2 = LSTM(128)(text_ai_dr1)
    text_ai_dr2 = Dropout(0.2)(text_ai_lstm2)
    text_ai_dense = Dense(128, activation="relu")(text_ai_dr2)

    linear_model_input = Input(shape=(1, ))
    linear_model_dense = Dense(128, activation="relu")(linear_model_input)
    linear_model_dr = Dropout(0.2)(linear_model_dense)
    linear_model_dense2 = Dense(128, activation="relu")(linear_model_dr)
    linear_model_dr2 = Dropout(0.2)(linear_model_dense2)
    linear_model_dense3 = Dense(128, activation="relu")(linear_model_dr2)

    concated_layer = Concatenate()([text_ai_dense, linear_model_dense3])
    concated_dense = Dense(256, activation="relu")(concated_layer)
    concated_dr = Dropout(0.2)(concated_dense)
    concated_dense2 = Dense(128)(concated_dr)
    out = Dense(7, activation="softmax")(concated_dense2)

    model = Model(inputs=[input_lstm, linear_model_input], outputs=out)
    model.summary()
    return model

Creating a combined neural network model with:

  • Text model: LSTM layers to process the essay texts.
  • Linear model: Dense layers to process the word count feature.
  • Concatenation: Combining the outputs of both branches into a single classification head (a usage sketch follows this list).
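To make the two-input wiring concrete, here is a minimal usage sketch (not in the original notebook; it assumes padded and X from the preprocessing steps above):

model = create_model()
# the text branch expects (batch, max_len) integer sequences,
# the numeric branch expects (batch, 1) scaled word counts
preds = model.predict([padded[:2], X[:2]])
print(preds.shape)  # (2, 7): one softmax distribution over the 7 classes per essay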

Model Training

model = create_model()
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
txt_train, txt_test, X_train, X_test, y_train, y_test = train_test_split(padded, X, y, train_size=0.8, random_state=42)
cp = ModelCheckpoint(filepath="essay", save_best_only=True, mode="min", monitor="val_loss")
model.fit([txt_train, X_train], y_train, epochs=100, validation_data=([txt_test,X_test], y_test), callbacks=[cp])

  • Model Compilation: Categorical cross-entropy loss with the Adam optimizer and accuracy as the tracked metric.
  • Data Splitting: An 80/20 train/test split applied jointly to the padded sequences, scaled word counts, and labels.
  • Model Checkpoint: Saving the best model whenever val_loss improves.
  • Model Training: Training for up to 100 epochs (an early-stopping variant is sketched after this list).
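Since the title of the history plot below notes that the best checkpoint landed at epoch 11, an early-stopping callback (an optional refinement, not in the original code) would avoid running all 100 epochs:

from keras.callbacks import EarlyStopping
es = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
model.fit([txt_train, X_train], y_train, epochs=100,
          validation_data=([txt_test, X_test], y_test), callbacks=[cp, es])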

Model Evaluation and Saving

h = model.history
history_df = pd.DataFrame(h.history)
history_df.plot(title="Model Saved on 11th epoch")

Plotting the training history.
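For reference, the history dictionary holds one list per tracked metric, so with metrics=["accuracy"] and validation data the keys are:

print(h.history.keys())
# dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])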

## REMEMBER
# Save tokenizer
# Save MinMaxScaler
# Save model

Reminder to save the tokenizer, MinMaxScaler, and model.
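A minimal sketch of that persistence step (file names are illustrative, not from the notebook):

import pickle
import joblib

with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)          # save the fitted Tokenizer
joblib.dump(scaler, "scaler.joblib")   # save the fitted MinMaxScaler
model.save("essay_model.keras")        # save the trained model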

Summary

This code provides a complete pipeline for predicting essay scores from text content and word count: data loading, preprocessing, exploratory analysis, model building, training, and evaluation. The model pairs an LSTM branch for the essay text with a dense branch for the word-count feature and merges them into a single softmax classifier.