Here's a detailed analysis of the provided code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from keras.utils import to_categorical
from keras.layers import *
from keras.models import Model, Sequential
import matplotlib.pyplot as plt
import seaborn as sns
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from sklearn.model_selection import train_test_split
from keras.callbacks import ModelCheckpoint
The code uses several libraries:
- Pandas and NumPy for data manipulation.
- Scikit-learn for preprocessing and machine learning models.
- Keras (part of TensorFlow) for building and training neural networks.
- Matplotlib and Seaborn for data visualization.
dataset = pd.read_csv("../dataset/learning-agency-lab-automated-essay-scoring-2/train.csv")
The dataset is loaded from a CSV file.
**Basic Information**
dataset.head()
dataset.dtypes
dataset.isna().sum()
Displaying the first few rows, inspecting the data types, and counting missing values.
**Text Cleaning**
dataset.full_text = dataset.full_text.replace("[^a-zA-Z0-9 ]", "", regex=True)
Removing non-alphanumeric characters from the text.
**Tokenization**
tokenizer = Tokenizer(num_words=10000)
Initializing a tokenizer that will keep the top 10,000 most frequent words.
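As a quick illustration (a toy example, not part of the original pipeline) of how the tokenizer assigns integer indices by word frequency:

```python
# Toy demo: words are indexed by descending frequency, and only the
# num_words most frequent are kept when converting texts to sequences.
demo = Tokenizer(num_words=10000)
demo.fit_on_texts(["the cat sat", "the cat ran"])
print(demo.word_index)                       # {'the': 1, 'cat': 2, 'sat': 3, 'ran': 4}
print(demo.texts_to_sequences(["the cat"]))  # [[1, 2]]
```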
**Word Count Feature**
def apply_func(x):
    splitted = x.split()
    return len(splitted)

dataset["word_num"] = dataset.full_text.apply(apply_func)
Creating a new column that counts the number of words in each essay.
dataset.describe()
sns.countplot(data=dataset, x="score")
sns.boxenplot(data=dataset, x="score", y="word_num")
sns.regplot(data=dataset, x="score", y="word_num")
Basic statistical descriptions and visualizations:
- Count plot of scores.
- Boxen plot and regression plot to explore the relationship between word count and score.
data = dataset.copy()
data = data.sort_values(by="word_num", ascending=False)
ninety_nine = round(len(data) * 0.01)  # number of rows in the top 1%
data = data.iloc[ninety_nine:, :]      # drop the 1% longest essays
data = data.sort_index().reset_index()
data.drop("index", axis=1, inplace=True)
Filtering out the top 1% of essays with the highest word counts to remove outliers.
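The same cutoff can be expressed more directly with pandas' quantile; a minimal sketch (equivalent up to rounding at the boundary):

```python
# Keep essays at or below the 99th percentile of word count.
cutoff = dataset.word_num.quantile(0.99)
data = dataset[dataset.word_num <= cutoff].reset_index(drop=True)
```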
texts = data.full_text.values
tokenizer.fit_on_texts(texts)
texts = tokenizer.texts_to_sequences(data.full_text.values)
max_len = max([len(i) for i in texts])
padded = pad_sequences(texts, maxlen=max_len, padding="post")
Converting texts to sequences of integers and padding them to the same length.
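For intuition, a toy example of what the conversion and post-padding produce (illustrative only):

```python
# Shorter sequences are right-padded with zeros so all rows share one length.
seqs = [[5, 2, 9], [7, 1]]
print(pad_sequences(seqs, maxlen=4, padding="post"))
# [[5 2 9 0]
#  [7 1 0 0]]
```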
y = to_categorical(data.score.values, num_classes=7)
scaler = MinMaxScaler()
# MinMaxScaler expects a 2-D (n_samples, n_features) array, so the word
# counts are reshaped into a single column before scaling. Wrapping the
# values in a list would produce a single row, scaling every count to 0.
X = scaler.fit_transform(data.word_num.values.reshape(-1, 1))
- Labels (`y`): Converting scores to categorical (one-hot) format.
- Features (`X`): Scaling word counts to [0, 1].
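A quick sanity check on the shapes involved (illustrative; the score range 1-6 is inferred from `num_classes=7`):

```python
# to_categorical one-hot encodes integer labels into num_classes columns;
# with scores 1..6, column 0 simply stays unused.
print(to_categorical([1, 6], num_classes=7))
# [[0. 1. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 1.]]

# MinMaxScaler maps each feature column to [0, 1], hence the (n, 1) shape.
print(MinMaxScaler().fit_transform(np.array([[100], [250], [400]])))
# [[0. ]
#  [0.5]
#  [1. ]]
```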
def create_model():
    # Text branch: embedding + stacked LSTMs over the padded sequences
    input_lstm = Input(shape=(max_len,))
    text_ai_input = Embedding(input_dim=10000, input_length=max_len, output_dim=128)(input_lstm)
    text_ai_lstm_1 = LSTM(128, return_sequences=True)(text_ai_input)
    text_ai_dr1 = Dropout(0.3)(text_ai_lstm_1)
    text_ai_lstm2 = LSTM(128)(text_ai_dr1)
    text_ai_dr2 = Dropout(0.2)(text_ai_lstm2)
    text_ai_dense = Dense(128, activation="relu")(text_ai_dr2)

    # Numeric branch: dense layers over the scaled word count
    linear_model_input = Input(shape=(1,))
    linear_model_dense = Dense(128, activation="relu")(linear_model_input)
    linear_model_dr = Dropout(0.2)(linear_model_dense)
    linear_model_dense2 = Dense(128, activation="relu")(linear_model_dr)
    linear_model_dr2 = Dropout(0.2)(linear_model_dense2)
    linear_model_dense3 = Dense(128, activation="relu")(linear_model_dr2)

    # Merge both branches and map to the 7 score classes
    concated_layer = Concatenate()([text_ai_dense, linear_model_dense3])
    concated_dense = Dense(256, activation="relu")(concated_layer)
    concated_dr = Dropout(0.2)(concated_dense)
    concated_dense2 = Dense(128)(concated_dr)
    out = Dense(7, activation="softmax")(concated_dense2)

    model = Model(inputs=[input_lstm, linear_model_input], outputs=out)
    model.summary()
    return model
Creating a combined neural network model with:
- Text model: LSTM layers to process the essay texts.
- Linear model: Dense layers to process the word count feature.
- Concatenation: Combining the outputs of both models.
model = create_model()
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
txt_train, txt_test, X_train, X_test, y_train, y_test = train_test_split(padded, X, y, train_size=0.8, random_state=42)
cp = ModelCheckpoint(filepath="essay", save_best_only=True, mode="min", monitor="val_loss")
model.fit([txt_train, X_train], y_train, epochs=100, validation_data=([txt_test,X_test], y_test), callbacks=[cp])
- Model Compilation: Using categorical crossentropy loss and Adam optimizer.
- Data Splitting: Splitting data into training and testing sets.
- Model Checkpoint: Saving the best model based on validation loss.
- Model Training: Training the model for 100 epochs.
h = model.history
history_df = pd.DataFrame(h.history)
history_df.plot(title="Model Saved on 11th epoch")
Plotting the training history.
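Since the plot title suggests the best validation loss arrived around epoch 11, an EarlyStopping callback could avoid the remaining wasted epochs. A minimal sketch (not in the original code; the patience value is an assumption):

```python
from keras.callbacks import EarlyStopping

# Stop once val_loss hasn't improved for 10 epochs (assumed patience)
# and roll back to the best weights seen so far.
es = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
model.fit([txt_train, X_train], y_train, epochs=100,
          validation_data=([txt_test, X_test], y_test),
          callbacks=[cp, es])
```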
## REMEMBER
# Save tokenizer
# save minmaxscaler
# save model
Reminder to save the tokenizer, MinMaxScaler, and model.
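One way to act on that reminder; a minimal sketch (file names are illustrative, and the save format depends on the Keras version):

```python
import pickle

# Persist the preprocessing objects next to the model so inference
# can reproduce the exact same transformations.
with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)
model.save("essay_model.keras")  # or "essay_model.h5" on older Keras
```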
This code provides a complete pipeline for predicting essay scores from text content and word count: data loading, preprocessing, exploratory analysis, model building, training, and evaluation. An LSTM branch processes the essay text while a dense branch processes the word-count feature, and the two are merged for the final softmax classification.
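For completeness, a hypothetical inference sketch that reloads those artifacts and scores a new essay (assumes the file names above, the same text cleaning, and that max_len is stored alongside the model):

```python
import pickle
import re
import numpy as np
from keras.models import load_model
from keras.utils import pad_sequences

with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)
with open("scaler.pkl", "rb") as f:
    scaler = pickle.load(f)
model = load_model("essay_model.keras")

# Apply the same cleaning, tokenization, padding, and scaling as training.
essay = re.sub(r"[^a-zA-Z0-9 ]", "", "An example essay to score.")
seq = pad_sequences(tokenizer.texts_to_sequences([essay]),
                    maxlen=max_len, padding="post")
word_num = scaler.transform(np.array([[len(essay.split())]]))
pred = model.predict([seq, word_num])
print("Predicted score:", int(np.argmax(pred, axis=1)[0]))
```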