Data Source: Train.csv file.
Features: qid1, qid2, question1, question2, is_duplicate.
Size: 60MB, 404,290 rows.
Libraries: TensorFlow, Keras, NLTK, regular expressions, NumPy, Pandas, Matplotlib, Seaborn, etc.
- Load and inspect the training and testing data.
- Check the data's head, tail, shape, and info.
- Examine the unique questions in the dataset.
- Assess the balance between duplicate and non-duplicate pairs.
- Use automated EDA tools for additional insights.
- Count unique and repeated questions.
- Visualize the distribution of repeated questions (see the sketch after this list).
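A minimal sketch of the loading and EDA steps, assuming the file is named train.csv and sits in the working directory; the column names follow the dataset description above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the training data (file path is an assumption).
df = pd.read_csv("train.csv")

# Basic inspection: head, tail, shape, info.
print(df.head())
print(df.tail())
print(df.shape)          # 404,290 rows per the dataset description
df.info()

# Class balance: duplicate vs. non-duplicate pairs.
print(df["is_duplicate"].value_counts(normalize=True))

# Unique vs. repeated questions across both question-id columns.
all_qids = pd.concat([df["qid1"], df["qid2"]])
print("Unique questions:", all_qids.nunique())
print("Repeated questions:", (all_qids.value_counts() > 1).sum())

# Distribution of how often questions repeat (log scale for readability).
plt.hist(all_qids.value_counts(), bins=50, log=True)
plt.xlabel("Number of occurrences of a question")
plt.ylabel("Number of questions (log scale)")
plt.title("Repeated questions distribution")
plt.show()
```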
- Define the X_train and y_train arrays.
- Create the X_test and y_test arrays (see the sketch below).
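A minimal sketch of building these arrays, continuing from the DataFrame loaded above. The outline does not say how the test portion is obtained, so an 80/20 split with scikit-learn's train_test_split is assumed here; the questions are kept as two parallel arrays (q1, q2) rather than a single X because the model later takes one input per question.

```python
from sklearn.model_selection import train_test_split

# X = the two question columns, y = the duplicate label (missing questions cast to str).
q1 = df["question1"].astype(str).values
q2 = df["question2"].astype(str).values
y = df["is_duplicate"].values

# Hold out 20% as a test set (split ratio and random_state are assumptions).
q1_train, q1_test, q2_train, q2_test, y_train, y_test = train_test_split(
    q1, q2, y, test_size=0.2, random_state=42
)
```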
- Check for missing values and duplicate rows.
- Perform text pre-processing using the Keras Tokenizer.
- Convert the questions to integer sequences and pad them to a fixed length (see the sketch below).
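A minimal sketch of the Keras text pre-processing, continuing from the arrays above; the vocabulary size and maximum sequence length are illustrative assumptions.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Missing values and exact duplicate rows (checked on the DataFrame loaded earlier).
print(df.isnull().sum())
print("Duplicate rows:", df.duplicated().sum())

MAX_WORDS = 50000   # vocabulary size (assumption)
MAX_LEN = 30        # maximum question length in tokens (assumption)

# Fit the tokenizer on the training questions only.
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(list(q1_train) + list(q2_train))

def to_padded(texts):
    """Convert raw questions to fixed-length integer sequences."""
    return pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)

q1_train_pad, q2_train_pad = to_padded(q1_train), to_padded(q2_train)
q1_test_pad, q2_test_pad = to_padded(q1_test), to_padded(q2_test)
```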
- Load GloVe word embeddings for semantic representation (see the sketch below).
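A minimal sketch of loading GloVe vectors and building an embedding matrix aligned with the tokenizer; the 100-dimensional glove.6B file is an assumption, and any pre-trained GloVe file works the same way.

```python
import numpy as np

EMBEDDING_DIM = 100
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:   # file name/variant is an assumption
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Rows of the matrix line up with the tokenizer's word indices.
vocab_size = min(MAX_WORDS, len(tokenizer.word_index) + 1)
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
for word, i in tokenizer.word_index.items():
    if i < vocab_size and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]
```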
- Use a Long Short-Term Memory (LSTM) network as the deep-learning model.
- Create a separate model branch for each question.
- Merge the outputs of the two branches.
- Generate a visual representation of the model (see the sketch after this list).
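One possible way to realise this two-branch architecture: each question gets its own GloVe-initialised Embedding and LSTM encoder, the two outputs are concatenated, and a small dense head produces the two-class prediction. Layer sizes are illustrative assumptions, and plot_model needs pydot and Graphviz installed.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.initializers import Constant
from tensorflow.keras.utils import plot_model

def encode(inputs):
    """A separate Embedding + LSTM branch per question (weights are not shared)."""
    x = Embedding(vocab_size, EMBEDDING_DIM,
                  embeddings_initializer=Constant(embedding_matrix),
                  trainable=False)(inputs)
    return LSTM(64)(x)

q1_in = Input(shape=(MAX_LEN,), name="question1")
q2_in = Input(shape=(MAX_LEN,), name="question2")

# Merge the two branch outputs and classify duplicate vs. not duplicate.
merged = concatenate([encode(q1_in), encode(q2_in)])
hidden = Dense(64, activation="relu")(merged)
output = Dense(2, activation="softmax")(hidden)

model = Model(inputs=[q1_in, q2_in], outputs=output)

# Visual representation of the architecture.
plot_model(model, to_file="model.png", show_shapes=True)
```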
- Use the Adam optimizer with sparse categorical cross-entropy loss.
- Train the model with the specified batch size and number of epochs.
- Visualize training progress with loss and accuracy plots (see the sketch below).
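A minimal sketch of compiling, training, and plotting the learning curves; the batch size, epoch count, and validation split are illustrative assumptions, since the outline does not pin them down.

```python
import matplotlib.pyplot as plt

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit([q1_train_pad, q2_train_pad], y_train,
                    validation_split=0.1,   # assumption
                    batch_size=256,         # assumption
                    epochs=10)              # assumption

# Loss and accuracy curves over the training run.
fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4))
ax_loss.plot(history.history["loss"], label="train loss")
ax_loss.plot(history.history["val_loss"], label="val loss")
ax_loss.legend()
ax_acc.plot(history.history["accuracy"], label="train accuracy")
ax_acc.plot(history.history["val_accuracy"], label="val accuracy")
ax_acc.legend()
plt.show()
```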
Prediction Using Test Data:
- Generate predictions using the pre-processed test data.
- Save the trained model with the .h5 extension (see the sketch below).
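A minimal sketch of the prediction and saving steps, continuing from the padded test arrays above; the output file name is an assumption.

```python
import numpy as np

# Predict on the pre-processed test questions and take the argmax class.
probs = model.predict([q1_test_pad, q2_test_pad])
preds = np.argmax(probs, axis=1)            # 1 = duplicate, 0 = not duplicate
print("Test accuracy:", (preds == y_test).mean())

# Persist the trained model in HDF5 format (file name is an assumption).
model.save("quora_duplicate_lstm.h5")
```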
Challenges:
- Understanding the business problem.
- Choosing appropriate text-processing techniques.
- Dealing with lengthy training times.
This outline covers the entire process, from data loading to model evaluation, and summarizes the key steps and challenges encountered in the project.