Data Source: Train.csv file.
Features: qid1, qid2, question1, question2, is_duplicate.
Size: 60 MB, 404,290 rows.
Tools: TensorFlow, Keras, NLTK, regular expressions, NumPy, pandas, Matplotlib, Seaborn, etc.
Load and inspect training and testing data.
Check the data's head, tail, shape, and summary information.
Examine unique questions in the dataset.
Assess data balance.
Utilize automated EDA tools for insights.
Count unique and repeated questions.
Visualize the distribution of repeated questions.
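The inspection and counting steps above can be sketched with pandas. This is a minimal illustration, not the project's actual notebook: the four-row DataFrame is a made-up stand-in for Train.csv, and the variable names (class_share, occurrences, distribution) are hypothetical.

```python
import pandas as pd

# Tiny made-up stand-in for Train.csv (assumption, for illustration only).
# On the real data this would be: df = pd.read_csv("Train.csv")
df = pd.DataFrame({
    "qid1": [1, 3, 1, 5],
    "qid2": [2, 4, 6, 2],
    "question1": ["How do I learn Python?", "What is AI?",
                  "How do I learn Python?", "Is tea healthy?"],
    "question2": ["How can I learn Python?", "What is artificial intelligence?",
                  "Why is the sky blue?", "How can I learn Python?"],
    "is_duplicate": [0, 1, 0, 1],
})

# Basic inspection: first rows, shape, column types.
print(df.head())
print(df.shape)
df.info()

# Data balance: share of each label.
class_share = df["is_duplicate"].value_counts(normalize=True)

# Unique and repeated questions: stack both id columns and count occurrences.
all_qids = pd.concat([df["qid1"], df["qid2"]], ignore_index=True)
occurrences = all_qids.value_counts()
n_unique = occurrences.size
n_repeated = int((occurrences > 1).sum())

# How many questions appear once, twice, ... (the series one would plot).
distribution = occurrences.value_counts().sort_index()

print(class_share)
print(f"unique questions: {n_unique}, repeated questions: {n_repeated}")
print(distribution)
```

On the full dataset, distribution is the natural input for a bar chart or histogram, usually with a logarithmic y-axis because most questions appear only once.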
Define X_train & y_train arrays.
Create X_test & y_test arrays.
Check for missing values and duplicates.
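A minimal sketch of the missing-value and duplicate checks, followed by building the training arrays. The small DataFrame is made up for illustration, and dropping the affected rows is an assumption: the outline only says these issues are checked, not how they are handled.

```python
import pandas as pd

# Made-up rows (assumption): one missing question and one repeated pair.
df = pd.DataFrame({
    "question1": ["a", "c", "a", "d"],
    "question2": ["b", None, "b", "e"],
    "is_duplicate": [0, 0, 0, 1],
})

text_cols = ["question1", "question2"]

# Missing values per column.
missing_per_column = df.isnull().sum()

# Rows whose question pair already appeared earlier in the frame.
n_duplicate_rows = int(df.duplicated(subset=text_cols).sum())

# One possible clean-up (assumption): drop incomplete and repeated pairs.
clean = (
    df.dropna(subset=text_cols)
      .drop_duplicates(subset=text_cols)
      .reset_index(drop=True)
)

# Feature and label arrays for training.
X_train = clean[text_cols].to_numpy()
y_train = clean["is_duplicate"].to_numpy()

print(missing_per_column)
print(f"duplicate rows: {n_duplicate_rows}")
print(X_train.shape, y_train.shape)
```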
Perform text pre-processing using Keras.
Convert the text to integer sequences and pad them to a uniform length.
Load GloVe word embeddings for semantic representation.
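The three text-processing steps above might look roughly like the sketch below. It assumes the Keras Tokenizer and pad_sequences utilities (the outline only says "Keras"; recent TensorFlow releases mark Tokenizer as deprecated in favour of TextVectorization). MAX_LEN, EMBED_DIM, and the two-line in-memory GloVe text are hypothetical stand-ins; a real run would read a GloVe text file such as glove.6B.100d.txt instead.

```python
import io

import numpy as np
import tensorflow as tf

MAX_LEN = 6     # hypothetical padded sequence length
EMBED_DIM = 3   # hypothetical; real GloVe vectors have 50 to 300 dimensions

question1 = ["How do I learn Python?", "What is machine learning?"]
question2 = ["How can I learn Python quickly?", "What is deep learning?"]

# Fit one vocabulary over both question columns.
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(question1 + question2)


def encode(texts):
    """Turn raw strings into zero-padded integer sequences of length MAX_LEN."""
    seqs = tokenizer.texts_to_sequences(texts)
    return tf.keras.preprocessing.sequence.pad_sequences(
        seqs, maxlen=MAX_LEN, padding="post"
    )


q1_padded = encode(question1)
q2_padded = encode(question2)

# Stand-in for an opened GloVe file: one word per line, then its vector.
glove_file = io.StringIO("python 0.1 0.2 0.3\nlearning 0.4 0.5 0.6\n")


def load_glove(handle):
    """Parse GloVe-format lines into a {word: vector} dictionary."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors


glove = load_glove(glove_file)

# Row i holds the vector of the word with index i; row 0 is the padding index.
# Words missing from GloVe keep an all-zero row.
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, EMBED_DIM), dtype="float32")
for word, idx in tokenizer.word_index.items():
    vec = glove.get(word)
    if vec is not None:
        embedding_matrix[idx] = vec

print(q1_padded)
print(embedding_matrix.shape)
```

The resulting embedding_matrix can then be used to initialise a frozen Embedding layer so that both questions are represented in the same pre-trained vector space.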
Use Long Short-Term Memory (LSTM) layers as the deep learning model.
Create separate models for each question.
Merge the model outputs.
Generate a visual representation of the model.
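One way to read the architecture steps above is a two-branch network: each question gets its own embedding and LSTM, and the two encodings are concatenated before classification. The sketch below is an assumption about the layout, not the project's exact model. VOCAB_SIZE, EMBED_DIM, MAX_LEN, and the layer widths are hypothetical, and a random matrix stands in for the GloVe weights.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 1000, 50, 30  # hypothetical sizes

# Random stand-in for the GloVe embedding matrix (assumption).
embedding_matrix = np.random.rand(VOCAB_SIZE, EMBED_DIM).astype("float32")


def question_branch(name):
    """Build one branch: token ids -> frozen embeddings -> LSTM encoding."""
    inp = layers.Input(shape=(MAX_LEN,), dtype="int32", name=f"{name}_input")
    x = layers.Embedding(
        VOCAB_SIZE,
        EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False,
        name=f"{name}_embedding",
    )(inp)
    x = layers.LSTM(64, name=f"{name}_lstm")(x)
    return inp, x


# Separate branch for each question.
q1_in, q1_enc = question_branch("question1")
q2_in, q2_enc = question_branch("question2")

# Merge the branch outputs and classify the pair.
merged = layers.Concatenate(name="merge")([q1_enc, q2_enc])
hidden = layers.Dense(64, activation="relu")(merged)
# Two softmax units so that integer labels 0/1 work with
# sparse categorical cross-entropy.
output = layers.Dense(2, activation="softmax", name="is_duplicate")(hidden)

model = tf.keras.Model(inputs=[q1_in, q2_in], outputs=output)

# Text representation of the model. A diagram can be drawn with
# tf.keras.utils.plot_model(model, show_shapes=True), which additionally
# requires pydot and Graphviz to be installed.
model.summary()
```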
Use the Adam optimizer with sparse categorical cross-entropy loss.
Train the model with specified batch size and epochs.
Visualize training progress using loss and accuracy plots.
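The compile, fit, and plotting steps can be sketched as follows. Everything here is a stand-in chosen so the example runs in seconds: the data is random, the model is a deliberately tiny two-branch network, and BATCH_SIZE, EPOCHS, the learning rate, and the validation split are hypothetical values, not the project's settings.

```python
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

VOCAB_SIZE, MAX_LEN = 100, 10   # hypothetical sizes
BATCH_SIZE, EPOCHS = 16, 2      # hypothetical training settings


def build_model():
    """Tiny stand-in for the two-branch LSTM model."""
    q1 = layers.Input(shape=(MAX_LEN,), dtype="int32")
    q2 = layers.Input(shape=(MAX_LEN,), dtype="int32")
    e1 = layers.LSTM(8)(layers.Embedding(VOCAB_SIZE, 8)(q1))
    e2 = layers.LSTM(8)(layers.Embedding(VOCAB_SIZE, 8)(q2))
    out = layers.Dense(2, activation="softmax")(layers.Concatenate()([e1, e2]))
    return tf.keras.Model([q1, q2], out)


# Random token ids and labels in place of the padded question sequences.
rng = np.random.default_rng(0)
X1 = rng.integers(1, VOCAB_SIZE, size=(64, MAX_LEN))
X2 = rng.integers(1, VOCAB_SIZE, size=(64, MAX_LEN))
y = rng.integers(0, 2, size=(64,))

model = build_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="sparse_categorical_crossentropy",  # expects integer labels 0/1
    metrics=["accuracy"],
)

history = model.fit(
    [X1, X2], y,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_split=0.25,
    verbose=0,
)

# Loss and accuracy curves for the training and validation splits.
fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
for metric, ax in (("loss", ax_loss), ("accuracy", ax_acc)):
    ax.plot(history.history[metric], label="train")
    ax.plot(history.history[f"val_{metric}"], label="validation")
    ax.set_xlabel("epoch")
    ax.set_ylabel(metric)
    ax.legend()
fig.tight_layout()
# Call plt.show() or fig.savefig("training_curves.png") to view the plots.
```

A widening gap between the training and validation curves is the usual sign of overfitting, which is one reason to plot both.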
Prediction Using Test Data:
Generate predictions using pre-processed test data.
Save the trained model in HDF5 format (.h5 extension).
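A minimal sketch of prediction and saving, under stated assumptions: the model below is a small untrained stand-in, the test inputs are random token ids, and the file name duplicate_model.h5 is hypothetical. Newer Keras releases still accept the .h5 extension but treat it as a legacy format.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

layers = tf.keras.layers

VOCAB_SIZE, MAX_LEN = 100, 10  # hypothetical sizes

# Small untrained stand-in for the trained two-branch model (assumption).
q1 = layers.Input(shape=(MAX_LEN,), dtype="int32")
q2 = layers.Input(shape=(MAX_LEN,), dtype="int32")
e1 = layers.LSTM(8)(layers.Embedding(VOCAB_SIZE, 8)(q1))
e2 = layers.LSTM(8)(layers.Embedding(VOCAB_SIZE, 8)(q2))
out = layers.Dense(2, activation="softmax")(layers.Concatenate()([e1, e2]))
model = tf.keras.Model([q1, q2], out)

# Random ids in place of the pre-processed (tokenized, padded) test questions.
rng = np.random.default_rng(0)
q1_test = rng.integers(1, VOCAB_SIZE, size=(5, MAX_LEN))
q2_test = rng.integers(1, VOCAB_SIZE, size=(5, MAX_LEN))

# Class probabilities per pair, then the most likely label (1 = duplicate).
probs = model.predict([q1_test, q2_test], verbose=0)
pred_labels = probs.argmax(axis=1)

# Save with the .h5 extension and reload for inference only.
path = os.path.join(tempfile.mkdtemp(), "duplicate_model.h5")
model.save(path)
restored = tf.keras.models.load_model(path, compile=False)
restored_probs = restored.predict([q1_test, q2_test], verbose=0)

print(pred_labels)
```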
Understanding the business problem.
Choosing appropriate text processing techniques.
Dealing with lengthy training times.
This outline covers the full process, from data loading to model evaluation, including the key steps and challenges encountered in the project.