This project builds a system for learning question similarity using Siamese Neural Networks. The goal is to address the question retrieval task by identifying whether a given pair of questions is similar or dissimilar. This is particularly valuable on platforms like Quora and Stack Overflow, where users benefit from finding existing answers to similar questions.
The project utilizes the Quora Question Pairs Dataset for training and testing the model. The dataset can be downloaded from this link.
The initial analysis notebook gives an overview of the project and dataset. Key findings:
- Input: Two free-text fields, presumably questions.
- Output: A similarity score between 0 and 1, indicating the degree of similarity between the questions.
- 400k question pairs.
- 150k examples of duplicate questions.
- 250k examples of non-duplicate questions.
- 350k distinct questions - some questions are frequently repeated.
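Counts like these can be reproduced with a few lines of standard-library Python. The sketch below is illustrative only: it assumes the CSV has `question1`, `question2`, and `is_duplicate` columns, and it runs on two made-up rows rather than the real file.

```python
import csv
import io

def summarize_pairs(csv_file):
    """Count pairs, duplicates, and distinct questions in a question-pairs CSV.

    Assumes columns named question1, question2, and is_duplicate (0 or 1).
    """
    total = 0
    duplicates = 0
    distinct = set()
    for row in csv.DictReader(csv_file):
        total += 1
        duplicates += int(row["is_duplicate"])
        distinct.add(row["question1"])
        distinct.add(row["question2"])
    return {
        "pairs": total,
        "duplicates": duplicates,
        "non_duplicates": total - duplicates,
        "distinct_questions": len(distinct),
    }

# Tiny in-memory sample standing in for the real dataset (made-up rows).
sample = io.StringIO(
    "question1,question2,is_duplicate\n"
    "How do I learn Python?,What is the best way to learn Python?,1\n"
    "How do I learn Python?,How do I cook rice?,0\n"
)
stats = summarize_pairs(sample)
print(stats)
```

To run it on the downloaded dataset, pass an open file handle instead of the `io.StringIO` sample.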
The main code notebook implements the core components of the project. The major sections are:
The text preprocessing steps include:
- Changing all words to lowercase.
- Removing punctuation and expanding abbreviations.
- Tokenizing the words.
- Removing stop words.
- Removing numbers.
- Creating a dictionary that maps each unique word to an integer index.
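The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's code: the `ABBREVIATIONS` and `STOP_WORDS` tables are tiny hypothetical stand-ins, the function names are made up for this example, and reserving index 0 for padding is an assumption.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
ABBREVIATIONS = {"what's": "what is", "can't": "cannot", "i'm": "i am"}
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "in", "do", "i", "what", "how"}

def preprocess(text):
    """Lowercase, expand abbreviations, strip punctuation, tokenize, filter."""
    text = text.lower()
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # Replace anything that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    # Drop stop words and purely numeric tokens.
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]

def build_vocab(token_lists):
    """Map each unique word to an integer index, starting at 1 (0 = padding)."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab

def encode(tokens, vocab, max_len):
    """Convert tokens to indices, then truncate or zero-pad to max_len."""
    ids = [vocab[t] for t in tokens if t in vocab][:max_len]
    return ids + [0] * (max_len - len(ids))

q1 = preprocess("What's the best way to learn Python in 2023?")
q2 = preprocess("How do I learn to cook rice?")
vocab = build_vocab([q1, q2])
print(q1, q2, encode(q2, vocab, 4))
```

A real pipeline would likely use a fuller stop-word list (for example from NLTK) and a larger abbreviation table; the structure stays the same.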
This part prepares the embedding layer and defines the Siamese Network architecture. Pre-trained GloVe embeddings convert each word into a 300-dimensional vector. The architecture involves:
- Input layers for question pairs.
- Embedding layers with pre-trained embeddings.
- Bidirectional LSTM layers for both questions.
- Dot layer that computes the cosine similarity between the two question encodings.
- Batch normalization and dropout layers.
- Dense layers for prediction.
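One way to wire these layers together is sketched below with the Keras functional API. The choice of Keras is an assumption based on the layer names, and so are the layer sizes, the sequence length, the dropout rate, the shared encoder weights, and the frozen embeddings. `build_embedding_matrix` and `build_siamese` are hypothetical helper names; the first expects lines in the GloVe text format (`word v1 v2 ... vd`).

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_embedding_matrix(vocab, glove_lines, dim=300):
    """Fill row vocab[word] with its GloVe vector; unknown words stay zero."""
    matrix = np.zeros((len(vocab) + 1, dim), dtype="float32")  # row 0 = padding
    for line in glove_lines:
        word, *values = line.rstrip().split(" ")
        if word in vocab and len(values) == dim:
            matrix[vocab[word]] = np.asarray(values, dtype="float32")
    return matrix

def build_siamese(embedding_matrix, max_len=30, lstm_units=64, dropout=0.2):
    """Siamese BiLSTM whose two branches share embedding and encoder weights."""
    vocab_size, dim = embedding_matrix.shape
    q1 = keras.Input(shape=(max_len,), dtype="int32", name="question1")
    q2 = keras.Input(shape=(max_len,), dtype="int32", name="question2")

    # Frozen embedding layer initialised from the pre-trained vectors.
    embed = layers.Embedding(
        vocab_size,
        dim,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False,
    )
    # The same encoder instance is applied to both questions.
    encoder = layers.Bidirectional(layers.LSTM(lstm_units))
    v1 = encoder(embed(q1))
    v2 = encoder(embed(q2))

    # normalize=True L2-normalises both inputs, so the dot product is the cosine.
    cosine = layers.Dot(axes=1, normalize=True)([v1, v2])
    x = layers.BatchNormalization()(cosine)
    x = layers.Dropout(dropout)(x)
    x = layers.Dense(16, activation="relu")(x)
    score = layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(inputs=[q1, q2], outputs=score)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Toy usage with 4-dimensional made-up vectors instead of real GloVe data.
toy_vocab = {"learn": 1, "python": 2}
toy_matrix = build_embedding_matrix(toy_vocab, ["learn 1 0 0 0", "rice 0 1 0 0"], dim=4)
toy_model = build_siamese(toy_matrix, max_len=5, lstm_units=8)
toy_scores = toy_model.predict(
    [np.array([[1, 2, 0, 0, 0]]), np.array([[2, 1, 0, 0, 0]])], verbose=0
)
```

With real data, `glove_lines` would be an open handle on a GloVe text file and `dim` would be 300; the sigmoid output is the similarity score between 0 and 1.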
The Siamese Network model was trained for 5 epochs with the Binary Cross-Entropy loss function and reached a validation accuracy of approximately 82.36%.
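For reference, the loss and the accuracy metric can be written out in a few lines of plain Python. This is only a sketch of the standard definitions; the clipping constant `eps` and the 0.5 decision threshold are assumptions, and the scores in the usage lines are made up.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted scores."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of pairs whose thresholded score matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Made-up labels and scores for illustration.
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
acc = accuracy([1, 0, 1], [0.9, 0.2, 0.4])
print(loss, acc)
```

A score of 0.5 on every pair gives a loss of ln 2, which is a useful baseline when reading training logs.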
- Initial Analysis: Review the findings and insights from the initial analysis notebook to understand the dataset and project requirements.
- Main Code: Follow the steps in the main code notebook to preprocess text, prepare the embedding layer, define the Siamese Network architecture, and train the model.
- Model Evaluation: Assess the model's performance on your specific problem and dataset. Experiment with hyperparameters and architectures for potential improvements.
- Inference: Use the trained model for predicting the similarity of new question pairs.
Manhattan Siamese LSTM for Question Retrieval in Community Question Answering