Quora-question-pair-similarity

Problem Statement

  • Identify which questions asked on Quora are duplicates of questions that have already been asked.
  • This could be useful to instantly provide answers to questions that have already been answered.
  • We are tasked with predicting whether a pair of questions are duplicates or not.

Data Overview

- Data will be in a file Train.csv
- Train.csv contains 5 columns : qid1, qid2, question1, question2, is_duplicate
- Size of Train.csv - 60MB
- Number of rows in Train.csv = 404,290

Type of Machine Leaning Problem

It is a binary classification problem, for a given pair of questions we need to predict if they are duplicate or not.

Performance Metric

  • log-loss
  • Binary Confusion Matrix

Exploratory Data Analysis

Basic Feature Extraction

ML models applied

Binary classification done using Logistic Regression, Linear SVM with hyperparameter tuning and XGBoost model.