- Predict if the questions asked in Quora are duplicates of the existing questions.
- This could be useful to instantly provide answers to questions that have already been answered.
- Binary Classification problem: In this case study, we predict whether a pair of questions are duplicates or not.
- The cost of a mis-classification can be high. (Customer disappointment if non-duplicates are classified as duplicates)
- Provide the probability that a pair of questions can be duplicates so that a custom threshold can be chosen.
- No-strict latency requirements (can take few seconds)
- Interpretability is partially important. (Not needed for the Customer)
- Binary Cross entropy
- Binary confusion matrix
- Data in train.csv
- Train.csv contains 5 columns : qid1, qid2, question1, question2, is_duplicate
- Size of Train.csv - 60MB
- Number of rows in Train.csv = 404,290
- qid1 : unique ID of question number 1
- qid2 : unique ID of question number 2
- question1 : Textual content of question1
- question2 : Textual content of question2
- is_duplicate : target variable
- 0 : Pair of questions are not duplicate
- 1 : Pair of questions are duplicate
Train-Test Split: We build train and test by randomly splitting in the ratio of 70:30