Is that a duplicate Quora question?

"Is that a duplicate Quora question?" is a Machine Learning Project which helps to Identify which questions asked on Quora are duplicates of questions that have already been asked.




Overview

Quora is a place to gain and share knowledge—about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world.

Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

Credits: Kaggle

Motivation

What could be a better way to use an unfortunate lockdown period than to spend the time productively? Like most of you, I spent my weekends on YouTube, Netflix, coding, and reading research papers. The idea of working on "Is that a duplicate Quora question?" struck me while browsing research papers, especially after I found a YouTube video by Kaggle grandmaster Abhishek Thakur on this topic, together with a relevant research paper. That led me to collect the "Is that a duplicate Quora question?" dataset and train a machine learning model on it.

Sources/Useful Links

Problem Statement

  • Identify which questions asked on Quora are duplicates of questions that have already been asked.
  • This could be useful to instantly provide answers to questions that have already been answered.
  • We are tasked with predicting whether a pair of questions are duplicates or not.

Solution

Suppose we have a fairly large data set of question-pairs that has been labeled (by humans) as “duplicate” or “not duplicate.” We could then use natural language processing (NLP) techniques to extract the difference in meaning or intent of each question-pair, use machine learning (ML) to learn from the human-labeled data, and predict whether a new pair of questions is duplicate or not.

Which type of ML Problem is this?

It is a binary classification problem: for a given pair of questions, we need to predict whether they are duplicates or not.

What is the best performance metric for this Problem?

  • log-loss: https://www.kaggle.com/wiki/LogarithmicLoss
    • Q: Why is log-loss the right metric for this problem?

      A: Although this is a binary classification problem, we do not just want a hard "0" or "1" output; we want p(q1 ≈ q2), a probability between 0 and 1. When a model outputs probabilities for a binary classification problem, log-loss is one of the best metrics, since it heavily penalizes confident but wrong predictions. A minimal example of computing it is shown after this list.

  • Binary Confusion Matrix
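
A minimal sketch of computing log-loss with scikit-learn (the labels and probabilities below are made up purely for illustration):

    import numpy as np
    from sklearn.metrics import log_loss

    # Hypothetical ground-truth labels and predicted probabilities p(q1 ≈ q2)
    y_true = np.array([0, 1, 1, 0])
    y_prob = np.array([0.10, 0.85, 0.60, 0.30])

    # log-loss = -mean(y*log(p) + (1-y)*log(1-p)); lower is better
    print(log_loss(y_true, y_prob))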

Business Objectives and Constraints

  1. The cost of a misclassification can be very high.
  2. You would want the probability that a pair of questions is duplicate, so that you can choose a threshold of your choice.
    • Q: Why do we want to choose our own threshold?

      A: We want p(q1 ≈ q2), a probability between 0 and 1, so we can pick a threshold above which we are confident enough to declare q1 ≈ q2 (see the small sketch after this list).

    • Example: If we choose a threshold of 0.95, we declare q1 ≈ q2 only when p > 0.95.
    • Benefit of a tunable threshold: if we set the threshold at 0.95 and human reviewers report that the surfaced answers are wrong for the question, we can raise or lower the threshold without retraining the model.
  3. No strict latency concerns.
  4. Interpretability is partially important.
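
A tiny sketch of what choosing such a threshold looks like in code (the probabilities and the 0.95 cut-off are only illustrative):

    import numpy as np

    probs = np.array([0.97, 0.40, 0.96, 0.80])   # hypothetical p(q1 ≈ q2) values from a model
    threshold = 0.95                              # business-chosen operating point
    is_duplicate = (probs > threshold).astype(int)
    print(is_duplicate)                           # [1 0 1 0]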

Data Overview

  • Data will be in a file Train.csv
  • Train.csv contains 6 columns: id, qid1, qid2, question1, question2, is_duplicate
  • Number of rows in Train.csv = 404,290

Example Data point

| id | qid1 | qid2 | question1 | question2 | is_duplicate |
|----|------|------|-----------|-----------|--------------|
| 0  | 1    | 2    | What is the step by step guide to invest in share market in India? | What is the step by step guide to invest in share market? | 0 |
| 1  | 3    | 4    | What is the story of Kohinoor (Koh-i-Noor) Diamond? | What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? | 0 |

Train and Test ratio

We build the train and test sets by randomly splitting the data in a 70:30 (or 60:40) ratio, whichever we choose, since we have enough points to work with. A minimal splitting sketch is shown below.
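
A minimal sketch of such a random split with scikit-learn (assuming the data has been loaded into a pandas DataFrame named df_train with the columns listed in the Data Overview):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df_train = pd.read_csv('Train.csv')

    # 70:30 random split, stratified on the label so both sets keep the same class balance
    X = df_train[['question1', 'question2']]
    y = df_train['is_duplicate']
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                              stratify=y, random_state=42)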

Agenda

1. Analyzing the Data (EDA)

  • Some analysis of the train data set is given below.

  • A closer look at the data set (at the question-pair level):

    • Output:

      • (1). Total number of question pairs for training: 404290
      • (2). Question pairs that are not similar (is_duplicate = 0): 63.08%
      • (3). Question pairs that are similar (is_duplicate = 1): 36.92%
    • The distribution above is plotted with:

        df_train.groupby("is_duplicate")["id"].count().plot.bar()


      From this graph we can clearly see that the negative class (is_duplicate = 0) has more question pairs than the positive class (is_duplicate = 1), so we can treat this as an imbalanced data set.

  • Next, a closer look at the number of unique questions:

    • Output:

      • (1). Total number of unique questions: 537933
      • (2). Number of unique questions that appear more than once: 111780 (20.7%)
      • (3). Maximum number of times a single question is repeated: 157
    • Plotting the number of occurrences of each question:

      Most questions appear only a few times; very few questions appear several times, and a handful appear many times. The most repeated question appears 157 times. A short pandas sketch reproducing these counts follows.
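
A short sketch of how these counts can be reproduced with pandas (assuming the training data is loaded as df_train with the columns listed in the Data Overview):

    import pandas as pd

    df_train = pd.read_csv('Train.csv')

    # Class balance of the 404,290 question pairs
    print(df_train['is_duplicate'].value_counts(normalize=True) * 100)

    # Unique-question statistics over both question-id columns
    qids = pd.concat([df_train['qid1'], df_train['qid2']])
    counts = qids.value_counts()
    print('Total unique questions:', counts.shape[0])
    print('Questions appearing more than once:', (counts > 1).sum())
    print('Max occurrences of a single question:', counts.max())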

2. Basic Feature Extraction (before cleaning the data)

  • Basic Features: some simple features extracted before cleaning the data, as listed below (a small pandas sketch follows the list).
    • freq_qid1 = Frequency of qid1's
    • freq_qid2 = Frequency of qid2's
    • q1len = Length of q1
    • q2len = Length of q2
    • q1_n_words = Number of words in Question 1
    • q2_n_words = Number of words in Question 2
    • word_Common = (Number of common unique words in Question 1 and Question 2)
    • word_Total = (Total num of words in Question 1 + Total num of words in Question 2)
    • word_share = word_Common / word_Total
    • freq_q1+freq_q2 = sum total of frequency of qid1 and qid2
    • freq_q1-freq_q2 = absolute difference of frequency of qid1 and qid2
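
A sketch of how these basic features can be computed with pandas (the column names match the list above; the exact implementation in the notebook may differ):

    def common_words(row):
        # Number of unique words shared by the two questions
        w1 = set(str(row['question1']).lower().split())
        w2 = set(str(row['question2']).lower().split())
        return len(w1 & w2)

    df_train['freq_qid1'] = df_train.groupby('qid1')['qid1'].transform('count')
    df_train['freq_qid2'] = df_train.groupby('qid2')['qid2'].transform('count')
    df_train['q1len'] = df_train['question1'].str.len()
    df_train['q2len'] = df_train['question2'].str.len()
    df_train['q1_n_words'] = df_train['question1'].apply(lambda q: len(str(q).split()))
    df_train['q2_n_words'] = df_train['question2'].apply(lambda q: len(str(q).split()))
    df_train['word_Common'] = df_train.apply(common_words, axis=1)
    df_train['word_Total'] = df_train['q1_n_words'] + df_train['q2_n_words']
    df_train['word_share'] = df_train['word_Common'] / df_train['word_Total']
    df_train['freq_q1+q2'] = df_train['freq_qid1'] + df_train['freq_qid2']
    df_train['freq_q1-q2'] = (df_train['freq_qid1'] - df_train['freq_qid2']).abs()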

3. Advanced Feature Extraction (NLP and Fuzzy Features, after preprocessing the Data)

  • Before creating the advanced features, I did some preprocessing on the text data.
  • A single function computes these features, taking two parameters: Question 1 and Question 2.
  • Before going deeper into the advanced features, we need a few terms that are used in the feature definitions below.
  • Definitions:
    • Token: obtained by splitting a sentence on spaces.
    • Stop_Word: a stop word as per NLTK.
    • Word: a token that is not a stop word.
  • Features (a short sketch of computing some of them follows this list):
    • cwc_min : Ratio of common_word_count to the minimum word count of Q1 and Q2

      cwc_min = common_word_count / min(len(q1_words), len(q2_words))
    • cwc_max : Ratio of common_word_count to the maximum word count of Q1 and Q2

      cwc_max = common_word_count / max(len(q1_words), len(q2_words))
    • csc_min : Ratio of common_stop_count to the minimum stop-word count of Q1 and Q2

      csc_min = common_stop_count / min(len(q1_stops), len(q2_stops))
    • csc_max : Ratio of common_stop_count to the maximum stop-word count of Q1 and Q2

      csc_max = common_stop_count / max(len(q1_stops), len(q2_stops))
    • ctc_min : Ratio of common_token_count to the minimum token count of Q1 and Q2

      ctc_min = common_token_count / min(len(q1_tokens), len(q2_tokens))
    • ctc_max : Ratio of common_token_count to the maximum token count of Q1 and Q2

      ctc_max = common_token_count / max(len(q1_tokens), len(q2_tokens))
    • last_word_eq : Check if last word of both questions is equal or not

      last_word_eq = int(q1_tokens[-1] == q2_tokens[-1])
    • first_word_eq : Check if First word of both questions is equal or not

      first_word_eq = int(q1_tokens[0] == q2_tokens[0])
    • abs_len_diff : Abs. length difference

      abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))
    • mean_len : Average Token Length of both Questions

      mean_len = (len(q1_tokens) + len(q2_tokens))/2
    • Levenshtein Distance: Levenshtein distance measures the difference between two text sequences as the number of single-character edits (insertions, deletions, and substitutions) needed to change one sequence into the other. It is also known as "edit distance". The Python library fuzzywuzzy can be used to compute fuzzy-matching ratios based on it.

    • longest_substr_ratio : Ratio of the length of the longest common substring to the minimum token count of Q1 and Q2

      longest_substr_ratio = len(longest common substring) / min(len(q1_tokens), len(q2_tokens))
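
A sketch of computing a few of these token-based and fuzzy features for a single question pair (it uses NLTK stop words and the fuzzywuzzy library; the exact feature function in the notebook may differ):

    from fuzzywuzzy import fuzz
    from nltk.corpus import stopwords

    STOP_WORDS = set(stopwords.words('english'))
    SAFE_DIV = 0.0001  # guards against division by zero for empty questions

    def token_features(q1, q2):
        q1_tokens, q2_tokens = q1.lower().split(), q2.lower().split()
        q1_words = set(t for t in q1_tokens if t not in STOP_WORDS)
        q2_words = set(t for t in q2_tokens if t not in STOP_WORDS)
        q1_stops = set(t for t in q1_tokens if t in STOP_WORDS)
        q2_stops = set(t for t in q2_tokens if t in STOP_WORDS)

        common_word_count = len(q1_words & q2_words)
        common_stop_count = len(q1_stops & q2_stops)
        common_token_count = len(set(q1_tokens) & set(q2_tokens))

        return {
            'cwc_min': common_word_count / (min(len(q1_words), len(q2_words)) + SAFE_DIV),
            'csc_min': common_stop_count / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV),
            'ctc_max': common_token_count / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV),
            'last_word_eq': int(q1_tokens[-1] == q2_tokens[-1]),
            'abs_len_diff': abs(len(q1_tokens) - len(q2_tokens)),
            'mean_len': (len(q1_tokens) + len(q2_tokens)) / 2,
            'fuzz_token_set_ratio': fuzz.token_set_ratio(q1, q2),  # fuzzywuzzy edit-distance based ratio
        }

    print(token_features("What is machine learning?", "What is deep learning?"))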

4. Featuring text data with tf-idf weighted word-vectors (With 2 parameters of Question1 and Question2)

  • Extracted TF-IDF features for the combined question1 and question2 text and obtained features on the train data.
  • After we find TF-IDF scores, we convert each question to a weighted average of word2vec vectors by these scores.
  • Here I use the pre-trained GloVe vectors that ship with spaCy: https://spacy.io/usage/vectors-similarity
  • It is trained on Wikipedia and therefore, it is stronger in terms of word semantics.
  • Note: when reviewing this part of the code, you may wonder why the directory of the pre-trained GloVe embedding model is passed directly to the spacy.load function; this is because, due to an issue, I was unable to load the downloaded model by name.
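
A sketch of the TF-IDF weighted word-vector idea, assuming a spaCy model with word vectors such as en_core_web_lg is available (the model name, and loading it by name rather than from a local directory, are assumptions; the notebook loads the GloVe embeddings by path):

    import numpy as np
    import spacy
    from sklearn.feature_extraction.text import TfidfVectorizer

    nlp = spacy.load('en_core_web_lg')   # assumed spaCy model shipping GloVe-style vectors

    questions = (data['question1'].astype(str) + ' ' + data['question2'].astype(str)).tolist()
    tfidf = TfidfVectorizer()
    tfidf.fit(questions)
    word2tfidf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

    def tfidf_weighted_vector(text):
        # Average the word vectors of a question, weighting each token by its IDF score
        doc = nlp(str(text))
        vecs = [tok.vector * word2tfidf.get(tok.text.lower(), 0.0) for tok in doc]
        return np.mean(vecs, axis=0) if vecs else np.zeros(nlp.vocab.vectors_length)

    q1_vecs = np.vstack([tfidf_weighted_vector(q) for q in data['question1']])
    q2_vecs = np.vstack([tfidf_weighted_vector(q) for q in data['question2']])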

5. Simple tf-idf Vectorizing the Data (With 2 parameters of Question 1 and Question 2)

  • Performing simple TF-IDF vectorization on the columns 'question1' and 'question2'. The vectorizer is fitted once on the text of both columns so that ques1 and ques2 share the same vocabulary.

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer().fit(pd.concat([data['question1'], data['question2']]).values.astype('U'))
    ques1 = vectorizer.transform(data['question1'].values.astype('U'))
    ques2 = vectorizer.transform(data['question2'].values.astype('U'))

6. Word2Vec Features: Distance Features and Gensim's WmdSimilarity Features (to use WMD, we first need word embeddings; download the GoogleNews-vectors-negative300.bin.gz pre-trained embeddings (warning: about 1.5 GB))

  • Word embeddings such as Word2Vec are a key method that bridges the human understanding of language with that of a machine and are essential to solving many NLP problems. Here we discuss applications of Word2Vec to question analysis.

  • Word2Vec feature:

    • A multi-dimensional vector for every word in the vocabulary

    • Captures word semantics, which often yields useful insights

    • Very popular in natural language processing tasks

    • Google News vectors, 300 dimensions (pre-trained embeddings)

      import gensim
      import numpy as np
      from nltk.corpus import stopwords
      from nltk.tokenize import word_tokenize

      stop_words = set(stopwords.words('english'))

      # Load the pre-trained Google News Word2Vec embeddings (300 dimensions)
      model = gensim.models.KeyedVectors.load_word2vec_format(
          'Data/GoogleNews-vectors-negative300.bin.gz', binary=True)

      def sent2vec(s):
          # Lowercase and tokenize, keeping only alphabetic tokens that are not stop words
          words = word_tokenize(str(s).lower())
          words = [w for w in words if w not in stop_words and w.isalpha()]
          M = []
          for w in words:
              try:
                  M.append(model[w])      # look up the word vector
              except KeyError:
                  continue                # skip out-of-vocabulary words
          M = np.array(M)
          v = M.sum(axis=0)
          return v / np.sqrt((v ** 2).sum())   # L2-normalize the summed vector
  • Having computed Word2Vec sentence vectors, the next step is to create the distance features.

  • The similarity between questions can be computed using word-to-word (pairwise) distances, which are weighted with Word2Vec.

  • Pairwise distances: we can compute pairwise distances by picking one word (or vector) from question 1 and one from question 2. Several distance metrics can be used as features, including WMD distance, normalized WMD distance, cityblock distance, Bray-Curtis distance, cosine distance, Canberra distance, Euclidean distance, Minkowski distance, and Jaccard distance. A small sketch of a few of these follows.
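
A sketch of a few of these distance features, assuming they are computed between the sent2vec vectors defined above (scipy provides the distance functions; model.wmdistance gives the Word Mover's Distance and needs the pyemd package installed):

    from scipy.spatial.distance import braycurtis, canberra, cityblock, cosine, euclidean, minkowski

    q1 = "What is the step by step guide to invest in share market in India?"
    q2 = "What is the step by step guide to invest in share market?"
    v1, v2 = sent2vec(q1), sent2vec(q2)

    features = {
        'cosine_distance':     cosine(v1, v2),
        'cityblock_distance':  cityblock(v1, v2),
        'canberra_distance':   canberra(v1, v2),
        'euclidean_distance':  euclidean(v1, v2),
        'minkowski_distance':  minkowski(v1, v2, 3),
        'braycurtis_distance': braycurtis(v1, v2),
        # Word Mover's Distance works directly on the token lists
        'wmd': model.wmdistance(q1.lower().split(), q2.lower().split()),
    }
    print(features)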

7. Machine Learning Models:

a. Random Model


b. Logistic Regression

  • Logistic regression is a linear model for classification. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function. The logistic function is a sigmoid function, which takes any real input and outputs a value between 0 and 1, and hence is ideal for classification.

    When a model learns the training data too closely, it fails to fit new data or predict unseen observations reliably. This condition is called overfitting and is countered, in one of many ways, with ridge (L2) regularization. Ridge regularization penalizes model coefficients when they become too large, forcing them to stay small. This reduces model variance and avoids overfitting.

    Hyperparameter Tuning:

    Cross-validation is a good technique to tune model parameters like regularization factor and the tolerance for stopping criteria (for determining when to stop training). Here, a validation set is held out from the training data for each run (called fold) while the model is trained on the remaining training data and then evaluated on the validation set. This is repeated for the total number of folds (say five or 10) and the parameters from the fold with the best evaluation score are used as the optimum parameters for the model.
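
A sketch of logistic regression with L2 regularization and cross-validated hyperparameter tuning on the stacked feature matrix (X_train, y_train and the alpha grid are placeholders; the notebook uses its own values):

    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV

    # SGDClassifier with log loss is logistic regression trained with SGD;
    # 'alpha' controls the strength of the L2 (ridge) penalty.
    # (Older scikit-learn versions spell the loss as 'log' instead of 'log_loss'.)
    params = {'alpha': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1]}
    clf = GridSearchCV(SGDClassifier(loss='log_loss', penalty='l2', random_state=42),
                       params, scoring='neg_log_loss', cv=5)
    clf.fit(X_train, y_train)
    print(clf.best_params_, -clf.best_score_)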

c. Linear SVM

  • Linear SVM is an extremely fast machine learning algorithm for solving classification problems on very large data sets; efficient implementations use cutting-plane or sub-gradient methods to fit a linear support vector machine. Linear SVM is linearly scalable, meaning it builds the model in CPU time that grows roughly linearly with the size of the training data set. A small sketch follows the feature list below.
  • Features
    • Efficiency in dealing with extra-large data sets (several million training data pairs)
    • Solution of multiclass classification problems with any number of classes
    • Working with high dimensional data (thousands of features, attributes) in both sparse and dense format
    • No need for expensive computing resources (personal computer is a standard platform)
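
A sketch of a linear SVM for this problem (as mentioned above); since log-loss needs probabilities and a plain SVM only outputs scores, the sketch wraps the classifier in probability calibration (the hyperparameter values and variable names are placeholders):

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.linear_model import SGDClassifier

    # Hinge loss gives a linear SVM; calibration adds the probabilities p(q1 ≈ q2)
    # required by the log-loss metric.
    svm = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-4, random_state=42)
    clf = CalibratedClassifierCV(svm, method='sigmoid', cv=3)
    clf.fit(X_train, y_train)
    probs = clf.predict_proba(X_test)[:, 1]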

d. XGBoost

  • Stands for eXtreme Gradient Boosting. Gradient boosting is an approach in which new models are trained to predict the errors of the existing ensemble and are added sequentially until no further improvement can be made (a short training sketch follows this list).
  • There are two main reasons for using XGBoost:
    • Execution speed
    • Model performance
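
A sketch of training XGBoost on the feature matrix with log-loss as the evaluation metric (the hyperparameter values and variable names are placeholders):

    import xgboost as xgb

    params = {'objective': 'binary:logistic', 'eval_metric': 'logloss',
              'eta': 0.1, 'max_depth': 6}
    d_train = xgb.DMatrix(X_train, label=y_train)
    d_test = xgb.DMatrix(X_test, label=y_test)

    # Train with early stopping on the held-out set, monitoring test log-loss
    bst = xgb.train(params, d_train, num_boost_round=400,
                    evals=[(d_test, 'test')], early_stopping_rounds=20)
    probs = bst.predict(d_test)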

8. Results & Conclusion

  • The table below compares the test log-loss scores of all the ML models.
  • I did not use the full training data to train the algorithms; because of RAM constraints on my PC, I sampled some of the data and trained the models on it. The models and their test log-loss scores are given below.
  • In the table, Sim Fs = simple (basic) feature set and Adv Fs = advanced feature set.
| DataSet Size | Model Name          | Features                             | Hyperparameter Tuning | Test Log Loss |
|--------------|---------------------|--------------------------------------|-----------------------|---------------|
| ~ 404K       | Random              | Sim Fs + Adv Fs + TFIDF Weighted W2V | NA                    | 0.88          |
| ~ 404K       | Logistic Regression | Sim Fs + Adv Fs + TFIDF Weighted W2V | Done                  | 0.42          |
| ~ 404K       | Linear SVM          | Sim Fs + Adv Fs + TFIDF Weighted W2V | Done                  | 0.45          |
| ~ 404K       | XGBoost             | Sim Fs + Adv Fs + TFIDF Weighted W2V | NA                    | 0.35          |
| ~ 100K       | XGBoost             | Sim Fs + Adv Fs + TFIDF Weighted W2V | Done                  | 0.33          |
| ~ 202K       | Random              | Sim Fs + Adv Fs + TFIDF Simple       | NA                    | 0.88          |
| ~ 202K       | Logistic Regression | Sim Fs + Adv Fs + TFIDF Simple       | Done                  | 0.39          |
| ~ 202K       | Linear SVM          | Sim Fs + Adv Fs + TFIDF Simple       | Done                  | 0.43          |
| ~ 202K       | XGBoost             | Sim Fs + Adv Fs + TFIDF Simple       | Done                  | 0.31          |
| ~ 202K       | Random              | Sim Fs + Adv Fs + Word2Vec Features  | NA                    | 0.88          |
| ~ 202K       | Logistic Regression | Sim Fs + Adv Fs + Word2Vec Features  | Done                  | 0.40          |
| ~ 202K       | Linear SVM          | Sim Fs + Adv Fs + Word2Vec Features  | Done                  | 0.41          |
| ~ 202K       | XGBoost             | Sim Fs + Adv Fs + Word2Vec Features  | Done                  | 0.33          |
  • We can see that as the dimensionality increases (it increases with simple TF-IDF), Logistic Regression and XGBoost start to perform better, whereas Linear SVM produces its best result with Sim Fs + Adv Fs + Word2Vec features.

Technical Aspect

This project is divided into five parts:

  1. In the first part I did EDA, created the basic feature set (FS1), preprocessed the text data, created the advanced feature set using fuzzy features (FS2), featurized the text data with TF-IDF weighted word vectors (FS3), and applied ML models (Random model, Logistic Regression with hyperparameter tuning, and Linear SVM with hyperparameter tuning).

  2. In the second part I trained XGBoost with hyperparameter tuning using FS1 + FS2 + FS3.

  3. In the third part I created a simple TF-IDF vectorizer (FS4) and trained ML models (Logistic Regression with hyperparameter tuning, Linear SVM with hyperparameter tuning, and XGBoost with hyperparameter tuning) using FS1 + FS2 + FS4.

  4. In the fourth part I created distance features and Gensim's WmdSimilarity features (FS5) and trained ML models (Logistic Regression with hyperparameter tuning, Linear SVM with hyperparameter tuning, and XGBoost with hyperparameter tuning) using FS1 + FS2 + FS5.

  5. The fifth part contains the model comparison and conclusion.

Installation

The code is written in Python 3.7. If you don't have Python installed, you can find it here. If you are using a lower version of Python, you can upgrade using the pip package, ensuring you have the latest version of pip.

  • All the code for this project is available in this repository.

    How To

    1. Install Required Libraries

      pip3 install pandas
      pip3 install numpy
      pip3 install scikit-learn
      pip3 install nltk
      pip3 install tqdm
      pip3 install pyemd
      pip3 install fuzzywuzzy
      pip3 install python-levenshtein
      pip3 install --upgrade gensim
    2. Download Required Language libraries

      wget https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
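
    3. Additional resources (assumed)

      The feature-extraction code above also relies on NLTK data (tokenizer and stop words), spaCy with a vector model, and XGBoost, which are not in the list above. An assumed set of extra setup commands (the exact spaCy model name is an assumption, since the notebook loads the GloVe embeddings from a local directory):

      pip3 install spacy xgboost
      python3 -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"
      python3 -m spacy download en_core_web_lg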