Quora Insinscere question detection

This is a notebook to explore the challenge of deecting if a given question on quora is sinscere or not. More details can be found at - https://www.kaggle.com/c/quora-insincere-questions-classification/data

Why - I found this problem really intterestin because insinscerity is a very subtle trait to detect. It is hard to come up with simple rules involving presence/absence of phrases or patters that can detect insinscerity. It would be interesting to see what LSTMs can achieve given the complex nature of the problem.

Challenges - THe good thing is there is plenty of data - around 1.3 mn examples for training models. The tricky part is that I do not have the hardware capability to train on such a huge dataset so I will be using a small subset. However my intent is not to achieve state of the art performance but to understand and explore LSTMs.

The approach

Rough outline of the approaches I plan to exxpreiment with -

Use linear models on sentence embeddings, exploring various approaches to get sentence embeddings from word embeddings
Use NN on sentence embeddings
Use LSTM on word embeddings

Word embeddings are very helpful and will help transform our examples into a desne and meaningful form. Linear models on top of these will give the baseline perofmrnace that we hope will be surpassed by deep leaning methods. Simple deep fully connected netwrk on top of word embeddings will capture the non-linear relationships that the linear model would have missed Finally an LSTM will capture sequential patterns in the example text which neither of the previous methods do. In a problem as subtle as this - I hypothesise that an LSTM(if trained right) should outperform both the above methods.

# Setup - loading libraries and data
import pandas as pd
import numpy as np
import utils
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,f1_score
%matplotlib inline

train = pd.read_csv("data/train.csv" , index_col = "qid")

train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1306122 entries, 00002165364db923c7e6 to ffffed09fedb5088744a
Data columns (total 2 columns):
question_text    1306122 non-null object
target           1306122 non-null int64
dtypes: int64(1), object(1)
memory usage: 29.9+ MB

train.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	question_text	target
qid
00002165364db923c7e6	How did Quebec nationalists see their province...	0
000032939017120e6e44	Do you have an adopted dog, how would you enco...	0
0000412ca6e4628ce2cf	Why does velocity affect time? Does velocity a...	0
000042bf85aa498cd78e	How did Otto von Guericke used the Magdeburg h...	0
0000455dfa3e01eae3af	Can I convert montra helicon D to a mountain b...	0

(train['target'].value_counts() / train.shape[0])

0    0.93813
1    0.06187
Name: target, dtype: float64

The data

Each observation in our data is a questions asked on quora and a target variable which is an integer indicating if the questiong was insinscere or not. We have aeounf 1.3 mn examples. No missing values in either field. Data set is imbalanced. Only around 6.1% of the total questions we have are insinscere questions. Such imbalance has repurcussions on how we train and evaluate our model. For ex. a 'dumb' model that always predicts that the question is not insinscere will be right ~94% of the times on our training data. Any model we train should be able to do better than at least that.

# Train test split
train_df, test_df = train_test_split(train , train_size = 0.7, random_state = 1)

/Users/dhruv/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_split.py:2179: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.
  FutureWarning)

Using sentence embeddings

Here I use Glove 50 dimnesional embeddings that are available for download for free. Higher dimnesional embeddings are definitely more helpful but increase computation requirements. Glove embeddings map each word to a 50 d numerical vector. To map a sentence(a question in our case) to a 50 d vector - I simply take the average od embeddings for each word in the sentence.

# Loading word embeddings
words_to_index, index_to_words, word_to_vec_map = utils.read_glove_vecs('data/glove.6B.50D.txt')
embed_size = word_to_vec_map['potatoes'].shape[0]

# Getting sentence embeddings
embed_feature_matrix = np.zeros((train.shape[0] , embed_size))

for i in range(train.shape[0]):
    avg = np.zeros(embed_size)
    sentence = train['question_text'].iloc[i]
    for w in sentence.split():
        avg = (avg + word_to_vec_map[w.lower()]) if w in words_to_index.keys() else avg
    avg = avg/len(sentence.split())
    embed_feature_matrix[i,:] = avg

# Example of embedding for one question
embed_feature_matrix[1,:]

array([ 0.30652914, -0.04615969,  0.03461759, -0.32862106,  0.32389972,
        0.11798259, -0.31097   , -0.17440473, -0.18743   ,  0.05849787,
        0.02323312,  0.25310656, -0.04408219, -0.06636275,  0.39459031,
        0.41205156,  0.19186312, -0.04145169,  0.17174231, -0.51833125,
       -0.02230881,  0.20463238,  0.230813  ,  0.08086213,  0.27062426,
       -1.4915    , -0.31363788, -0.16113384,  0.28794744, -0.42191206,
        2.8637    ,  0.37205625, -0.5678375 , -0.36670375,  0.00532304,
       -0.16805575,  0.08249931, -0.00493012, -0.05710663, -0.15098275,
       -0.27497237,  0.01004969,  0.20942125,  0.27642663,  0.13349706,
        0.03729038, -0.22709063,  0.06868144, -0.03569675,  0.20406713])

#fitting a linear logistic model to predict insinscere questions
logistic_model = LogisticRegression(penalty= "l2" , C = 1)
logistic_model.fit(embed_feature_matrix , train['target'])

/Users/dhruv/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)





LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

test_pred_proba_losgistic_model = logistic_model.predict_proba(embed_feature_matrix)
test_pred_losgistic_model = logistic_model.predict(embed_feature_matrix)

accuracy  = accuracy_score(train['target'] , test_pred_losgistic_model )
f1 = f1_score(train['target'] , test_pred_losgistic_model)

from sklearn.metrics import roc_auc_score 
rocauc = roc_auc_score(train['target'] ,test_pred_proba_losgistic_model[:,1])
print("ROC AUC Score -" ,rocauc)
print("Accuracy Score -" ,accuracy)
print("F1 Score -" ,f1)

ROC AUC Score - 0.8425774343251118
Accuracy Score - 0.9371528846463041
F1 Score - 0.11915441570984013

array([0.23020887, 0.17548449, 0.00072345, ..., 0.00223327, 0.00172832,
       0.04838918])

Decent performance for logistic regression with sentence embeddings. This was to be expected since insinscerity is a very subtle trait that is hard to pick up on without word sequence information - which is lost while averaging word embeddings to get sentence embeddings. We need to have a model that captures the context better.

RNN with word embeddings

from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.initializers import glorot_uniform

Using TensorFlow backend.

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.
    
    """
    
    vocab_len = len(word_to_index) + 1                  
    emb_dim = word_to_vec_map["cucumber"].shape[0]      # define dimensionality of your GloVe word vectors (= 50)
    
    emb_matrix = np.zeros((vocab_len , emb_dim))
    
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    
    embedding_layer = Embedding(vocab_len, emb_dim, trainable = False)

    embedding_layer.build((None,))

    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer

def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices corresponding to words in the sentences.

    """
    
    m = X.shape[0]                                   # number of training examples

    X_indices = np.zeros((m ,max_len ))
    
    for i in range(m):                               # loop over training examples
        
        # Convert the ith training sentence in lower case and split is into words. You should get a list of words.
        sentence_words = (X[i].lower()).split()
        sentence_words = sentence_words[:min(max_len, len(sentence_words))]

        j = 0

        for w in sentence_words:
            if w in word_to_index.keys():
                X_indices[i, j] = word_to_index[w]
                j = j+1
    
    return X_indices

def LSTM_model(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating  model's graph.

    """

    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    embeddings = embedding_layer(sentence_indices)   
    X = LSTM(256,return_sequences=False )(embeddings)
    X = Dropout(0.2)(X)
    out = Dense(16, activation="relu")(X)
    out = Dense(1, activation="sigmoid")(out)

    model = Model(inputs = sentence_indices , outputs = out)

    
    return model

sentence_lens = [len(x.split()) for x in  list(train_df['question_text'])]

import matplotlib.pyplot as plt

plt.hist(sentence_lens , bins = 100)

x = pd.DataFrame(sentence_lens).describe(percentiles = [0,0.1 , 0.9,0.99])

maxLen = int(x.loc['99%'][0])

Going to cap sentence length to 40.

model = LSTM_model((maxLen,), word_to_vec_map, words_to_index)
model.summary()

WARNING:tensorflow:From /Users/dhruv/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /Users/dhruv/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 39)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 39, 50)            20000050  
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               314368    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                4112      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 17        
=================================================================
Total params: 20,318,547
Trainable params: 318,497
Non-trainable params: 20,000,050
_________________________________________________________________

from keras import backend as K

def recall_m(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

def precision_m(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return y_pred


model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy',f1_m])

X_train_indices = sentences_to_indices(np.asarray(train_df['question_text']), words_to_index, maxLen)
Y_train = np.asarray(train_df['target'])

X_train_indices.shape

(914285, 39)

history = model.fit(x= X_train_indices[:100000, ], y= Y_train[:100000], batch_size=32, epochs=1, verbose=1, callbacks=None, validation_split=0.2)

Train on 80000 samples, validate on 20000 samples
Epoch 1/1
80000/80000 [==============================] - 169s 2ms/step - loss: 0.2317 - acc: 0.9387 - f1_m: 0.0616 - val_loss: 0.2263 - val_acc: 0.9398 - val_f1_m: 0.0586

y_pred = model.predict(X_train_indices[:100000, ])

dict_keys(['val_loss', 'val_acc', 'val_f1_m', 'loss', 'acc', 'f1_m'])

# summarize history for accuracy
import matplotlib.pyplot as plt
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

plt.hist(y_pred , bins = 100)

(array([9.4407e+04, 1.9390e+03, 3.0940e+03, 4.1400e+02, 9.8000e+01,
        3.1000e+01, 1.0000e+00, 2.0000e+00, 0.0000e+00, 2.0000e+00,
        0.0000e+00, 3.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 8.0000e+00, 1.0000e+00]),
 array([0.05766571, 0.06163964, 0.06561357, 0.0695875 , 0.07356143,
        0.07753536, 0.08150929, 0.08548322, 0.08945715, 0.09343109,
        0.09740502, 0.10137895, 0.10535288, 0.1093268 , 0.11330073,
        0.11727466, 0.1212486 , 0.12522253, 0.12919647, 0.1331704 ,
        0.13714433, 0.14111826, 0.14509219, 0.14906612, 0.15304005,
        0.15701398, 0.1609879 , 0.16496183, 0.16893576, 0.17290969,
        0.17688362, 0.18085755, 0.18483149, 0.18880542, 0.19277935,
        0.19675328, 0.20072721, 0.20470114, 0.20867507, 0.212649  ,
        0.21662293, 0.22059686, 0.2245708 , 0.22854473, 0.23251866,
        0.23649259, 0.24046652, 0.24444045, 0.24841438, 0.25238833,
        0.25636226, 0.26033616, 0.2643101 , 0.26828402, 0.27225795,
        0.27623188, 0.28020582, 0.28417975, 0.28815368, 0.2921276 ,
        0.29610154, 0.30007547, 0.3040494 , 0.30802333, 0.31199726,
        0.3159712 , 0.31994513, 0.32391906, 0.327893  , 0.33186692,
        0.33584085, 0.33981478, 0.3437887 , 0.34776264, 0.35173658,
        0.3557105 , 0.35968444, 0.36365837, 0.3676323 , 0.37160623,
        0.37558016, 0.3795541 , 0.38352802, 0.38750196, 0.3914759 ,
        0.39544982, 0.39942375, 0.40339768, 0.4073716 , 0.41134554,
        0.41531947, 0.4192934 , 0.42326733, 0.42724127, 0.4312152 ,
        0.43518913, 0.43916306, 0.443137  , 0.44711092, 0.45108485,
        0.45505878], dtype=float32),
 <a list of 100 Patch objects>)

pd.DataFrame(y_pred).describe()

rocauc = roc_auc_score(Y_train[:100000] ,y_pred )
print("ROC AUC Score -" ,rocauc)

ROC AUC Score - 0.6888920713148698

Observations -

ROC AUC better for NN model than linear model
Optimum threshold for probability output to get final predictions still need to be caliberated
Hyper parameters - batch size, n_epochs to be caliberated based on loss plots - need a more powerful machine
Having excessive padding due to just one long sentece was making the network hard to trin - all predictions were the same.
Class imbalance means that the probabilities for positive class will always be low and appropriate thresholds have to be chosen. Like in this case - the max output prob is around ~ 7%

dhruvnigam93/quoraIntentDetection

Quora Insinscere question detection

The approach

The data

Using sentence embeddings

RNN with word embeddings