This is a Kaggle competition that started on Dec. 21, 2019. URL: https://www.kaggle.com/c/google-quest-challenge/overview
In this competition, we need to use a new dataset to build predictive algorithms for different subjective aspects of question-answering. The question-answer pairs were gathered from nearly 70 different websites, in a "common-sense" fashion.
There is a list of 30 target labels, which are the same as the column names in the `sample_submission.csv` file. Target labels with the prefix `question_` relate to the `question_title` and/or `question_body` features in the data. Target labels with the prefix `answer_` relate to the `answer` feature.
Each row contains a single question and a single answer to that question, along with additional features. The training data contains rows with some duplicated questions (but with different answers). The test data does not contain any duplicated questions.
This is not a binary prediction challenge. Target labels are aggregated from multiple raters and can have continuous values in the range [0,1], so predictions must also be in that range. However, EDA shows the values are not truly continuous: each column takes roughly nine discrete levels, such as 1/3, 1/2, and 2/3.
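As a quick check, the discrete levels can be seen by listing the unique values of a few target columns. This is a minimal sketch assuming the competition's standard `train.csv` / `sample_submission.csv` layout with a `qa_id` key column:

```python
# Sketch: confirm each target column takes only a handful of
# rater-aggregated levels rather than arbitrary floats in [0, 1].
import pandas as pd

train = pd.read_csv("train.csv")
sample_sub = pd.read_csv("sample_submission.csv")
target_cols = [c for c in sample_sub.columns if c != "qa_id"]

for col in target_cols[:3]:  # first few columns as an example
    print(col, sorted(train[col].unique()))
```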
Submissions are evaluated on the mean column-wise Spearman's rank correlation coefficient: the Spearman correlation is computed for each target column, and the mean of these values is the submission score.
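A minimal sketch of the metric (the function name and array layout are our own, and the official implementation may differ slightly, e.g. in how it treats constant columns):

```python
# Mean column-wise Spearman's rank correlation: computed per target
# column, then averaged across all 30 columns.
import numpy as np
from scipy.stats import spearmanr

def mean_column_spearman(y_true, y_pred):
    """y_true, y_pred: float arrays of shape (n_rows, n_targets)."""
    per_column = [
        spearmanr(y_true[:, i], y_pred[:, i]).correlation
        for i in range(y_true.shape[1])
    ]
    # A constant column yields NaN; nanmean skips it in this sketch.
    return float(np.nanmean(per_column))
```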
- `200102_Data&EDA_Jason.ipynb` is a general EDA over the datasets, so we can get a general understanding of the data.
- Then, without much feature engineering, a fine-tuned BERT model (`200102_BertModel_Jason.ipynb`) is applied to reach a baseline score of 0.385 (see the sketch below).
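For orientation, here is a minimal sketch of what such a fine-tuned BERT baseline can look like with the Hugging Face `transformers` library; the model name, input packing, and prediction head are assumptions, not necessarily what the notebook does:

```python
# Sketch of a BERT baseline for 30 soft targets: take the [CLS] token and
# pass a linear head through a sigmoid so predictions stay in [0, 1].
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

class QuestBert(nn.Module):
    def __init__(self, n_targets: int = 30):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.bert.config.hidden_size, n_targets)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # [CLS] representation
        return torch.sigmoid(self.head(cls))   # (batch, 30), values in [0, 1]

model = QuestBert()
enc = tokenizer(
    "question title: question body",  # question side (packing is a choice)
    "answer text",                    # answer side
    truncation=True, max_length=512, return_tensors="pt",
)
with torch.no_grad():
    preds = model(enc["input_ids"], enc["attention_mask"])
# Training would minimize nn.BCELoss() between preds and the 30 soft labels.
```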
- From the cross-validation results in `191228_Baseline_with_validation_Cara.ipynb`, we find that overall performance is good, while three columns (`question_not_really_a_question`, `question_type_consequence`, and `question_type_spelling`) perform extremely poorly due to severe imbalance in the training set.
- We tried Stratified K-Fold to make the distribution of the validation samples match that of the entire dataset and so counter the imbalance in the training set (see the sketch below). The code is here. Yet performance did not improve.
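A sketch of the idea follows; since the 30 targets themselves are continuous, stratification needs a discrete proxy, and using the question `category` column here is an assumption (the linked code may stratify on a different key):

```python
# Stratified K-Fold sketch: keep each fold's class mix identical to the
# full dataset by stratifying on a discrete proxy column.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

train = pd.read_csv("train.csv")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(skf.split(train, train["category"])):
    print(f"fold {fold}: {len(tr_idx)} train / {len(va_idx)} valid")
```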
- We also tried feature engineering, because for the column `question_type_spelling`, every non-zero value is in the `CULTURE` category (in particular, the `english` and `ell` (English Language Learners) Stack Exchange URLs). As such, it worked as post-processing to simply hardcode 0.0 for every non-`CULTURE` row and 1.0 for the rest (see the sketch below). The code is here, and performance improved to 0.389.
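A sketch of that post-processing step (it assumes `test.csv` and `sample_submission.csv` are row-aligned by `qa_id`, as in the competition data):

```python
# Hardcode question_type_spelling: 0.0 outside the CULTURE category,
# 1.0 inside it, per the observation above.
import pandas as pd

test = pd.read_csv("test.csv")
submission = pd.read_csv("sample_submission.csv")
submission["question_type_spelling"] = (test["category"] == "CULTURE").astype(float)
submission.to_csv("submission.csv", index=False)
```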
If anyone has tried anything and thinks it will be helpful to share, please keep the following naming format: `Time_Title_Name`, e.g. `191228_EDA_Jason`, `191227_BERT_Trial1_Jason`.
If you are not really familiar with how to connect GitHub with your local git, or are having problems creating pull requests or branches, there is a simple GitHub tutorial that might be helpful.
If, on the contrary, you are very familiar with GitHub and think the tutorial should cover more, feel free to add to it.