# Entry for Kaggle - Quora Competition

Team Members: Germayne Ng, Alson Yap
## Competition

Link to competition details and question: https://www.kaggle.com/c/quora-question-pairs
## Update Logs
Version 1.6 - 13th May 2017:

- Revamped the LSA features: the combined train and test data are now treated as the 'document' corpus, as opposed to referencing them separately.
- 6 new features: LSA Q1 components 1 and 2, LSA Q2 components 1 and 2, and 2 distance features based on the components.
- Total 55 features (note that 2 features overwrite the old distance features, so 4 net new additions).
- Major change in the xgboost.py script: section 5 is now split into 5a, 5b and 5c. For cross-validation, run 5a then 5b. For modelling a submission, run 5a then 5c and skip 5b. This is mainly so that the full training set is used for modelling.
- Score: 0.27XX
Version 1.6 - 12th May 2017:

- Added 4 magic features - score 0.30XX
- Added 12 AB features - score 0.28XX
- Total 51 features
Version 1.5 - 20th April 2017:

- Finally the best score so far. Managed to make good use of the LSA component features.
- Distance features can be further distinguished. Alson did the distances for single vectors. If you define LSA components and apply distances to them, that can be a separate set of features.
- Based on Alson's distance functions, I created euclidean and manhattan functions for single vectors. Essentially, there are 2 features each based on the euclidean and manhattan distances (4 in total).
- LSA components: each question is now a vector of LSA-TFIDF components. For example (the values are arbitrary, just for example's sake):

  | question | component 1 | component 2 |
  |---|---|---|
  | question 1 of pair 1 | 0.23 | 0.56 |
  | question 2 of pair 1 | 0.4 | 0.7 |

  So question 1, instead of words, is now [0.23, 0.56] and question 2 is [0.4, 0.7]. Now that they are vectors, we can calculate the distances between them.
- Tuning nrounds to 1000 and lowering the learning rate to 0.1 gave better results.
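A minimal sketch of the distance computation described above (the component values are the illustrative ones from the table, not real LSA output):

```python
import math

def euclidean(u, v):
    # straight-line distance between two component vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    # sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(u, v))

q1 = [0.23, 0.56]  # question 1 in LSA-TFIDF space
q2 = [0.4, 0.7]    # question 2 in LSA-TFIDF space

print(round(euclidean(q1, q2), 4))  # 0.2202
print(round(manhattan(q1, q2), 4))  # 0.31
```

Each question pair then contributes one euclidean and one manhattan value as model features.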
Version 1.4 - 4th April 2017:

- Expanded on the TFIDF function (3 features); added character count without spaces (3 features) and character count per word (3 features).
- Total 30 features. Score: 0.32624 (Rank 162: Top 14%)
- Updated the features dataset in Dropbox
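A minimal sketch of the two character-count features mentioned above (the function names are my own, not from the repo):

```python
def char_count_no_spaces(question):
    # number of characters in the question, excluding spaces
    return len(question.replace(" ", ""))

def chars_per_word(question):
    # average characters per word; 0.0 for an empty question
    words = question.split()
    return char_count_no_spaces(question) / len(words) if words else 0.0

print(char_count_no_spaces("How do I learn Python?"))  # 18
print(chars_per_word("How do I learn Python?"))        # 3.6
```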
Version 1.3 - 1st April 2017:

- Added Jaccard distance and cosine distance features. Total 21 features. (Rank 133: Top 15%)
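A rough sketch of word-level Jaccard and cosine distances between a question pair (a simple bag-of-words over whitespace tokens; the repo's actual implementation may differ):

```python
import math
from collections import Counter

def jaccard_distance(q1, q2):
    # 1 - |intersection| / |union| over the two word sets
    s1, s2 = set(q1.lower().split()), set(q2.lower().split())
    union = s1 | s2
    if not union:
        return 0.0
    return 1 - len(s1 & s2) / len(union)

def cosine_distance(q1, q2):
    # 1 - cosine similarity over word-count vectors
    c1, c2 = Counter(q1.lower().split()), Counter(q2.lower().split())
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    if n1 == 0 or n2 == 0:
        return 1.0
    return 1 - dot / (n1 * n2)

a = "how do i learn python"
b = "how do i learn java"
print(round(jaccard_distance(a, b), 4))  # 0.3333
print(round(cosine_distance(a, b), 4))   # 0.2
```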
Version 1.2 - 1st April 2017:

- Added 7 FuzzyWuzzy features. Total 19 features. (Rank 151: Top 15%)
Version 1.1 - 31st March 2017:

- Added additional features. Total 12 features. (Rank 215: Top 25%)
Version 1.0 - 30th March 2017:

- Implemented Xgboost with 6 features.
## References

- Initial reference: https://www.kaggle.com/alijs1/quora-question-pairs/xgb-starter-12357/code
- FuzzyWuzzy reference:
- Jaccard and Cosine distance: https://www.kaggle.com/heraldxchaos/quora-question-pairs/adventures-in-scikitlearn-and-nltk/run/1040772
- Future reference for new ideas: https://www.kaggle.com/c/quora-question-pairs/discussion/30340#171996
- SVD, LSA components
- Xgboost references
- The art of hyper tuning
- Xgboost notes by germ:
- Ensemble methods:
## To do / comments

- Running the spell checker script on the train and test sets (~20k rows per hour):
  the train set (~404k rows) will take about 404/20 ≈ 20 hours,
  and the test set will take about 20 hours × 6 = 120 hours, i.e. 5+ days.
- Spell checker script taken from: http://norvig.com/spell-correct.html
- Download the big.txt file, because the script needs to reference that corpus of words. I only added the sentence_correction function, which is to be run on both data sets.
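A rough sketch of what a sentence_correction wrapper around Norvig's word-level correction() could look like (the toy corrector below is a stand-in for illustration only; Norvig's real correction() needs the big.txt corpus):

```python
import re

def sentence_correction(sentence, correction):
    # apply a word-level spell corrector (e.g. Norvig's correction())
    # to each alphabetic token, leaving punctuation and spaces intact
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z]+", sentence)
    return "".join(correction(t) if t.isalpha() else t for t in tokens)

# toy stand-in for Norvig's correction(word)
fixes = {"quck": "quick", "qestion": "question"}
def toy_correction(word):
    return fixes.get(word.lower(), word)

print(sentence_correction("Is this a quck qestion?", toy_correction))
# Is this a quick question?
```

In the actual scripts, Norvig's correction() would be passed in (or called directly) in place of toy_correction, and sentence_correction applied to every question in both data sets.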
Conclusion: The FuzzyWuzzy features are really significant. But note that after implementing the Jaccard and cosine distance features, the importance rankings changed. Refer to the feature importance plots in the figures folder.