Zhihu Machine Learning Challenge 2017
Abstract
In the Zhihu Machine Learning Challenge 2017, we were asked to build a model that automatically and accurately tags Zhihu content with topics. Our final submission was a two-stage process that scored 0.43436 on the Public LB and 0.43273 on the Private LB, ranking 3rd out of all teams. This document describes our team's solution, which can be divided into two parts:
- Deep Learning: build various DL models to rank all topics.
- Learning To Rank: build a RankGBM model to re-rank the ten most likely topics.
Learning To Rank
In the first stage, the DL models produce a prediction result for each <instance, topic> pair. In the second stage, we vote for every instance based on these model results. After voting, each instance is associated with its ten most likely topics. We then build a RankGBM model to re-rank these ten candidate topics.
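The voting step can be illustrated with a minimal sketch. It assumes that voting means a weighted sum of per-model scores followed by a top-k cut; the exact scheme used by `bin.rank.vote` is not described here, so the function name `vote_top_k`, the weights, and the data layout are all hypothetical.

```python
from collections import defaultdict


def vote_top_k(model_scores, k=10, weights=None):
    """Combine per-model topic scores for ONE instance and keep the k best topics.

    model_scores: list of dicts {topic_id: score}, one dict per first-stage model.
    weights: optional per-model weights (assumption: equal weights by default).
    """
    if weights is None:
        weights = [1.0] * len(model_scores)

    total = defaultdict(float)
    for weight, scores in zip(weights, model_scores):
        for topic_id, score in scores.items():
            total[topic_id] += weight * score

    # Highest combined score first; ties are broken by topic id for determinism.
    ranked = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return [topic_id for topic_id, _ in ranked[:k]]


# Toy scores from two hypothetical models for a single instance.
model_a = {"t1": 0.9, "t2": 0.2, "t3": 0.1}
model_b = {"t1": 0.5, "t2": 0.8, "t3": 0.3}
print(vote_top_k([model_a, model_b], k=2))
```

The ten topics returned for each instance would then serve as the candidate set that the second-stage ranker reorders.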
The two-stage ranking procedure can be run with the following steps:
-
Enter the root directory of the project:
cd zhihu-machine-learning-challenge-2017/
-
Vote for the offline and online datasets:
python -m bin.rank.vote conf/rank_v29.conf vote offline
python -m bin.rank.vote conf/rank_v29.conf vote online
-
Generate features for the offline and online datasets:
# generate <instance, topic> pair features
python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_model offline
python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_model online
# generate <instance> features
python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_instance offline
python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_instance online
# generate <topic> features
python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_topic offline
python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_topic online
-
Generate rank data files for the offline and online datasets:
python -m bin.rank.rankgbm.rank_data conf/rank_v29.conf generate_offline
python -m bin.rank.rankgbm.rank_data conf/rank_v29.conf generate_online
-
Train a RankGBM model on the offline dataset and predict for the online dataset:
# 3-fold cross validation
python -m bin.rank.rankgbm.run conf/rank_v29.conf train 0 rank_v29
python -m bin.rank.rankgbm.run conf/rank_v29.conf train 1 rank_v29
python -m bin.rank.rankgbm.run conf/rank_v29.conf train 2 rank_v29
# predict for online dataset
python -m bin.rank.rankgbm.run out/rank_v29/conf/featwheel.conf test
-
Finally, the submission file can be found here:
vim out/rank_v29/pred/rank_submit.online.29
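To make the rank data and cross-validation steps more concrete, here is a rough sketch of how candidate topics could be grouped per instance for a learning-to-rank model and how instances could be split into three folds. The actual file format written by `bin.rank.rankgbm.rank_data` and the fold assignment used by `bin.rank.rankgbm.run` are not shown in this document, so `build_rank_groups`, `assign_folds`, the binary labels, and the round-robin split are assumptions for illustration only.

```python
def build_rank_groups(candidates, true_topics):
    """Flatten voted candidates into labeled rows plus per-instance group sizes.

    candidates: {instance_id: [topic_id, ...]} as produced by the voting step.
    true_topics: {instance_id: set of gold topic ids} (offline data only).
    Returns (rows, group_sizes); each row is (instance_id, topic_id, label),
    where label is 1 if the candidate is a gold topic and 0 otherwise (assumed).
    """
    rows, group_sizes = [], []
    for instance_id in sorted(candidates):
        topics = candidates[instance_id]
        gold = true_topics.get(instance_id, set())
        for topic_id in topics:
            rows.append((instance_id, topic_id, 1 if topic_id in gold else 0))
        # A ranker needs to know which consecutive rows belong to one instance.
        group_sizes.append(len(topics))
    return rows, group_sizes


def assign_folds(instance_ids, n_folds=3):
    """Assign whole instances to folds round-robin so a group is never split."""
    return {iid: i % n_folds for i, iid in enumerate(sorted(instance_ids))}


# Toy data: two instances with voted candidates, one of them with a gold topic.
candidates = {"q1": ["t1", "t2"], "q2": ["t3"]}
true_topics = {"q1": {"t2"}}
rows, group_sizes = build_rank_groups(candidates, true_topics)
print(rows)
print(group_sizes)
print(assign_folds(["q1", "q2", "q3", "q4"]))
```

Splitting by instance rather than by row keeps all ten candidates of an instance in the same fold, which is what a group-wise ranking objective generally expects.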