Zhihu Machine Learning Challenge 2017

Abstract

In the Zhihu Machine Learning Challenge 2017, we were asked to build a model that automatically and accurately tags topics for Zhihu content. Our final submission was a two-stage process that scored 0.43436 on the public leaderboard (LB) and 0.43273 on the private LB, ranking 3rd among all teams. This document describes our team's solution, which can be divided into two parts:

Deep Learning: build various DL models to sort all topics.
Learning To Rank: build a RankGBM model to sort the ten most likely topics.

Learning To Rank

In the first stage, we obtain the DL model predictions for each <instance, topic> pair. In the second stage, we vote for every instance based on these model results. After voting, each instance is associated with its ten most likely topics. We then build a RankGBM model to sort these ten topics.

This second stage can be run with the following steps:

Enter the root directory of the project:

cd zhihu-machine-learning-challenge-2017/

Vote for the offline and online datasets:

python -m bin.rank.vote conf/rank_v29.conf vote offline
python -m bin.rank.vote conf/rank_v29.conf vote online
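
As a rough illustration of what the voting step produces, the sketch below combines per-model topic scores for one instance and keeps the ten highest-scoring topics. It assumes that voting is a weighted sum of model scores; the function `vote_top_k`, its parameters, and the weighting scheme are hypothetical and may differ from what `bin.rank.vote` actually does.

```python
from collections import defaultdict


def vote_top_k(model_scores, weights=None, k=10):
    """Combine per-model topic scores for one instance and return the top-k topics.

    model_scores: list of dicts, one per model, mapping topic id -> score.
    weights: optional per-model weights (an assumption; equal weights by default).
    """
    if weights is None:
        weights = [1.0] * len(model_scores)
    total = defaultdict(float)
    for scores, weight in zip(model_scores, weights):
        for topic, score in scores.items():
            total[topic] += weight * score
    # Sort by combined score, highest first; break ties by topic id for determinism.
    ranked = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return [topic for topic, _ in ranked[:k]]


if __name__ == "__main__":
    model_a = {"t1": 0.9, "t2": 0.2, "t3": 0.1}
    model_b = {"t1": 0.5, "t2": 0.8, "t3": 0.3}
    print(vote_top_k([model_a, model_b], k=2))
```

Each instance then carries a fixed-size candidate list, which is what the ranking stage reorders.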

Generate features for the offline and online datasets:

# generate <instance, topic> pair features
python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_model offline
python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_model online
# generate <instance> features
python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_instance offline
python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_instance online
# generate <topic> features
python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_topic offline
python -m bin.rank.feature conf/rank_v29.conf generate_featwheel_feature_from_topic online
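
One way to picture how the three feature groups fit together is sketched below: every candidate <instance, topic> pair gets one row made of its pair features, the features of its instance, and the features of its topic. The function name and the dictionary-based storage are assumptions made for illustration; the project's own feature storage is not described here.

```python
def build_pair_rows(candidates, pair_feats, instance_feats, topic_feats):
    """Build one feature row per candidate <instance, topic> pair.

    candidates: dict mapping instance id -> list of candidate topic ids.
    pair_feats: dict mapping (instance id, topic id) -> list of floats.
    instance_feats: dict mapping instance id -> list of floats.
    topic_feats: dict mapping topic id -> list of floats.
    """
    rows = []
    for instance, topics in candidates.items():
        for topic in topics:
            # Concatenate pair-level, instance-level and topic-level features.
            vector = (
                pair_feats[(instance, topic)]
                + instance_feats[instance]
                + topic_feats[topic]
            )
            rows.append((instance, topic, vector))
    return rows


if __name__ == "__main__":
    rows = build_pair_rows(
        {"q1": ["t1", "t2"]},
        {("q1", "t1"): [0.9], ("q1", "t2"): [0.4]},
        {"q1": [12.0]},
        {"t1": [3.0], "t2": [7.0]},
    )
    for row in rows:
        print(row)
```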

Generate rank data files for the offline and online datasets:

python -m bin.rank.rankgbm.rank_data conf/rank_v29.conf generate_offline
python -m bin.rank.rankgbm.rank_data conf/rank_v29.conf generate_online
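
Ranking data is usually grouped per query, here per instance, with a relevance label for each candidate topic. The sketch below writes such rows in the LETOR/SVMrank-style text layout (`label qid:N index:value ...`). That layout is chosen only for illustration: the file format the project's RankGBM code expects is not stated in this document, and `to_rank_lines` is a hypothetical helper.

```python
def to_rank_lines(rows, gold_topics):
    """Format candidate rows as LETOR-style ranking lines.

    rows: list of (instance id, topic id, feature vector), grouped by instance.
    gold_topics: dict mapping instance id -> set of true topic ids.
    """
    qids = {}
    lines = []
    for instance, topic, vector in rows:
        # Assign consecutive query ids in order of first appearance.
        qid = qids.setdefault(instance, len(qids) + 1)
        # Label is 1 when the candidate topic is a true topic of the instance.
        label = 1 if topic in gold_topics.get(instance, set()) else 0
        feats = " ".join(f"{i}:{v:g}" for i, v in enumerate(vector, start=1))
        lines.append(f"{label} qid:{qid} {feats}")
    return lines


if __name__ == "__main__":
    rows = [
        ("q1", "t1", [0.5, 1.0]),
        ("q1", "t2", [0.25, 0.0]),
        ("q2", "t1", [1.5, 2.0]),
    ]
    gold = {"q1": {"t2"}, "q2": {"t1"}}
    print("\n".join(to_rank_lines(rows, gold)))
```

For the online dataset the true topics are unknown, so every label would be a placeholder such as 0.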

Train a RankGBM model on the offline dataset and predict for the online dataset:

# 3-fold cross validation
python -m bin.rank.rankgbm.run conf/rank_v29.conf train 0 rank_v29
python -m bin.rank.rankgbm.run conf/rank_v29.conf train 1 rank_v29
python -m bin.rank.rankgbm.run conf/rank_v29.conf train 2 rank_v29
# predict for online dataset
python -m bin.rank.rankgbm.run out/rank_v29/conf/featwheel.conf test
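
The arguments 0, 1 and 2 in the training commands appear to select the fold. A minimal sketch of a 3-fold split is shown below, assuming the split is made at the instance level so that all ten candidate topics of an instance stay in the same fold. The helper `split_fold` and the round-robin assignment are assumptions; how the project actually assigns folds is not described here.

```python
def split_fold(instance_ids, fold, n_folds=3):
    """Return (train_ids, valid_ids) for the given fold index.

    Instances are assigned to folds round-robin by position, so every
    candidate pair of an instance ends up on the same side of the split.
    """
    if not 0 <= fold < n_folds:
        raise ValueError(f"fold must be in [0, {n_folds}), got {fold}")
    train_ids, valid_ids = [], []
    for position, instance in enumerate(instance_ids):
        if position % n_folds == fold:
            valid_ids.append(instance)
        else:
            train_ids.append(instance)
    return train_ids, valid_ids


if __name__ == "__main__":
    ids = ["a", "b", "c", "d", "e", "f"]
    for fold in range(3):
        train_ids, valid_ids = split_fold(ids, fold)
        print(fold, train_ids, valid_ids)
```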

Finally, you can find the submission file here:

vim out/rank_v29/pred/rank_submit.online.29

HouJP/zhihu-machine-learning-challenge-2017
