This repository contains the code developed for the Airbnb's Kaggle competition. It's written in Python, some in the form of Jupyter Notebooks, and other in pure Python 3.
The code produces predictions with scores around 0.88090% in the public leader-board, enough to be in the best 5% participants(0.001% behind the best) and 0.88509% in the private leader-board(0.0018% behind the winner)
The entire run should not take more than 4 hours(thanks to the parallel preprocessing) in a modern/recent computer, though you may run into memory issues with less than 8GB RAM.
Feel free to contribute to the code or open an issue if you see something wrong.
New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By accurately predicting where a new user will book their first travel experience, Airbnb can share more personalized content with their community, decrease the average time to first booking, and better forecast demand.
In this competition, the goal is to predict in which country a new user will make his or her first booking. There are 12 possible outcomes of the destination country and the datasets consist of a list of users with their demographics, web session records, and some summary statistics.
Due to the Competition Rules, the data sets can not be shared. If you want to take a look at the data, head over the competition page and download it.
You need to download train_users_2.csv
, test_users.csv
and sessions.csv
files and unzip them into the 'data' folder.
Note: Since the train users file is the one re-uploaded by the competition
administrators, rename train_users_2.csv
as train_users.csv
.
-
The provided datasets have lot of NaNs and some other random values, so, a good preprocessing is the primary key to get a good solution:
- Replace -unknown- values with NaNs
- Clean age values
- Extract day, weekday, month, year from
date_account_created
andtimestamp_first_active
- Add number of missing values per user
- General user session information:
- Number of different values in
action
,action_type
,action_detail
anddevice_type
- Number of different values in
-
That kind of classification task works nicely with tree-based methods, I used
xgboost
library and the Gradient Boosting Classifier that provides alongscikit-learn
to make the probabilities predictions.
To replicate the findings and execute the code in this repository you will need basically the next Python packages:
- XGBoost Documentation - A library designed and optimized for boosted (tree) algorithms.
- Pattern Classification - Tutorials, examples, collections, and everything else that falls into the categories: pattern classification, machine learning, and data mining.
Copyright © 2015 David Gasquez Licensed under the MIT license.