I Just Want To Be Popular...On Airbnb
Overview
I Just Want To Be Popular is a data science project that predicts the popularity of Airbnb listings in San Francisco in 2017. The question it asks: will a listing be in the top 20 percent?
To take a look at my capstone presentation, click here.
Motivation
Airbnb hosts want popular listings because popularity brings more bookings and therefore more profit. This project offers some answers about what hosts can do to make their listings more popular.
Data
Data Source: InsideAirbnb.com
Date Scope: The data set used in this project consists of 117,107 total listings in 2017. There are approximately 8,000-16,000 listings per month.
Popularity Proxy: I used the number of "reviews per month".
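As a rough illustration of how a "top 20 percent" label could be derived from the reviews-per-month proxy, here is a minimal sketch. The cutoff rule, the function name `label_top_20_percent`, and the sample values are assumptions made for illustration; the project's notebooks may define the label differently.

```python
def label_top_20_percent(reviews_per_month):
    """Return 1 for listings at or above the top-20-percent cutoff, else 0."""
    ranked = sorted(reviews_per_month, reverse=True)
    # Number of listings that fall in the top 20 percent (at least one).
    top_count = max(1, round(0.2 * len(ranked)))
    cutoff = ranked[top_count - 1]
    return [1 if value >= cutoff else 0 for value in reviews_per_month]


# Hypothetical reviews-per-month values for ten listings.
sample = [0.3, 4.1, 1.2, 0.8, 2.5, 6.0, 0.1, 1.9, 3.3, 0.5]
labels = label_top_20_percent(sample)
print(labels)  # [0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
```

Note that ties at the cutoff are all labeled popular, so the positive class can be slightly larger than 20 percent.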
Baseline
The scores in baseline 2 were determined by random guessing over 1,000 trials. The code for this can be found in Baseline_Score.ipynb.
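A random-guess baseline of this kind can be sketched as follows. This is not the code in Baseline_Score.ipynb: the guessing probability (`positive_rate`), the seed, and the sample labels are all assumptions chosen for illustration.

```python
import random


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def random_baseline_f1(y_true, n_trials=1000, positive_rate=0.5, seed=0):
    """Average F1 over n_trials rounds of independent random guesses."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        # Guess "popular" with probability positive_rate for every listing.
        y_pred = [1 if rng.random() < positive_rate else 0 for _ in y_true]
        scores.append(f1_score(y_true, y_pred))
    return sum(scores) / len(scores)


# Hypothetical labels: 20 popular listings out of 100.
y_true = [1] * 20 + [0] * 80
baseline = random_baseline_f1(y_true)
print(round(baseline, 3))
```

Averaging over many trials smooths out the noise of any single round of guesses, which is why a figure like 1,000 trials is a reasonable choice.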
Analysis:
Model A: TEXT FEATURE
NLP analysis of several text categories indicated that the listing description was the best predictor of a listing's popularity. After removing English stopwords and lemmatizing the text, I vectorized it with both CountVectorizer and TF-IDF. Random Forest on the CountVectorizer output gave the highest F1 score, 0.86. The code for Model A can be found in Model_A.ipynb. If you are interested in the code for the other text categories, see the notebooks whose titles start with NLP.
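To make the difference between raw counts and TF-IDF weights concrete, here is a small standard-library sketch. It is not the code in Model_A.ipynb: the stopword list is a tiny illustrative one, lemmatization is skipped, no row normalization is applied, and the sample descriptions are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real run would use a full English list.
STOPWORDS = {"a", "an", "and", "the", "in", "is", "of", "to", "with"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def count_vectors(docs):
    """Return the sorted vocabulary and one raw term-count row per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[w] for w in vocab])
    return vocab, rows


def tfidf_vectors(vocab, rows):
    """Reweight counts by a smoothed inverse document frequency."""
    n_docs = len(rows)
    doc_freq = [sum(1 for row in rows if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    return [[count * weight for count, weight in zip(row, idf)] for row in rows]


# Hypothetical listing descriptions.
docs = [
    "Sunny room in the Mission with a private bath",
    "Private room with a view of the bay",
    "Sunny studio",
]
vocab, counts = count_vectors(docs)
tfidf = tfidf_vectors(vocab, counts)
print(vocab)
print(counts[0])
```

In the first description, "bath" and "private" both have a count of 1, but TF-IDF gives "bath" the larger weight because it appears in fewer documents.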
Model B: NON-TEXT FEATURES
The features I used included 'host_length', 'amenities_count', 'review_scores_rating', 'host_response_rate', 'access_filled', 'house_rules_filled', 'space_filled', 'accommodates', 'extra_people', 'price_per_guest', 'price_per_bedroom', 'guests_included', 'host_about_filled', 'cancellation_policy', 'room_type', 'property_type_new', 'instant_bookable', 'calculated_host_listings_count', and 'minimum_nights'. A description of these features can be found in Appendix A below.
XGBoost was my best model, with an F1 score of 0.88. The code for Model B can be found in Model_B.ipynb.
Final:
My final model is an ensemble of one model built on the text feature and one model built on the non-text features. The ensemble with the best F1 score combined the Random Forest from Model A with XGBoost from Model B. It achieved a recall of ~94%, a precision of ~86%, and an F1 score of ~90%, improving on the baseline recall of ~88%. The code for the final model can be found in Ensemble.ipynb.
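One common way to combine two classifiers is to average their predicted probabilities (soft voting). The sketch below assumes that approach, which may differ from what Ensemble.ipynb actually does; the probabilities, labels, and the 0.5 threshold are hypothetical.

```python
def average_probabilities(text_probs, non_text_probs):
    """Soft-voting ensemble: mean of the two models' positive-class probabilities."""
    return [(a + b) / 2 for a, b in zip(text_probs, non_text_probs)]


def predict(probs, threshold=0.5):
    """Label a listing popular (1) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probs]


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical per-listing probabilities from the two models.
text_probs = [0.9, 0.4, 0.2, 0.7, 0.3]      # stand-in for the text model
non_text_probs = [0.8, 0.7, 0.1, 0.2, 0.4]  # stand-in for the non-text model
y_true = [1, 1, 0, 1, 0]

y_pred = predict(average_probabilities(text_probs, non_text_probs))
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(y_pred)  # [1, 1, 0, 0, 0]

# Consistency check on the reported scores: precision 0.86, recall 0.94.
print(round(f1_from(0.86, 0.94), 2))  # 0.9
```

The last line shows that the reported precision (~86%) and recall (~94%) are consistent with the reported F1 score of ~90%.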
Findings:
These are the top words in the description of the listing that the model used to identify popularity.
These are the top five features identified by the model as the most important.
Based on these findings, here is a sample text I put together to show what a good listing description might look like. The words in blue are the key words that the model associated with popular listings (see word cloud above).
Appendix A: Description of features
host_length - years that the host has been hosting
amenities_count - the number of amenities the host offers for the listing
review_scores_rating - review score for the listing
host_response_rate - responsiveness of the host
access_filled - is there a description under the access category (yes or no)
house_rules_filled - is there a description under the house rules category (yes or no)
host_about_filled - is there a description under the host about category (yes or no)
space_filled - is there a description under the space category (yes or no)
accommodates - the number of people the place accommodates
extra_people - the number of extra people that the place can accommodate
price_per_guest - the log of the listing price plus cleaning fee per number of guests
price_per_bedroom - the log of the listing price plus cleaning fee per number of bedrooms
guests_included - the number of guests included with the listing price
cancellation_policy - the kind of cancellation policy
room_type - the type of room
property_type_new - the type of property
instant_bookable - is this place instantly bookable (yes or no)
calculated_host_listings_count - the number of listings the host has
minimum_nights - the minimum number of nights one must stay to book the place
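A few of the engineered features above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the input is a hypothetical, already-cleaned dictionary; the raw field names (`price`, `cleaning_fee`, `bedrooms`, `amenities`, `access`, `house_rules`) are assumed; `accommodates` is assumed to be the guest count used in price_per_guest; and zero-bedroom listings are treated as having one bedroom to avoid dividing by zero.

```python
import math


def engineer_features(listing):
    """Derive a handful of Appendix A features from one cleaned listing record."""
    total_price = listing["price"] + listing["cleaning_fee"]
    # Assumption: studios (0 bedrooms) count as one bedroom.
    bedrooms = max(listing["bedrooms"], 1)
    return {
        # Log of (price + cleaning fee) per guest; guest count is an assumption.
        "price_per_guest": math.log(total_price / listing["accommodates"]),
        # Log of (price + cleaning fee) per bedroom.
        "price_per_bedroom": math.log(total_price / bedrooms),
        "amenities_count": len(listing["amenities"]),
        # 1 if the host wrote anything in the section, else 0.
        "access_filled": int(bool(listing["access"].strip())),
        "house_rules_filled": int(bool(listing["house_rules"].strip())),
    }


# Hypothetical listing record.
listing = {
    "price": 150.0,
    "cleaning_fee": 50.0,
    "accommodates": 4,
    "bedrooms": 2,
    "amenities": ["Wifi", "Kitchen", "Heating"],
    "access": "",
    "house_rules": "No smoking",
}
features = engineer_features(listing)
print(features["amenities_count"], features["access_filled"], features["house_rules_filled"])  # 3 0 1
```

Taking the log of the price ratios compresses the long right tail that listing prices typically have, which is one plausible reason for the log in the feature definitions.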