Airbnb-Data-Analysis: A Python repository from petebytes

Airbnb Data Analysis

Introduction:

Airbnb is an online marketplace and hospitality service that lets people lease or rent short-term lodging, including vacation rentals, apartment rentals, hostel beds and hotel rooms. The company does not own any lodging; it acts as a broker and receives a percentage service fee from both guests and hosts on every booking. It has over 3,000,000 lodging listings in 65,000 cities and 191 countries, and the cost of lodging is set by the host. Airbnb is a form of collaborative consumption and sharing. The site's content has expanded from air beds and shared spaces to a variety of properties including entire homes and apartments, private rooms, castles, boats, manors, tree houses, tipis, igloos and private islands. With its growing popularity and market presence, it has current and future impacts on the traditional accommodation sector. The volume of data generated by Airbnb, consisting of listings, ratings and reviews, has grown and can be analyzed to gain more insight into customer reviews.

Motivation:

One interesting feature of these online sharing economies is the review system. On Airbnb, hosts and guests may review their experience in 500 words or less at the end of each stay. Reviews often contain strong recommendations, for example “George was a great host, we stayed an extra day!” In addition to creating a sense of accountability, reviews give travelers useful information about what to expect from their rental experience. Qualitative descriptions such as “small but charming”, “noisy at night” or “free coffee” are often far more insightful than numerical ratings. Our motivation behind this project was to help hosts better understand which factors most affect ratings and reviews in their city and which characteristics are associated with negative rental experiences. This would let hosts improve specific aspects of their listings, raising their ratings and improving the experience for their guests. Secondarily, we wanted to aid customers by providing a visual display of the best listings based on consolidated ratings.

Objectives:

• To perform sentiment analysis on Airbnb review data.

• Classify reviews as positive and negative based on sentiment analysis.

• To determine what makes a rental experience good or bad.

• Analyze data to obtain keywords that frequently occur in negative reviews of listings in a US city.

• Plot top-rated listings in each category (apt, B&B, house, etc.) in US cities.

Design:

We aimed to achieve the following design goals:

• Scalability: The project should handle large datasets and adapt to changing requirements.

• Automated Workflow: We built a machine learning model so that the classification can be done automatically.

• Modularity: We implemented different modules for data cleaning, modelling and analysis so that each can be tested and debugged separately.

Data Sources:

The dataset was obtained from http://insideairbnb.com/get-the-data.html.

• Reviews data: (ID, Listing_ID, Date, Reviewer_ID, Reviewer_name, Comments)

• Listings data: (ID, Host_name, city, state, Latitude, Longitude, Property_type, Review_score_consolidated)

• Calendar data: (ID, Dates, Price)

Technologies used:

We used Python for data preprocessing. NLTK was used to produce a properly labelled training dataset for our model. We initially built the machine learning model with H2O.ai, but that model was inefficient and gave incorrect results, so we built a new model in SparkML.

After classifying the reviews as positive or negative, we used Scala in Apache Spark for data analysis. The analysis results were presented using Tableau. A web app was created to display the results on a map using Plotly.

Data Cleaning:

• Newline characters were removed from review comments; otherwise the files would not parse as CSV in Spark or H2O.

• Our model was built considering only English-language review comments, so we filtered out comments in other languages using the langdetect Python package.

• Empty prices in the calendar file were replaced by the mean price.
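
The three cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the function names are hypothetical, and the language detector is passed in as a callable so that, for instance, the `detect` function from langdetect could be supplied.

```python
import re
from statistics import mean


def strip_newlines(comment):
    """Collapse newline characters (and whitespace around them) into one space."""
    return re.sub(r"\s*[\r\n]+\s*", " ", comment).strip()


def keep_english(comments, detect_language):
    """Keep only comments whose detected language code is 'en'.

    `detect_language` is any callable mapping text to a language code
    (an assumption of this sketch). Comments it fails on are dropped.
    """
    kept = []
    for comment in comments:
        try:
            if detect_language(comment) == "en":
                kept.append(comment)
        except Exception:
            # Detectors can fail on empty or symbol-only text; skip those rows.
            continue
    return kept


def fill_missing_prices(prices):
    """Replace missing prices (None) with the mean of the known prices."""
    known = [p for p in prices if p is not None]
    if not known:
        return list(prices)
    average = mean(known)
    return [average if p is None else p for p in prices]
```

Stripping newlines first matters because a raw line break inside a comment field is what splits one CSV record across several lines.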

Hashing TF:

HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a “set of terms” might be a bag of words. HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing.
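
To make the hashing trick concrete, here is a small pure-Python sketch. It is not Spark's implementation: the function name `hashing_tf` and the `num_features` parameter are chosen for illustration, and CRC32 from the standard library stands in for MurmurHash 3.

```python
import zlib


def hashing_tf(terms, num_features=16):
    """Map a bag of terms to a fixed-length term-frequency vector."""
    vector = [0] * num_features
    for term in terms:
        # Stand-in hash (assumption): CRC32 instead of MurmurHash 3.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        # No term-to-index dictionary is needed; the hash decides the slot.
        vector[index] += 1
    return vector


if __name__ == "__main__":
    print(hashing_tf(["free", "coffee", "free", "wifi"], num_features=8))
    # With a single bucket, every term collides into the same index.
    print(hashing_tf(["small", "charming"], num_features=1))
```

The vector length is fixed by `num_features` regardless of vocabulary size, which is why a smaller value saves memory at the cost of more collisions.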

Naïve Bayes:

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the class. We ran sentiment analysis on the comments to categorize them as 'positive' or 'negative'.

We combined the listings data with the calendar data, grouped the rows by property type and sorted each group in decreasing order of rating. We then took the top ten listings in each property type and plotted them on a map using Plotly.
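
As an illustration of the idea, the sketch below implements a tiny multinomial Naive Bayes text classifier with Laplace smoothing in plain Python. It is not the SparkML model used in the project; the class name, the smoothing value and the toy reviews are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over tokenized documents (illustrative only)."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.class_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocabulary = set()

    def fit(self, documents, labels):
        for words, label in zip(documents, labels):
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, words):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # log P(label) plus the sum of log P(word | label); the sum is
            # valid because words are treated as independent given the label.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + self.smoothing * vocab_size
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + self.smoothing) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (made up for this example).
reviews = [
    ["great", "host", "clean"],
    ["lovely", "clean", "place"],
    ["dirty", "noisy", "room"],
    ["noisy", "rude", "host"],
]
sentiments = ["positive", "positive", "negative", "negative"]
model = NaiveBayesText().fit(reviews, sentiments)
```

Calling `model.predict(["clean", "great"])` scores each label and returns the one with the highest log-probability. Smoothing keeps a word unseen in one class from driving that class's probability to zero.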
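
The grouping and ranking step can be sketched in plain Python as below. The project did this in Scala on Apache Spark, so this is only an illustration: the function name is hypothetical, the dictionary keys reuse the column names listed under Data Sources (their exact spelling in the files is assumed), and the join with calendar data is omitted.

```python
from collections import defaultdict


def top_listings_by_type(listings, n=10):
    """Group listings by property type and keep the n highest-rated in each."""
    groups = defaultdict(list)
    for listing in listings:
        groups[listing["Property_type"]].append(listing)
    return {
        property_type: sorted(
            rows,
            key=lambda row: row["Review_score_consolidated"],
            reverse=True,  # highest rating first
        )[:n]
        for property_type, rows in groups.items()
    }


# Made-up sample rows for illustration.
sample = [
    {"ID": 1, "Property_type": "Apartment", "Review_score_consolidated": 92},
    {"ID": 2, "Property_type": "House", "Review_score_consolidated": 88},
    {"ID": 3, "Property_type": "Apartment", "Review_score_consolidated": 97},
    {"ID": 4, "Property_type": "Apartment", "Review_score_consolidated": 85},
]
top_two = top_listings_by_type(sample, n=2)
```

Each value in the result is a list of rows already sorted by rating, so the latitude and longitude of those rows could be passed straight to a map plot.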

Conclusion

• The model successfully labelled reviews as positive or negative.

• We conclude from our analysis that negative reviews in different cities focus on different keywords.

• We also suggested the top 10 listings to users based on consolidated ratings.

Future Work:

• Incorporate Spark Streaming to stream static files.

• Establish a relation between listing availability and rating.

• Analyze trends in price changes by hosts during holiday seasons, festivals, etc.

References:

• http://insideairbnb.com/get-the-data.html
