This is the source code for MonkeyLearn's series of posts on analyzing sentiment and aspects in hotel reviews using machine learning models.
The project itself is a Scrapy project used to gather training and testing data from sites such as TripAdvisor and Booking. It also includes a set of Python scripts and Jupyter notebooks that implement the supporting steps.
The TripAdvisor spider (hotel_sentiment/spider/tripadvisor_spider.py) is used to gather data to train a sentiment analysis classifier in MonkeyLearn. Review texts are used as the sample content and review star ratings as the category (1 and 2 stars = Negative, 4 and 5 stars = Positive).
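The star-to-category rule above can be sketched as a small helper. This is only an illustration: the function name stars_to_label is hypothetical, and leaving 3-star reviews unlabeled is an assumption, since the rule only mentions 1, 2, 4 and 5 stars.

```python
from typing import Optional


def stars_to_label(stars: int) -> Optional[str]:
    """Map a review's star rating to a sentiment category.

    Hypothetical helper: 1-2 stars -> "Negative", 4-5 stars -> "Positive".
    Anything else (including 3 stars) returns None, which is an assumption
    about how unmapped ratings might be handled.
    """
    if stars in (1, 2):
        return "Negative"
    if stars in (4, 5):
        return "Positive"
    return None
```

Rows for which the helper returns None could simply be dropped before uploading samples to the classifier.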
To crawl ~15000 items from TripAdvisor, use:
scrapy crawl tripadvisor -o itemsTripadvisor.csv -s CLOSESPIDER_ITEMCOUNT=15000
You can check out the generated machine learning sentiment analysis model here.
The Booking spider (hotel_sentiment/spider/booking_spider.py) is used to gather data to train an aspect classifier in MonkeyLearn. The data obtained with this spider can be manually tagged with each aspect (e.g. cleanliness, comfort & facilities, food, internet, location, staff, value for money) using MonkeyLearn's Sample tab or an external crowdsourcing service like Mechanical Turk.
To crawl from Booking, first add the URL of a starting city, then use:
scrapy crawl booking -o itemsBooking.csv
To crawl from a single hotel in Booking, use:
scrapy crawl booking_singlehotel -o <hotel name>.csv
opinionTokenizer.py is a simple script to obtain the "opinion units" from each review.
classify_and_plot_reviews.ipynb is a simple script that uses the generated model to classify new reviews and then plot the results in a graph using Plotly.
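To give a rough idea of what splitting a review into "opinion units" can look like, here is a minimal sketch. It is not the logic of opinionTokenizer.py, which may work quite differently: the function name split_opinion_units and the rule of splitting on sentence punctuation and the word "but" are assumptions made for illustration.

```python
import re
from typing import List

# Assumed boundaries: sentence-ending punctuation, semicolons, and the
# contrastive conjunction "but". A real tokenizer may use other rules.
_BOUNDARY = re.compile(r"[.!?;]+|\bbut\b", flags=re.IGNORECASE)


def split_opinion_units(review: str) -> List[str]:
    """Split a review into short fragments that each carry one opinion."""
    fragments = _BOUNDARY.split(review)
    # Trim surrounding spaces and commas, and drop empty fragments.
    cleaned = (fragment.strip(" ,") for fragment in fragments)
    return [fragment for fragment in cleaned if fragment]
```

Each fragment can then be classified on its own, so a review that praises the staff and complains about the wifi yields two separately tagged units.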
You can check out the generated machine learning aspect classifier here.
To crawl from TripAdvisor, use:
scrapy crawl tripadvisor_more -a start_url="http://some_url" -o <hotel_name>.csv -s CLOSESPIDER_ITEMCOUNT=20000
Set start_url to the URL of the city to start crawling from, such as https://www.tripadvisor.com/Hotels-g186338-London_England-Hotels.html.
The scripts and notebooks necessary to replicate the post are in the classify_elastic folder:
classify_elastic/generate_files_for_indexing.py will take the CSV file produced by Scrapy and generate two files that other scripts will use.
classify_elastic/classify_pipe.py will open the opinion_units file and classify it with MonkeyLearn according to topic and sentiment, and save the results to a new CSV file.
classify_elastic/index_definition.json contains the mapping definitions used in ElasticSearch.
classify_elastic/index_reviews.py will index into your ElasticSearch instance the reviews generated by generate_files_for_indexing.py.
classify_elastic/index_opinion_units.py will index into your ElasticSearch instance the classified opinion units.
classify_elastic/Extract keywords.ipynb shows how to extract keywords from the indexed data.
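As a rough illustration of the indexing step, the sketch below turns CSV rows into bulk-index actions in the dictionary format accepted by the official Python client's bulk helper. It is not taken from index_reviews.py or index_opinion_units.py: the function name rows_to_actions, the index name "reviews" and the column names in the sample are all hypothetical.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def rows_to_actions(csv_file: TextIO, index_name: str) -> Iterator[Dict]:
    """Yield one bulk-index action per CSV row.

    Each action names the target index and carries the row as the document
    source. The index name and the CSV columns are assumptions.
    """
    for row in csv.DictReader(csv_file):
        yield {"_index": index_name, "_source": dict(row)}


# Hypothetical sample standing in for a CSV file produced by the scripts.
sample = io.StringIO("title,content\nGreat stay,Clean room and friendly staff\n")
actions = list(rows_to_actions(sample, "reviews"))
```

With the official Python client, a sequence like this could be passed to elasticsearch.helpers.bulk(client, actions); check the index scripts for what the project actually does.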
Finally, the queries folder contains some queries that were used to power the Kibana visualization.