DataMining-Captone

Capstone Project for Data Mining

Task 1 (week 1)

The goal of this task is to explore the Yelp data set to get a sense of what the data look like and what their characteristics are. You can think of the goal as answering questions such as:

What are the major topics in the reviews? Are they different in the positive and negative reviews? Are they different for different cuisines? What does the distribution of the number of reviews over other variables (e.g., cuisine, location) look like? What does the distribution of ratings look like?
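The distribution questions above can be approached with simple counting before any topic modeling. The sketch below is a minimal illustration, assuming the reviews are stored as one JSON object per line with `stars`, `business_id`, and `text` fields. The sample records and the `count_by` helper are hypothetical and only illustrate the idea.

```python
import json
from collections import Counter

# Hypothetical sample records; the field names are an assumption about the data layout.
sample_lines = [
    '{"business_id": "b1", "stars": 5, "text": "Great tacos and friendly staff"}',
    '{"business_id": "b1", "stars": 4, "text": "Good salsa, slow service"}',
    '{"business_id": "b2", "stars": 1, "text": "Cold noodles and rude waiter"}',
    '{"business_id": "b2", "stars": 5, "text": "Best ramen in town"}',
]

def count_by(lines, field):
    """Count how many review records share each value of `field`."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        counts[record[field]] += 1
    return counts

# Distribution of ratings, printed as a crude text histogram.
stars = count_by(sample_lines, "stars")
for rating in sorted(stars):
    print(rating, "#" * stars[rating])

# Number of reviews per business (a stand-in for grouping by cuisine or location).
per_business = count_by(sample_lines, "business_id")
```

Grouping by cuisine or location would need those attributes joined onto each review first; the same counting step applies once they are available.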

Task 1.1

Use a topic model (e.g., PLSA or LDA) to extract topics from all the review text (or a large sample of them) and visualize the topics to understand what people have talked about in these reviews.
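As a rough illustration of the topic extraction step, here is a small LDA fit by collapsed Gibbs sampling, using only the Python standard library. It is a sketch for understanding the model, not the method used to produce the project's results: the function names (`tokenize`, `lda_gibbs`, `top_words`), the stop-word list, the hyperparameters, and the sample reviews are all assumptions. For a real sample of reviews, an established LDA library would be far faster.

```python
import random
import re
from collections import Counter

STOP_WORDS = {"the", "a", "and", "was", "is", "in", "of", "to", "it", "were", "very"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop a few common stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def lda_gibbs(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]           # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]    # count of word w in topic k
    topic_total = [0] * num_topics                         # total tokens in topic k
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, k in zip(doc, zs):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, doc_topic

def top_words(topic_word, n=5):
    """Return the n heaviest (word, weight) pairs per topic, weight = share of topic tokens."""
    result = []
    for counts in topic_word:
        total = sum(counts.values())
        result.append([(w, c / total) for w, c in counts.most_common(n) if c > 0])
    return result

# Hypothetical review texts standing in for a sample of the Yelp reviews.
reviews = [
    "The pizza crust was crispy and the cheese was fresh",
    "Fresh cheese and a great pizza sauce",
    "The waiter was rude and the service was slow",
    "Slow service and a very rude host",
]
docs = [tokenize(r) for r in reviews]
topic_word, doc_topic = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words(topic_word)):
    print(k, words)
```

The per-word weights returned by `top_words` are the kind of quantity a visualization could map to node opacity or size.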

For example, after applying LDA to a sample of the reviews, we obtained the following visualization. Here the opacity of each node corresponds to its weight in each topic.

Task 1.2

Do the same for two subsets of reviews that are interesting to compare (e.g., positive vs. negative reviews for a particular cuisine or restaurant), and visually compare the topics extracted from each subset to understand their similarities and differences. You can form the two subsets in any way that you think is interesting. Here we show an example visualization for a sample of reviews with high and low ratings.
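One possible way to support the comparison numerically is to split the reviews by rating and then measure how much the top words of each topic overlap across the two subsets. The sketch below assumes reviews are available as `(stars, text)` pairs and that topics have already been extracted as lists of top words. The rating thresholds, the helper names, and the example topics are hypothetical, and Jaccard overlap is just one simple similarity choice.

```python
def split_by_rating(reviews, low_max=2, high_min=4):
    """Split (stars, text) pairs into low- and high-rated texts; mid ratings are dropped."""
    low = [text for stars, text in reviews if stars <= low_max]
    high = [text for stars, text in reviews if stars >= high_min]
    return low, high

def jaccard(words_a, words_b):
    """Jaccard overlap between two collections of words."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_matches(topics_a, topics_b):
    """For each topic in A, find the most similar topic in B by top-word overlap."""
    matches = []
    for i, words_a in enumerate(topics_a):
        scores = [jaccard(words_a, words_b) for words_b in topics_b]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, j, scores[j]))
    return matches

# Hypothetical top words per topic, as might come from a topic model run on each subset.
high_topics = [["great", "service", "friendly", "staff"],
               ["pizza", "crust", "cheese", "fresh"]]
low_topics = [["service", "slow", "rude", "staff"],
              ["cold", "pizza", "bland", "cheese"]]

for i, j, score in best_matches(high_topics, low_topics):
    print(f"high topic {i} ~ low topic {j} (overlap {score:.2f})")
```

Topics with high overlap discuss the same subject in both subsets, and the words that differ (for example "friendly" versus "rude") hint at what separates positive from negative reviews.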