We analyze food and drink cultures in six North American cities and identify cultural boundaries across populations at different scales, based on the temporal patterns of food and drink preferences. Each region on the map reflects the cultural preferences that produced its classification.
-
Converted csv files:
./yelp_dataset/raw/
-
Join tables:
./yelp_dataset/join_table/
restaurant_checkin.csv
restaurant_review.csv
-
Restaurant data:
./yelp_dataset/restaurant_data/
-
restaurant.csv
: businesses that are tagged with restaurant and food (shopping and grocery stores are removed)
processed_review.csv
: processed reviews (tokenized, lemmatized...)
- text: processed reviews
- processed_text: business tags + processed reviews
rest_biz_tags.csv
: business_id + business tags list
sub_rest_word_freq.csv
: intermediate result of the bag-of-words model of restaurants (term frequency)
sub_rest_tf_idf.csv
: weighted terms (tags and reviews) using tf-idf
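Term weighting of the kind stored in sub_rest_tf_idf.csv can be illustrated with a small, dependency-free sketch. It uses the plain formulation tf × log(N / df). The repository may rely on a library implementation with different smoothing or normalization, so the function name `tf_idf` and the toy documents below are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length
    idf = log(number of documents / documents containing the term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical processed_text rows (business tags + lemmatized review words).
docs = [
    ["mexican", "taco", "spicy", "taco"],
    ["coffee", "pastry", "cozy"],
    ["mexican", "burrito", "spicy"],
]
weights = tf_idf(docs)
print(weights[0])
```

Terms that occur in only one restaurant ("taco") receive a higher weight than terms shared across restaurants ("mexican"), which is the property that makes the weighted vectors useful for clustering.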
-
-
Features:
./yelp_dataset/features/
featured_biz_tags.txt
: filtered business tags after removing some business outliers (spa, nail spa...)
business_tags.csv
: original list of business tags (extracted from businesses that are tagged with food, restaurant)
1000_freq_word_review.csv
: top 1000 frequent words that appear in reviews and business tags (used for selecting relevant features)
features_review.txt
: selected features related to food, drink, ambience and price from reviews
word_freq_review.csv
: intermediate result of relevant word frequency of each restaurant from its reviews
-
Convert json to csv and join tables
- json_to_csv.py
- convert business, review, and checkin json files to csv files
- filter out businesses that are not restaurants
- join_tables.py
- join restaurant and checkin tables
- join restaurant and review tables
- json_to_csv.py
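The conversion and join steps above can be sketched with the standard library alone. This is an illustrative sketch, not the code in json_to_csv.py or join_tables.py: the sample records, the field layout (`categories` as a list, `checkins` as a count), and the helper names `read_jsonl`, `is_restaurant`, `inner_join`, and `to_csv` are all assumptions.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical samples in a JSON-lines layout (one object per line).
BUSINESS_JSONL = """\
{"business_id": "b1", "name": "Taco Place", "categories": ["Restaurants", "Mexican"]}
{"business_id": "b2", "name": "Nail Bar", "categories": ["Nail Salons", "Beauty & Spas"]}
{"business_id": "b3", "name": "Cafe Uno", "categories": ["Food", "Coffee & Tea"]}
"""
CHECKIN_JSONL = """\
{"business_id": "b1", "checkins": 12}
{"business_id": "b2", "checkins": 40}
{"business_id": "b3", "checkins": 7}
"""


def read_jsonl(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def is_restaurant(business, wanted=("Restaurants", "Food")):
    """Keep businesses tagged with at least one of the wanted categories."""
    categories = business.get("categories") or []
    return any(tag in wanted for tag in categories)


def inner_join(left_rows, right_rows, key="business_id"):
    """Inner join on `key`; one left row may match several right rows."""
    right_by_key = defaultdict(list)
    for row in right_rows:
        right_by_key[row[key]].append(row)
    return [
        {**left, **right}
        for left in left_rows
        for right in right_by_key.get(left[key], [])
    ]


def to_csv(rows, fieldnames):
    """Serialize the chosen columns to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


restaurants = [b for b in read_jsonl(BUSINESS_JSONL) if is_restaurant(b)]
restaurant_checkin = inner_join(restaurants, read_jsonl(CHECKIN_JSONL))
restaurant_checkin_csv = to_csv(restaurant_checkin, ["business_id", "name", "checkins"])
print(restaurant_checkin_csv)
```

Because `inner_join` keeps every matching right-hand row, the same helper would also cover a one-to-many join such as restaurants to reviews.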
-
Detect business outliers and categorize restaurants based on business tags
    # functions for detecting business outliers, where K = 32
    unique_tags, rest_tags = get_rest_tags()
    rest_tag_df = get_rest_tag_vec(unique_tags, rest_tags)
    k_means_rest_root(rest_tag_df, NUM_CLUSTERS=32)

    # functions for categorizing restaurants based on business tags, where K = 15
    removed_categories_kmean = [20, 27, 28, 30, 16, 26, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 18, 19, 21, 22, 24, 31]
    filtered_rest = remove_business(removed_categories_kmean)
    filtered_tags = get_filtered_tags()
    filtered_rest_vec = get_rest_tag_vec(filtered_tags, filtered_rest)
    k_means_rest_root(filtered_rest_vec, NUM_CLUSTERS=15)
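To make the two-pass idea concrete, here is a minimal sketch of how business tags might be turned into vectors for k-means and how businesses in outlier clusters might then be dropped. It does not reproduce `get_rest_tag_vec` or `remove_business`: the helpers `build_tag_matrix` and `drop_clusters`, the sample tags, and the cluster labels are hypothetical.

```python
def build_tag_matrix(unique_tags, rest_tags):
    """Multi-hot encode each business: 1 if it carries the tag, else 0."""
    column = {tag: i for i, tag in enumerate(unique_tags)}
    matrix = {}
    for business_id, tags in rest_tags.items():
        row = [0] * len(unique_tags)
        for tag in tags:
            if tag in column:  # ignore tags outside the vocabulary
                row[column[tag]] = 1
        matrix[business_id] = row
    return matrix


def drop_clusters(labels, removed):
    """Keep only businesses whose cluster label is not in `removed`."""
    return [bid for bid, label in labels.items() if label not in removed]


# Hypothetical business tags.
rest_tags = {
    "b1": ["Mexican", "Tacos"],
    "b2": ["Coffee & Tea", "Bakeries"],
    "b3": ["Mexican", "Bars"],
}
unique_tags = sorted({tag for tags in rest_tags.values() for tag in tags})
tag_matrix = build_tag_matrix(unique_tags, rest_tags)
print(unique_tags)
print(tag_matrix)

# Hypothetical labels from a first clustering pass; cluster 1 is treated as an outlier.
first_pass_labels = {"b1": 0, "b2": 1, "b3": 0}
kept = drop_clusters(first_pass_labels, removed={1})
print(kept)
```

The rows of `tag_matrix` can be passed to any k-means implementation. After the outlier clusters are removed, the vocabulary is rebuilt from the remaining businesses before clustering again with the smaller K.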
-
Relevant features extraction
python3 feature_extraction.py
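As a rough illustration of what this step produces, the sketch below ranks words by overall frequency and then counts a chosen feature list per restaurant. It is not the logic of feature_extraction.py: the helper names `top_words` and `feature_counts`, the sample tokens, and the hand-picked `features` list are assumptions.

```python
from collections import Counter

# Hypothetical tokenized reviews keyed by business_id.
reviews = {
    "b1": ["taco", "spicy", "taco", "cheap", "friendly"],
    "b2": ["coffee", "cozy", "coffee", "pastry", "friendly"],
    "b3": ["taco", "beer", "spicy", "loud"],
}


def top_words(reviews, n):
    """Return the n most frequent words across all restaurants."""
    totals = Counter()
    for tokens in reviews.values():
        totals.update(tokens)
    return [word for word, _ in totals.most_common(n)]


def feature_counts(reviews, features):
    """Count each selected feature word per restaurant, in feature order."""
    result = {}
    for business_id, tokens in reviews.items():
        counts = Counter(tokens)
        result[business_id] = [counts[feature] for feature in features]
    return result


candidates = top_words(reviews, 4)  # analogous to the top-1000 word list
features = ["taco", "coffee", "spicy", "cheap"]  # hand-picked, as in features_review.txt
word_freq = feature_counts(reviews, features)
print(candidates)
print(word_freq)
```

The frequency ranking only narrows the candidates. Deciding which words relate to food, drink, ambience, or price remains a manual selection step.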
-
Further categorize restaurants into subcategories based on selected features in
features_review.txt
Use the elbow method to determine the number of clusters
python3 rest_sub_clustering.py
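The elbow method can be sketched as follows: run k-means for a range of k, record the within-cluster sum of squares (inertia), and look for the k after which further clusters bring little improvement. This is a small pure-Python sketch under stated assumptions, not the code in rest_sub_clustering.py: the deterministic farthest-point seeding, the second-difference heuristic in `elbow`, and the toy points are illustrative choices.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic seeding: first point, then the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def k_means(points, k, max_iter=100):
    """Lloyd's algorithm; returns (centers, inertia)."""
    centers = farthest_point_init(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments no longer change the means
            break
        centers = new_centers
    inertia = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, inertia


def elbow(inertias):
    """Pick the k with the largest second difference of inertia (consecutive k assumed)."""
    ks = sorted(inertias)
    return max(
        ks[1:-1],
        key=lambda k: (inertias[k - 1] - inertias[k]) - (inertias[k] - inertias[k + 1]),
    )


# Toy feature vectors forming three well-separated groups.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 0), (20, 1), (21, 0), (21, 1),
]
inertias = {k: k_means(points, k)[1] for k in range(1, 6)}
best_k = elbow(inertias)
print(inertias)
print(best_k)
```

In practice the curve is usually inspected visually as well, since real review features rarely produce an elbow as sharp as this toy data does.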
-
Measure cultural similarities and boundaries at different scales with pairwise cosine similarities
python3 get_cult_bound.py
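Pairwise cosine similarity compares the direction of two preference vectors regardless of their magnitude, so a large region and a small region with the same mix of preferences score close to 1. The sketch below is illustrative only and is not taken from get_cult_bound.py: the region names, the three-dimensional profiles, and the helper names are assumptions.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarity(profiles):
    """Return {(region_a, region_b): similarity} for every unordered pair."""
    return {
        (a, b): cosine_similarity(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


# Hypothetical profiles: activity per restaurant subcategory in each region.
profiles = {
    "region_a": [4, 0, 3],
    "region_b": [8, 0, 6],  # same mix as region_a at twice the volume
    "region_c": [0, 5, 0],  # no overlap with region_a
}
similarities = pairwise_similarity(profiles)
for pair, score in similarities.items():
    print(pair, round(score, 3))
```

A low similarity between neighbouring regions is what would suggest a cultural boundary. The same computation can be repeated with profiles aggregated at a coarser or finer spatial scale.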
- Julia Hsu & Aiswarya Kannan - Initial work