/WhatsCooking_WWC

women who code project folder - what's cooking

Primary LanguageHTML

Week 1: Intro (End 5/5)

Week 2: Discover 3 Features (End 5/12)

Summary Data

Total Recipes: 39774

# of cuisines 20
'italian' 7838
'mexican' 6438
'southern_us' 4320
'indian' 3003
'chinese' 2673
'french' 2646
'cajun_creole' 1546
'thai' 1539
'japanese' 1423
'greek' 1175
'spanish' 989
'korean' 830
'vietnamese' 825
'moroccan' 821
'british' 804
'filipino' 755
'irish' 667
'jamaican' 526
'russian' 489
'brazilian' 467

Total Ingredients: 428275

# of ingredients 6714
'salt' 18049
'olive oil' 7972
'onions' 7972
'water' 7457
'garlic' 7380
'sugar' 6434
'garlic cloves' 6237
'butter' 4848
'ground black pepper' 4785
'all-purpose flour' 4632
'pepper' 4438
'vegetable oil' 4385
'eggs' 3388
'soy sauce' 3296
'kosher salt' 3113
'green onions' 3078
'tomatoes' 3058
'large eggs' 2948
'carrots' 2814
'unsalted butter' 2782
'extra-virgin olive oil' 2747
'ground cumin' 2747
'black pepper' 2627
'milk' 2263
'chili powder' 2036
'oil' 1970
'red bell pepper' 1939
'purple onion' 1896
'scallions' 1891

Ingredients by Cuisine

Full set of Pie Charts Summary of Cuisines per Ingredient

Number of Cuisines per Ingredient

Full set of Pictures

Summary of Cuisines per Ingredient Summary of Cuisines per Ingredient Summary of Cuisines per Ingredient

Distribution of Categorized Ingredients

Distribution of categories of ingredients in cuisine

Appendix

We graphed each cuisine and the number of ingredients used in each of their recipes in hopes that would be a differentiator. However

  • the cuisines were very similar.

Ingredients per Cuisine

Week 3: Find 2 Models (End 5/19)

NB with TFID

Distribution of categories of ingredients in cuisine

  • Consistently around 63-65% accuracy (Full set results for trial/error)
  • Steps:
    • Create bag of words (or ingredient phrases in this case) for train dataset
    • Calculate tfidf for each ingredient and cuisine
    • Use MultinomialNB model, with tfidf and accompanying cuisines, to match up recipes
    • Predict based on cuisine with highest probability, calculated by summing cuisine probability for ingredient with probabilities higher than 0.4
  • Higher accuracy when:
    • Each recipe was a single "document"
    • full ingredients were used rather than ingredient words
    • cleaned for accents and lowercased
    • common/meaningless modifiers removed
    • hypens/parentheticals removed
    • low alpha in NB model
    • using MultinomialNB versus BernoulliNB
    • Use probability of 0.4 as cut-off point for when an ingredient is used for classifying
    • Used sum of quadratic probabilities (weight higher probabilities higher)
  • TODO:
    • Figure out probabilities where classification should be done with another model
    • Figure out which cuisines are frequently confused
    • Perhaps combine with ingredient-type and words

SVM with CountVectorizor

SVM charts SVM result

  • TODO:
    • Further clean data
    • Add ingredient count feature
    • Increase word list
    • Add cross validation
    • Experiment with params (only optimized for γ)

KNN

knn result

  • knn(train = traindf, test = testdf,cl = prc_train_labels, k=3)
Cuisine Predictions
brazilian 0
0.044
0.000
0.000
0.000
british 0
0.026
0.000
0.000
0.000
cajun_creole
chinese 0
0.188
0.000
0.000
0.000
filipino
french 0
0.300
0.000
0.000
0.000
greek 0
0.600
0.000
0.000
0.000
indian 0
0.094
0.000
0.000
0.000
irish
italian 1
6.371
0.006
0.025
0.001
jamaican
japanese 3
0.893
0.150
0.043
0.004
korean 0
0.138
0.000
0.000
0.000
mexican 109
0.006
0.784
0.172
0.136
moroccan 0
0.020
0.000
0.000
0.000
russian 0
0.016
0.000
0.000
0.000
southern_us 0
1.365
0.000
0.000
0.000
spanish
thai 0
0.065
0.000
0.000
0.000
vietnamese 0
0.012
0.000
0.000
0.000

Week 4: Submit 1 Kaggle (End 6/2)

###Pipeline pipeline ###Kaggle result kaggle submission NC result ###Confusion Matrix confusion matrix optimization