Java implementation of the paper "Finding Highly Correlated Pairs with Powerful Pruning" by Zhang, J., & Feigenbaum, J. (2006, November). In Proceedings of the 15th ACM international conference on Information and knowledge management (pp. 152-161). ACM. pdf
The only dependency is Guava.
Licensed under the Apache 2.0 License.
The input are many transactions (aka baskets), each of which is a set of items. This algorithm will efficiently find items that tend to occur together in the same transactions. Technically, this is finding all pairs of items whose phi coefficient is above a specified threshold. This correlation is measured on the sets of transactions the two items occur in. These transactions can be any sets, such as baskets of products a user purchased or words in documents. This project provides sets of ingredients used by recipes, extracted from the Kaggle What's Cooking dataset in the file kaggle_whats_cookin.txt
. That data has one listing of ingredient per line and looks like this:
plain_flour ground_pepper salt tomatoes ground_black_pepper thyme eggs green_tomatoes yellow_corn_meal milk vegetable_oil
eggs pepper salt mayonaise cooking_oil green_chilies grilled_chicken_breasts garlic_powder yellow_onion soy_sauce butter chicken_livers
...
Here are pairs of ingredients sorted only by the number of recipes they occur in together.
ingredient1 ingredient2 freq
==============================================
salt onions 4392
salt olive_oil 4180
salt water 3960
pepper salt 3844
garlic salt 3749
salt all-purpose_flour 3079
salt sugar 3061
salt garlic_cloves 2998
salt butter 2777
salt ground_black_pepper 2737
garlic onions 2659
onions olive_oil 2207
olive_oil garlic_cloves 2103
salt vegetable_oil 2101
garlic olive_oil 2080
water onions 1974
salt eggs 1919
salt black_pepper 1792
salt large_eggs 1700
salt tomatoes 1682
onions garlic_cloves 1643
garlic water 1605
water sugar 1576
salt ground_cumin 1533
ground_black_pepper olive_oil 1520
While this listing is useful, it does not account for the fact that certain ingredients are more common than others and will therefore occur with many other ingredients purely because they are both frequent. That's why "salt" shows up so often. Let's use the FHCP algorithm to find only pairs of ingredients that occur together at least ten times, and whose phi coefficient is at least 0.3:
ingredient1 ingredient2 Phi
==============================================
green_cardamom brown_cardamom : 0.460
warm_water active_dry_yeast : 0.453
pastry single_crust_pie : 0.450
garlic_powder onion_powder : 0.436
soy_sauce sesame_oil : 0.431
mirin sake : 0.409
buttermilk baking_soda : 0.408
wasabi nori_sheets : 0.405
kaffir_lime_leaves galangal : 0.393
green_cardamom shahi_jeera : 0.385
garlic_paste ginger_paste : 0.379
baking_powder baking_soda : 0.368
lemongrass galangal : 0.367
baking_powder all-purpose_flour : 0.364
cinnamon_sticks clove : 0.348
dried_bonito_flakes konbu : 0.341
warm_water dry_yeast : 0.337
ground_cumin ground_coriander : 0.336
cumin_seed coriander_seeds : 0.329
garam_masala coriander_powder : 0.328
mascarpone ladyfingers : 0.321
konbu bonito_flakes : 0.319
garlic_paste coriander_powder : 0.307
cumin_seed ground_turmeric : 0.306
garam_masala ground_turmeric : 0.305
rice_noodles beansprouts : 0.304
light_soy_sauce dark_soy_sauce : 0.301
lemongrass kaffir_lime_leaves : 0.301
ground_turmeric coriander_powder : 0.301
Much more interesting. The pairs make sense, yeast needs warm water to be activated, and many of the pairs reflect common regional pairings.
While the phi coefficient is a good measure of associativity, and it is equal to Pearson correlation for binary data, is proportional to the Chi square test, and relates to Jaccard similarity, it isn't the only one. There are other measures of associativity on 2x2 contingency tables, such as Fisher's eact test and the g-score. The algorithm implemented here uses Minhashes to find items whose transaction sets have high Jaccard similarity in a first pass. Then the phi coefficient is calculated in a second pass. This second pass could easily be replaced by one that caclulated an alternative associativity measure.
The algorithm has four parameters:
- theta
- The minimum Phi correlation that two items should have between their transaction sets. (e.g., 0.3)
- tau
- False negative tolerance. Allow no more than this fraction of false negatives. (e.g., 0.05 means 5%)
- minsup
- The minimum number of transactions two items must occur in together to be considered correlated. (e.g., 10)
- k
- Number simultaneous minhashes needed to match. For large numbers of unique items such as in the recipe dataset, k=1 is okay. For many transactions and small number of unique item types, large k values may be reasonable.
Everything is implemented in FHCP.java
, which has a main()
method that takes in the data file name:
java -classpath . FHCP kaggle_whats_cookin.txt
A related concept is frequent itemset mining, for which there is a great Java library called SPMF.