/frequent-itemset-association

Market basket analysis: finding frequent itemsets with the SON algorithm in Spark

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Market Basket Analysis

  • Implements the SON algorithm, which uses Apriori, to find items that are frequently bought together in a supermarket
  • Developed on Spark, a distributed 'Big Data' environment
  • The SON algorithm runs as two MapReduce phases: the first finds candidate itemsets in each chunk of the data, and the second counts those candidates over the full dataset
  • Run with different support thresholds and data sizes (1 KB to 500 MB)
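
The two phases can be illustrated with a minimal sketch. It is not the code in Nikhit_Mago_SON.py: it uses plain Python lists in place of Spark RDD partitions, and the names apriori, son, num_chunks and example_baskets are made up for illustration. Phase 1 runs Apriori on each chunk with the support threshold scaled down to the chunk's share of the data, and phase 2 counts every candidate over all baskets to discard false positives.

```python
from collections import Counter
from itertools import combinations


def apriori(baskets, threshold):
    """Return all itemsets (as sorted tuples) contained in at least `threshold` baskets."""
    baskets = [frozenset(b) for b in baskets]
    counts = Counter(item for b in baskets for item in b)
    current = {(item,) for item, n in counts.items() if n >= threshold}
    frequent = set(current)
    k = 2
    while current:
        # Build size-k candidates from the items that still appear in frequent sets.
        items = sorted({i for itemset in current for i in itemset})
        candidates = [
            c for c in combinations(items, k)
            # Prune: every (k-1)-subset of a candidate must itself be frequent.
            if all(sub in current for sub in combinations(c, k - 1))
        ]
        counts = Counter(c for b in baskets for c in candidates if b.issuperset(c))
        current = {c for c, n in counts.items() if n >= threshold}
        frequent |= current
        k += 1
    return frequent


def son(baskets, support, num_chunks=2):
    """Two-phase SON over in-memory chunks (a stand-in for Spark partitions)."""
    chunks = [baskets[i::num_chunks] for i in range(num_chunks)]

    # Phase 1: local Apriori per chunk with a proportionally scaled threshold.
    candidates = set()
    for chunk in chunks:
        if not chunk:
            continue
        scaled = support * len(chunk) / float(len(baskets))
        candidates |= apriori(chunk, scaled)

    # Phase 2: count each candidate over the whole dataset, keep the truly frequent ones.
    totals = Counter()
    for chunk in chunks:
        for basket in chunk:
            basket = set(basket)
            for c in candidates:
                if basket.issuperset(c):
                    totals[c] += 1
    frequent = [c for c, n in totals.items() if n >= support]
    return sorted(frequent, key=lambda c: (len(c), c))


# Made-up sample baskets, for illustration only.
example_baskets = [
    ["milk", "bread"],
    ["milk", "bread", "eggs"],
    ["bread", "eggs"],
    ["milk", "eggs"],
    ["milk", "bread", "eggs"],
    ["bread"],
]
result = son(example_baskets, support=3)
```

In a Spark version, each loop over chunks would typically become a pass over the RDD's partitions (for example with mapPartitions followed by a reduce step), but that mapping is an assumption about how one might structure it, not a description of this repository's script.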

Command for the small dataset:

bin/spark-submit Nikhit_Mago_SON.py [case] [......../Small2.csv] [support]

Command for the medium dataset:

bin/spark-submit Nikhit_Mago_SON.py [case] [......../MovieLens.Small.csv] [support]

Command for the large dataset:

bin/spark-submit Nikhit_Mago_SON.py [case] [......../MovieLens.Big.csv] [support]

Notes:

  • Please use Python 2.7 and Spark 2.2.1 to execute the PySpark script
  • Data Source is provided here
  • MovieLens.Big.csv is the ratings.csv file from ml-20m dataset
  • MovieLens.Small.csv is the ratings.csv file from ml-latest-small dataset
  • Case 1 finds combinations of movies (singletons, pairs, triples, etc.) that were rated together often enough to qualify as frequent for the given support threshold.
  • Case 2 finds combinations of users (singletons, pairs, triples, etc.) who rated the same movies often enough to qualify as frequent for the given support threshold.
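
A short sketch of how the baskets for the two cases might be built from a ratings file. It assumes that Case 1 uses one basket per user holding the movies that user rated, and Case 2 uses one basket per movie holding the users who rated it; the README does not spell this out, so treat it as an interpretation. The column names userId and movieId come from the MovieLens ratings.csv header; build_baskets and the sample rows are made up for illustration.

```python
import csv
from collections import defaultdict


def build_baskets(rows, case):
    """Group rating rows into baskets.

    case 1: key = userId,  basket = set of movieIds rated by that user (assumed).
    case 2: key = movieId, basket = set of userIds who rated that movie (assumed).
    """
    baskets = defaultdict(set)
    for row in rows:
        user, movie = int(row["userId"]), int(row["movieId"])
        if case == 1:
            baskets[user].add(movie)
        elif case == 2:
            baskets[movie].add(user)
        else:
            raise ValueError("case must be 1 or 2")
    return dict(baskets)


# Made-up rows in the ratings.csv layout (userId,movieId,rating,timestamp).
sample_lines = [
    "userId,movieId,rating,timestamp",
    "1,10,4.0,0",
    "1,20,3.5,0",
    "2,10,5.0,0",
    "2,30,2.0,0",
    "3,10,4.5,0",
    "3,20,4.0,0",
]

case1_baskets = build_baskets(csv.DictReader(sample_lines), case=1)
case2_baskets = build_baskets(csv.DictReader(sample_lines), case=2)
```

The rating and timestamp columns are ignored here because only the fact that a user rated a movie matters for basket membership.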

Execution times for the large dataset (MovieLens.Big.csv):

Case 1:

Support Threshold    Execution Time (s)
30000                ~1110
35000                ~365

Case 2:

Support Threshold    Execution Time (s)
2800                 ~1400
3000                 ~850