Parallel Apriori Algorithm

C++ Implementation of Parallel Apriori Algorithm

This project modified from bowbowbow/Apriori. We provide 6 different parallel methods:

The test case we used is a renowned transactional database T10I4D100K.

Enviroment

OS: Ubuntu 20.04.3 LTS
g++: 9.3.0
MPI: 4.0.3
CUDA driver: 11.5
CUDA: 10.1.243
Hadoop: 3.3.1

Run Apriori Algorithm

Move to the folder you want.
```
cd [folder]
```
Follow the steps in README.md to install the dependencies.
Build executables.
```
make
```

Run the Apriori algorithm with following command.

./[executable] [minimum support] [input transactions] [output result]

Example

For instance, if you want to run the pthread version of the Apriori algorithm, you can use the commands below.
```
cd pthread
make
./apriori-thread 1 ../testcases/transactional_T10I4D100K.csv output.txt
```

Details

Summary of the algorithm

============ Apriori algorithm is to find frequent itemsets using an iterative level-wise approach based on candidate generation.

Input: A database of transactions, the minimum support count threshold

Output: frequent itemsets in the database

The algorithm solves the problem with a two-step approach.

Step1. Frequent Itemset Generation

Generate all itemsets whose support with a value of minimum support or greater.

But the proccess requires a lot of computation (O(3^(k)-2^(k+1)+1), k=the number of item).

The key idea of apriori algorithmto reduce this operation is that any subset of a frequent itemset must be frequent.

Therefore, if there is any itemset which is infrequent, its superset should not be generated/tested.

As a result, follow the steps below to proceed with this process.

Generate length (k+1)-candidate itemsets from length k frequent itemsets. (This process is called joining.)

And delete newly generated (k+1)-items if the item set that removed one element is not in (k)-candidate. (This process is called pluning.)

Calculate the support of the candidates and remove candidates with support less than min support.

Proceed 1 again until there are no more candidates left.

Step2. Associate Rule Generation

Generate high confidence rules from each frequent itemset, where each rule is a binary partition of a frequent itemset.

About input file

============

Input file format

[item_id]\t[item_id]\n

[item_id]\t[item_id]\t[item_id]\t[item_id]\t[item_id]\n

[item_id]\t[item_id]\t[item_id]\t[item_id]\n

Row is transaction
[item_id] is a numerical value
There is no duplication of items in each transaction

About output file