Top-K-HUIM

Implementation of the EFIM algorithm to solve the top-K HUIM problem.

Primary language: C++ · License: GNU Affero General Public License v3.0 (AGPL-3.0)

Top K High Utility Itemsets Mining

Understanding and implementing the EFIM algorithm to solve the top-K HUIM problem.

Problem definition

Basic definitions

Transaction database

Let $I$ be a finite set of items (symbols). An itemset $X$ is a finite set of items such that $X \subseteq I$. A transaction $T$ is a set of items where each item $i \in T$ has an associated utility, denoted $U(i, T)$. A transaction database is a set of transactions.

Utility of an itemset in a transaction

The utility of an itemset $X$ in a transaction $T$ (where $X \subseteq T$) is defined as $$U(X, T) = \sum_{i \in X} U(i, T)$$

Utility of an itemset in a transaction database

Naturally, the utility of an itemset $X$ in a database $D$ is defined as $$U_D(X) = \sum_{T \in D \,\wedge\, X \subseteq T} U(X, T)$$

The problem of top $K$ high utility itemsets mining is to find top $K$ itemsets by utility in the given transaction database.

This problem is particularly interesting as the utility measure is neither monotonic nor anti-monotonic, i.e., the utility of an itemset may be lower, higher or equal to that of its subsets.
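These definitions, and the failure of monotonicity, can be illustrated on a toy database. The following is a minimal sketch (not code from this repository; the type and function names are invented for illustration):

```cpp
#include <map>
#include <set>
#include <vector>

// A transaction maps each item to its utility U(i, T) in that transaction.
using Transaction = std::map<char, int>;

// Utility of itemset X in transaction T; 0 if T does not contain X.
int utilityInTransaction(const std::set<char>& X, const Transaction& T) {
    int total = 0;
    for (char item : X) {
        auto it = T.find(item);
        if (it == T.end()) return 0;  // X is not contained in T
        total += it->second;
    }
    return total;
}

// Utility of X over the database D: sum over transactions containing X.
int utilityInDatabase(const std::set<char>& X, const std::vector<Transaction>& D) {
    int total = 0;
    for (const auto& T : D) total += utilityInTransaction(X, T);
    return total;
}
```

On the database $\{(a{:}5, b{:}2, c{:}1),\ (a{:}4, c{:}3),\ (b{:}6)\}$ this gives $U_D(\{b\}) = 8$ but $U_D(\{a,b\}) = 7$: adding an item restricts the set of supporting transactions, so the utility of a superset can be lower, higher, or equal.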

About EFIM

Among several others, EFIM (Zida et al., 2015) is a high utility itemset mining algorithm. It finds all itemsets whose utility is at least a given minimum utility threshold, using pruning strategies that greatly reduce the search space.
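EFIM's own pruning bounds (sub-tree utility and local utility) are refinements of a simpler idea used throughout the HUIM literature: the transaction-weighted utilization (TWU), an anti-monotone upper bound on utility. A sketch of that simpler bound, under the same toy `Transaction` representation as above (illustrative only, not the repository's code):

```cpp
#include <map>
#include <set>
#include <vector>

using Transaction = std::map<char, int>;

// Transaction utility: total utility of all items in T.
int transactionUtility(const Transaction& T) {
    int tu = 0;
    for (const auto& [item, util] : T) tu += util;
    return tu;
}

// TWU(X): sum of transaction utilities over transactions containing X.
// For every superset Y of X, U_D(Y) <= TWU(X), and TWU never increases
// when items are added to X, so any X with TWU(X) below the minimum
// utility threshold can be pruned together with all of its supersets.
int twu(const std::set<char>& X, const std::vector<Transaction>& D) {
    int total = 0;
    for (const auto& T : D) {
        bool contains = true;
        for (char item : X) {
            if (!T.count(item)) { contains = false; break; }
        }
        if (contains) total += transactionUtility(T);
    }
    return total;
}
```

Anti-monotone bounds like this restore the pruning power that the raw utility measure lacks; EFIM's tighter bounds follow the same pattern but discard more of the search space per check.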

Modifying EFIM to solve top K HUIM

Priority queue version

Rather than having a fixed minimum utility threshold, we start with $min\_util = 1$. As the EFIM algorithm discovers new itemsets, we update the minimum utility threshold to the $K^{th}$ highest utility among the currently discovered itemsets. We only need to maintain the top $K$ itemsets, which calls for a priority queue data structure.
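The threshold-raising idea can be sketched with a small min-heap helper (an illustrative sketch with an invented class name, standing in for the bookkeeping around the actual EFIM search):

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Keeps the K highest utilities seen so far in a min-heap; once K itemsets
// are known, the heap's top is the K-th highest utility, which becomes the
// current min_util threshold for pruning the rest of the search.
class TopKThreshold {
public:
    explicit TopKThreshold(std::size_t k) : k_(k) {}

    // Report the utility of a newly discovered itemset.
    void offer(long long utility) {
        if (heap_.size() < k_) {
            heap_.push(utility);
        } else if (utility > heap_.top()) {
            heap_.pop();          // evict the old K-th highest utility
            heap_.push(utility);
        }
    }

    // Current minimum utility threshold: 1 until K itemsets are known,
    // then the K-th highest utility discovered so far.
    long long minUtil() const {
        return heap_.size() < k_ ? 1 : heap_.top();
    }

private:
    std::size_t k_;
    std::priority_queue<long long, std::vector<long long>,
                        std::greater<long long>> heap_;
};
```

Each `offer` costs $O(\log K)$, and the threshold only ever rises, so every raise strengthens the pruning for the remainder of the search.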

Binary search version

We binary search on the minimum utility threshold. However, a naive binary search is rather slow, since it takes around $10$ iterations of the costly search routine to reduce a range of size $1000$. Instead, once a search pass retrieves at least $K$ itemsets, we stop prematurely and discard the excess by sorting.
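A sketch of the early-exit idea, assuming a hypothetical `mineAtLeast(threshold)` callback that stands in for one EFIM run and returns the utilities of all itemsets meeting the threshold (names and structure invented for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Lower the threshold until one mining pass yields at least K itemsets,
// then stop early and trim the excess by sorting, instead of running
// further passes to shrink the threshold range all the way down.
std::vector<long long> binarySearchTopK(
    std::size_t k, long long maxUtilEstimate,
    const std::function<std::vector<long long>(long long)>& mineAtLeast) {
    long long lo = 1, hi = maxUtilEstimate;
    std::vector<long long> found;
    while (true) {
        long long mid = lo + (hi - lo) / 2;
        found = mineAtLeast(mid);
        if (found.size() >= k || mid == lo) break;  // enough found, or floor reached
        hi = mid - 1;  // too few itemsets survive: lower the threshold
    }
    std::sort(found.begin(), found.end(), std::greater<long long>());
    if (found.size() > k) found.resize(k);
    return found;
}
```

The cost profile this sketch reflects: each `mineAtLeast` call is expensive, so trading a few extra retrieved itemsets (cheap to sort away) for fewer search passes is usually a win.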

This repository contains C++ implementations of the above ideas and a comparison of their running times on several datasets available at the SPMF open-source library.

I have also created a command-line tool (huim.exe) to conveniently run the algorithms on SPMF-formatted datasets.

Results

Execution times of both algorithms on the following datasets were recorded and plotted. They were also compared with the TKU algorithm (Tseng et al., 2015) on two datasets.

The following running times were achieved on a standard Windows 11 laptop.

Chicago crimes
Ecommerce retail
Foodmart (with TKU)
Fruithut
Mushroom (with TKU)

Randomly generated databases

Random 1
Random 2

Conclusions

  • Both versions vastly outperform the TKU algorithm (Tseng et al., 2015).
  • The priority queue version has a nearly uniform running time, varying little across values of $K$.
  • The binary search version's running time is less predictable. Since the initial maximum utility estimate is usually far beyond the actual maximum utility, several iterations are needed before the search converges to a meaningful range.
  • The binary search version outperforms the priority queue version for relatively small values of $K$.