Understanding and implementing the EFIM algorithm to solve the top-K HUIM problem.
Basic definitions
Transaction database
Let
Utility of an itemset in a transaction
The utility of an itemset
Utility of an itemset in a transaction database
Naturally, the utility of an itemset
The problem of top
This problem is particularly interesting as the utility measure is neither monotonic nor anti-monotonic, i.e., the utility of an itemset may be lower, higher or equal to that of its subsets.
Among several others, EFIM (Zida et al., 2015) is a high utility itemset mining algorithm. It finds all itemsets that have utility higher than a given a minimum utility threshold using pruning strategies that greatly reduce the search space.
Rather than having a fixed minimum utility threshold, we start with
We binary search on the minimum utility threshold. However, a naive binary search is rather slow because it takes around
The above repo contains C++ implementations of the above ideas and a comparison of their running times on several datasets available at the SPMF open source library
I have also created (huim.exe) to run the algorithms on spmf-formatted datasets via the command line conveniently.
Execution times of both algorithms on the following datasets were recorded and plotted. They were also compared with the TKU algorithm (Tseng et al., 2015) on two datasets.
The following running times were achieved on a standard Windows 11 laptop.
- Both versions vastly outperform the TKU algorithm (Tseng et al., 2015).
- The priority queue version has an almost uniform running time hardly wavering.
- The binary search version however does not have a very monotonic performance. Since the maximum utility estimate is usually way beyond the actual maximum utility value, several iterations are required for the search to converge to a meaningful range.
- The binary search version outperforms the priority queue version on relatively smaller values of
$K$ .