/fim

fim is a collection of some popular frequent itemset mining algorithms implemented in Go.

Primary LanguageGoMIT LicenseMIT

fim

Build Status Go Report Card License

fim is a collection of some popular frequent itemset mining algorithms implemented in Go.

fim contains the implementations of the following algorithms:

  • apriori: Based on Fast Algorithms for Mining Associations Rules. As an optimization, the approach to determine the support of the candidates with a hash-tree and a transaction bitmap was replaced with an approach based on a trie. As a further optimization, the generation of candidates of length 2 is omitted in favor of counting each occurring item pair in a 2D map.
  • eclat: As described in Scalable Algorithms for Association Mining. The paper describes and analyses several approaches. The eclat algorithm is implemented based on the approach named Eclat. For the determination of frequent itemsets of length 2, a 2D map is used instead of a 2D array as described in the paper.
  • fpgrowth: Based on Mining Frequent Patterns without Candidate Generation. Implemented mostly as described in the paper. The only difference is that there is no special-casing implemented for FP-trees that are paths.

Build

Execute the following commands to build the fim tool:

git clone https://github.com/paulfedorow/fim.git
cd fim
make

Usage

To see which arguments fim supports execute the following command:

build/fim -h

The following example finds all frequent itemsets with a minimal support of 1% in the dataset contained in datasets/retail.dat:

build/fim -a fpgrowth -i datasets/retail.dat -s 0.01 

Dataset format

fims dataset file format is as follows:

File = {Transaction}
Transaction = Item {" " Item} "\n"
Item = {"0" ... "9"}

Each line in the file is a transaction. A line is expected to be a series of integers separated by a single space. Each integer is an item of the corresponding transaction.

Performance

To determine which algorithm is the most efficient, the runtime of each algorithm was measured with different datasets and decreasing minimal support. The datasets that were used are retail.dat and chess.dat from FIMI Dataset Repository. The datasets are respectively sparse and dense. The results are shown below. The algorithm fpgrowth is best in terms of runtime.

Runtime chart for datasets/retail.dat

Runtime chart for datasets/chess.dat