/PrefixSpan-py

The shortest yet efficient Python implementation of the sequential pattern mining algorithm PrefixSpan, closed sequential pattern mining algorithm BIDE, and generator sequential pattern mining algorithm FEAT.

Primary LanguagePythonMIT LicenseMIT

PyPI version PyPI pyversions PyPI license

Featured on ImportPython Issue 173. Thank you so much for support!

The shortest yet efficient implementation of the famous frequent sequential pattern mining algorithm PrefixSpan, the famous frequent closed sequential pattern mining algorithm BIDE (in closed.py), and the frequent generator sequential pattern mining algorithm FEAT (in generator.py), as a unified and holistic algorithm framework.

  • BIDE is usually much faster than PrefixSpan on large datasets, as only a small subset of closed patterns sharing the equivalent information of all the patterns are returned.

  • FEAT is usually faster than PrefixSpan but slower than BIDE on large datasets.

For simpler code, some general purpose functions have been moved to be part of a new library extratools.

Reference

Research Papers

PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth.
Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Meichun Hsu.
Proceedings of the 17th International Conference on Data Engineering, 2001.
BIDE: Efficient Mining of Frequent Closed Sequences.
Jianyong Wang, Jiawei Han.
Proceedings of the 20th International Conference on Data Engineering, 2004.
Efficient mining of frequent sequence generators.
Chuancong Gao, Jianyong Wang, Yukai He, Lizhu Zhou.
Proceedings of the 17th International Conference on World Wide Web, 2008.

Alternative Implementations

I created this project with the original minimal 15 lines implementation of PrefixSpan for educational purpose. However, as this project grows into a full feature library, its code size also inevitably grows. I have revised and reuploaded the original implementation as a GitHub Gist here for reference.

You can also try my Scala version of PrefixSpan.

Features

Outputs traditional single-item sequential patterns, where gaps are allowed between items.

  • Mining top-k patterns is supported, with respective optimizations on efficiency.

  • You can limit the length of mined patterns. Note that setting maximum pattern length properly can significantly speedup the algorithm.

  • Custom key function, custom filter function, and custom callback function can be applied.

Installation

This package is available on PyPI. Just use pip3 install -U prefixspan to install it.

CLI Usage

You can simply use the algorithms on terminal.

Usage:
    prefixspan-cli (frequent | top-k) <threshold> [options] [<file>]

    prefixspan-cli --help


Options:
    --text             Treat each item as text instead of integer.

    --closed           Return only closed patterns.
    --generator        Return only generator patterns.

    --key=<key>        Custom key function. [default: ]
                       Must be a Python function in form of "lambda patt, matches: ...", returning an integer value.
    --bound=<bound>    The upper-bound function of the respective key function. When unspecified, the same key function is used. [default: ]
                       Must be no less than the key function, i.e. bound(patt, matches) ≥ key(patt, matches).
                       Must be anti-monotone, i.e. for patt1 ⊑ patt2, bound(patt1, matches1) ≥ bound(patt2, matches2).

    --filter=<filter>  Custom filter function. [default: ]
                       Must be a Python function in form of "lambda patt, matches: ...", returning a boolean value.

    --minlen=<minlen>  Minimum length of patterns. [default: 1]
    --maxlen=<maxlen>  Maximum length of patterns. [default: 1000]
  • Sequences are read from standard input. Each sequence is integers separated by space, like this example:
cat test.dat

0 1 2 3 4
1 1 1 3 4
2 1 2 2 0
1 1 1 2 2
  • When dealing with text data, please use the --text option. Each sequence is words separated by space, assuming stop words have been removed, like this example:
cat test.txt

a b c d e
b b b d e
c b c c a
b b b c c
  • The patterns and their respective frequencies are printed to standard output.
prefixspan-cli frequent 2 test.dat

0 : 2
1 : 4
1 2 : 3
1 2 2 : 2
1 3 : 2
1 3 4 : 2
1 4 : 2
1 1 : 2
1 1 1 : 2
2 : 3
2 2 : 2
3 : 2
3 4 : 2
4 : 2
prefixspan-cli frequent 2 --text test.txt

a : 2
b : 4
b c : 3
b c c : 2
b d : 2
b d e : 2
b e : 2
b b : 2
b b b : 2
c : 3
c c : 2
d : 2
d e : 2
e : 2

API Usage

Alternatively, you can use the algorithms via API.

from prefixspan import PrefixSpan

db = [
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2],
]

ps = PrefixSpan(db)

For details of each parameter, please refer to the PrefixSpan class in prefixspan/api.py.

print(ps.frequent(2))
# [(2, [0]),
#  (4, [1]),
#  (3, [1, 2]),
#  (2, [1, 2, 2]),
#  (2, [1, 3]),
#  (2, [1, 3, 4]),
#  (2, [1, 4]),
#  (2, [1, 1]),
#  (2, [1, 1, 1]),
#  (3, [2]),
#  (2, [2, 2]),
#  (2, [3]),
#  (2, [3, 4]),
#  (2, [4])]

print(ps.topk(5))
# [(4, [1]),
#  (3, [2]),
#  (3, [1, 2]),
#  (2, [1, 3]),
#  (2, [1, 3, 4])]


print(ps.frequent(2, closed=True))

print(ps.topk(5, closed=True))


print(ps.frequent(2, generator=True))

print(ps.topk(5, generator=True))

Closed Patterns and Generator Patterns

The closed patterns are much more compact due to the smaller number.

  • A pattern is closed if there is no super-pattern with the same frequency.
prefixspan-cli frequent 2 --closed test.dat

0 : 2
1 : 4
1 2 : 3
1 2 2 : 2
1 3 4 : 2
1 1 1 : 2

The generator patterns are even more compact due to both the smaller number and the shorter lengths.

  • A pattern is generator if there is no sub-pattern with the same frequency.

  • Due to the high compactness, generator patterns are useful as features for classification, etc.

prefixspan-cli frequent 2 --generator test.dat

0 : 2
1 1 : 2
2 : 3
2 2 : 2
3 : 2
4 : 2

There are patterns that are both closed and generator.

prefixspan-cli frequent 2 --closed --generator test.dat

0 : 2

Custom Key Function

For both frequent and top-k algorithms, a custom key function key=lambda patt, matches: ... can be applied, where patt is the current pattern and matches is the current list of matching sequence (id, position) tuples.

  • In default, len(matches) is used denoting the frequency of current pattern.

  • Alternatively, any key function can be used. As an example, sum(len(db[i]) for i in matches) can be used to find the satisfying patterns according to the number of matched items.

  • For efficiency, an anti-monotone upper-bound function should also be specified for pruning.

    • If unspecified, the key function is also the upper-bound function, and must be anti-monotone.
print(ps.topk(5, key=lambda patt, matches: sum(len(db[i]) for i, _ in matches)))
# [(20, [1]),
#  (15, [2]),
#  (15, [1, 2]),
#  (10, [1, 3]),
#  (10, [1, 3, 4])]

Custom Filter Function

For both frequent and top-k algorithms, a custom filter function filter=lambda patt, matches: ... can be applied, where patt is the current pattern and matches is the current list of matching sequence (id, position) tuples.

  • In default, filter is not applied and all the patterns are returned.

  • Alternatively, any function can be used. As an example, matches[0][0] > 0 can be used to exclude the patterns covering the first sequence.

print(ps.topk(5, filter=lambda patt, matches: matches[0][0] > 0))
# [(2, [1, 1]),
#  (2, [1, 1, 1]),
#  (2, [1, 2, 2]),
#  (2, [2, 2]),
#  (1, [1, 2, 2, 0])]

Custom Callback Function

For both the frequent and the top-k algorithm, you can use a custom callback function callback=lambda patt, matches: ... instead of returning the normal results of patterns and their respective frequencies.

  • When callback function is specified, None is returned.

  • For large datasets, when mining frequent patterns, you can use callback function to process each pattern immediately, and avoid having a huge list of patterns consuming huge amount of memory.

  • The following example finds the longest frequent pattern covering each sequence.

coverage = [[] for i in range(len(db))]

def cover(patt, matches):
    for i, _ in matches:
        coverage[i] = max(coverage[i], patt, key=len)


ps.frequent(2, callback=cover)

print(coverage)
# [[1, 3, 4],
#  [1, 3, 4],
#  [1, 2, 2],
#  [1, 2, 2]]

Tip

I strongly encourage using PyPy instead of CPython to run the script for best performance. In my own experience, it is nearly 10 times faster in average. To start, you can install this package in a virtual environment created for PyPy.

Note that only the earlier version 0.4 works for the latest PyPy3 6.0.0 (compatible with Python 3.5.3). Please install it via pip3 install prefixspan==0.4. Latest version should work for the future PyPy3 (compatible with Python 3.6).