chuanconggao/PrefixSpan-py

How the support value is calculated in PrefixSpan-py module

Udayaprasad opened this issue · 17 comments

Hi Chuanconggao,

I am using your module in one of my research for doing closed frequent sequence mining. BIDE algorithm is obvious choice for doing this. I have few questions regarding support value calculation in this module.

Details given below.

I have 47088 transactions, each was having variable length of items. I have used these 47088 transactions for BIDE training. I have used topk() to extract top 1000 rules. One of the rule listed below.

(1033,
['bdb061461ee36ed04a69b4913a4c50af433d8de1',
'68d234c39fd9217721d3ee22d5b81981f6a19cdc',
'85a66b6b2cbe4e35ffe838f7080eeac4bb67cd1d',
'4728d1f5afc614cbe76fc729de89c6634faecf85'])

Here 1033 is support value. I am under impression that 1033 times this pattern is appeared in 47088 transactions. But actually I found that it exists in 339 times in all the transactions.

Can you explain how this support value is calculated here. Also, let me know if there is any error in calculation.

Thanks,
Uday Vakalapudi

Hi, please keep in mind that a pattern is contained within a transaction, when all of the pattern's item match sequentially within the transaction, with any gap in between of the matched transaction items. Please check the example for more examples.

So when you search the pattern, you cannot just search the whole text. I suspect that 339 times is the one you search using the whole text within text editor. One way to verify is using regular expression, by converting a pattern like abc into a.*b.*c.*.

Please let me know if there is any further issue.

Thank you for your reply. I am actually doing rolling window comparison. Below code will does the same.

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c
pattern = np.asarray(['bdb061461ee36ed04a69b4913a4c50af433d8de1',
'68d234c39fd9217721d3ee22d5b81981f6a19cdc',
'85a66b6b2cbe4e35ffe838f7080eeac4bb67cd1d',
'4728d1f5afc614cbe76fc729de89c6634faecf85'])
N = len(pattern)  
arr = events_train['id'].values
b = np.all(rolling_window(arr, N) == pattern, axis=1)
lenn = len([i for i in b if i == True])

At the end final variable "lenn" should return 1033, but it is finding only 339. This is rolling window comparison with stride 1.

Thanks,
Uday Vakalapudi

Rolling window matching is still whole text matching, by moving from left to right, right? Sorry, I do not use NumPy a lot.

This has two issues. One, it does not consider matching with gap. Two, it may count the transaction more than once if it contains the pattern at more than one places.

Please also try regex matching to help verify. You can do it easily in most text editor.

Rolling window considers matching with gap. This is how it works.
our pattern rule = [1,2,3]
list = [0,1,2,3,4,5,1,2,3]

BIDE: Efficient Mining of Frequent Closed Sequences

rolling window 1 of length 3 (length of the pattern): if [1,2,3] == [0,1,2]: False
rolling window 2: if [1,2,3] == [1,2,3]: True
rolling window 3: if [1,2,3] == [2,3,4]: False
rolling window 4: if [1,2,3] == [3,4,5]: False
rolling window 5: if [1,2,3] == [4,5,1]: False
rolling window 6: if [1,2,3] == [5,1,2]: False
rolling window 7: if [1,2,3] == [1,2,3]: True

Hence from the above main list we have found that there are two patterns (2 True values). I have to perform this using the rules mined from BIDE algorithm.

It seems you have misunderstanding of matching with gap on pattern mining.

If a pattern [1, 2, 3] matches [1, 2, 3, 4], it is exact subsequence match. If the same pattern matches [1, 2, 2, 3], it is subsequence match with gap.

In your example, it is always exact match, just matched twice within same transaction.

What you want is exact match. Please check the definition of pattern mining and make sure this is what you can use.

Hey,
I think there are two types of sequences, one is closed and other one is open. In Open sequence mining we consider match with gap but, in closed sequence mining we will consider strict/exact pattern matches.

Your example:

from prefixspan import PrefixSpan

db = [
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2],
]

ps = PrefixSpan(db)
print(ps.topk(5, closed=True))

This outputs exact match scenarios.
[(3, [1, 2]), (2, [1, 1, 1]), (2, [1, 2, 2]), (2, [1, 3, 4]), (1, [1, 1, 1, 3, 4])]

Sorry, the definition of closed pattern mining is not exact matching. Please check the paper in README.md to get the exact definition. In general, it means for each closed pattern there is no super-pattern with same support.

Thank you for your support. I may be wrong. I will read this paper. Can you explain me how to apply these extracted rules on test data to check their sequence match?

I thought just doing rolling window comparison will do the job. Please provide your suggestions to mark the sequence rules in test data.

Thanks,
Uday Vakalapudi

In general, if you want to know the support of a pattern like "abc", you can build a regular expression "a.*b.*c". You can then do find with regex in most modern text editor.

Please keep in mind that support is defined as the number of transactions containing the pattern, not the number of appearance.

"a.*b.*c" regex gives us number of transactions containing the pattern (Support), which is essentially 1033 in my case. Please explain then how 339 count I got from overall transactions?
Sorry If I don't make sense.

1033 is number of transactions containing the pattern via matching with gap.

339 is number of exact matchings of the pattern among all transactions you did previously.

One is matching with gap and only once per transaction. Another is exact matching for all positions among all transactions.

One is matching with gap < Exact matching for all transactions ?
Not True?

abcdabceabc
abcdabceabc
abcdabceabc

Support of "abc" = 3 (Pattern having 3 transactions)
Exact match of "abc" = 9 (Exact matching for all transactions: 3*3=9)

So,
Matching with gap < Exact matching for all transactions

One final clarification sir.

In your example, yes, the support of "abc" is 3 as you have only 3 transactions.

However, in the following example, you have 6 against 2 for pattern "abc":

aabbcc
abbc
abc
aaabbbccc
abcc
abbbbbccc

I am not sure about the size of dataset you used at the first place to get the number 1033 and 339. Anyway, this is all data relevant.

I have used ~1.4 million data points. These data points are used to form 47088 transactions. I have extracted top 1000 rules (Closed patterns - Assuming transaction with at least one strict pattern match in it). One of the top rule is
(1033,
['bdb061461ee36ed04a69b4913a4c50af433d8de1',
'68d234c39fd9217721d3ee22d5b81981f6a19cdc',
'85a66b6b2cbe4e35ffe838f7080eeac4bb67cd1d',
'4728d1f5afc614cbe76fc729de89c6634faecf85'])

I have just verified whether I will get at least 1033 strict matches in the same 1.4 million dataset. I end up having 339 exact matches. That is the reason I am looking for your help that how support value is calculated. Hope you understand my end goal of this problem.

Support Value <= Exact matchings for all transactions ? Isn't it?

Thank you for taking time to answer all my questions.

Like I mentioned in my previous example, it is possible and very likely for support to be larger than number of exact matches.

It not always yes or no, but depends on your data. There is not definitive answer.

I have never seen a person like you, answering so patiently. Please send few example scenarios where Support value is larger than number of exact matches, If you can try to put it general examples.

Thank you so much sir

Please try the gazelle dataset used by BIDE. It is publicly and you can easily find it by Google.