chuanconggao/PrefixSpan-py

Algorithm outputs a series of repeated items but there are none in the training data

Closed this issue · 5 comments

Hello,

I have noticed a behaviour that seems a bit strange to me. I trained the algorithm on a set of sequences with no repeated items, i.e. no item ever appears again immediately after itself, as 1 does in the sequence [3, 2, 1, 1, 5, 7, 2].

When I generated the most frequent sequences, though, I obtained patterns with repeated items. Is that expected?

For example, given the code:
from prefixspan import PrefixSpan
import numpy as np

seqs = [[22, 16],
        [22, 21],
        [22, 16, 14, 20],
        [22, 16],
        [22, 16, 34, 24, 26, 24, 26, 14, 13],
        [22, 16],
        [22, 26],
        [22, 13, 34],
        [22, 16],
        [22, 21, 16]]

# Only report patterns of length 2 to 10
ps = PrefixSpan(seqs)
ps.minlen = 2
ps.maxlen = 10

# Minimum support: 10% of the sequences, rounded up
freq_ratio = 0.1
freq = int(np.ceil(freq_ratio * len(seqs)))

res = ps.frequent(freq)

The output includes the pattern [26, 26, 14, 13].

This is just a small reproducible example. My actual dataset has ~1000 sequences, and the problem occurs there too.

Thanks

Thank you for your answer! Here I just generated a small set of sequences so that it fits in a post, but it also happens on the set of ~1000 sequences I'm analysing, for example:

[22, 30, 30] with support 156 (13.3%)

Is it normal?

I can't really tell from the description alone. Can you provide a tiny sample?

I have attached a file with some example sequences. It does not contain sequences with repeated items (i.e. where the same number appears once and then immediately again) but in the output I obtain, for example:

(156, [22, 30, 30])

Thanks for your help

Attached file: seqs.txt

Hi, you seem to misunderstand the concept of a pattern.

For example, for one of your provided sequences, [22, 1, 30, 1, 24, 30], the pattern [22, 30, 30] IS a subsequence of this sequence. Other items are allowed in between the items of a pattern.
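
To make the distinction concrete, here is a minimal sketch in plain Python. The helper names `is_subsequence` and `is_contiguous` are hypothetical and are not part of PrefixSpan-py; they only illustrate the difference between "items in order, gaps allowed" and "items directly adjacent".

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in order, gaps allowed."""
    it = iter(sequence)
    # `item in it` consumes the iterator up to the first match,
    # so each pattern item must be found after the previous one.
    return all(item in it for item in pattern)


def is_contiguous(pattern, sequence):
    """True if pattern occurs in sequence as an unbroken run of items."""
    n = len(pattern)
    return any(sequence[i:i + n] == pattern
               for i in range(len(sequence) - n + 1))


seq = [22, 1, 30, 1, 24, 30]
print(is_subsequence([22, 30, 30], seq))  # True: 22 ... 30 ... 30, with gaps
print(is_contiguous([22, 30, 30], seq))   # False: the items are not adjacent

# The same reasoning applies to the small example from the first post.
longer = [22, 16, 34, 24, 26, 24, 26, 14, 13]
print(is_subsequence([26, 26, 14, 13], longer))  # True
```

So a pattern with a repeated item only requires that the item occurs at least twice somewhere in a sequence, not twice in a row.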