Algorithm outputs a series of repeated items but there are none in the training data
Closed this issue · 5 comments
Hello,
I have noticed a behaviour that, to me, is a bit strange. I trained the algorithm with a series of sequences that had no repeated items, i.e. it's not possible that an item appears again immediately after itself, like 1 in the sequence [3, 2, 1, 1, 5, 7, 2].
However, when I generated the most frequent sequences, I obtained patterns with repeated items. Is that expected?
For example, given the code:
from prefixspan import PrefixSpan
import numpy as np

seqs = [[22, 16],
        [22, 21],
        [22, 16, 14, 20],
        [22, 16],
        [22, 16, 34, 24, 26, 24, 26, 14, 13],
        [22, 16],
        [22, 26],
        [22, 13, 34],
        [22, 16],
        [22, 21, 16]]
ps = PrefixSpan(seqs)
ps.minlen = 2
ps.maxlen = 10
freq_ratio = 0.1
# Minimum support as an absolute count: ceil(0.1 * 10) = 1
freq = np.ceil(freq_ratio * len(seqs)).astype(int)
res = ps.frequent(freq)
The output contains the pattern [26, 26, 14, 13].
This is just a small reproducible example; my actual dataset has ~1000 sequences, and the problem occurs there too.
Thanks
Thank you for your answer! Here I just generated a small set of rules so that it fits in a post, but it also happens on the set of ~1000 sequences I'm analysing, for example:
[22, 30, 30] with support 156 (13.3%)
Is it normal?
I really can't tell from the description alone. Can you provide a tiny sample?
I have attached a file with some example sequences. It does not contain sequences with repeated items (i.e. where the same number appears once and then immediately again) but in the output I obtain, for example:
(156, [22, 30, 30])
Thanks for your help
Attached file: seqs.txt
Hi, you seem to misunderstand the concept of a pattern.
For example, for one of the sequences you provided, [22, 1, 30, 1, 24, 30], the pattern [22, 30, 30] IS a sub-pattern of this sequence. Other items are allowed in between.
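To make the distinction concrete, here is a minimal sketch in plain Python. The helper names is_subsequence and is_contiguous are made up for this illustration and are not part of the PrefixSpan library; the sketch only shows the matching rule described above (items in order, gaps allowed) next to the stricter "no gaps" rule the question seems to assume.

```python
def is_subsequence(pattern, sequence):
    """Return True if the items of pattern occur in sequence in order, gaps allowed."""
    it = iter(sequence)
    # `item in it` consumes the iterator up to the first match,
    # so each later item must be found after the previous one.
    return all(item in it for item in pattern)


def is_contiguous(pattern, sequence):
    """Return True only if pattern occurs in sequence as an unbroken run."""
    n = len(pattern)
    return any(sequence[i:i + n] == pattern
               for i in range(len(sequence) - n + 1))


seq = [22, 1, 30, 1, 24, 30]
print(is_subsequence([22, 30, 30], seq))  # True: 22 ... 30 ... 30, with gaps
print(is_contiguous([22, 30, 30], seq))   # False: 30 never directly follows 30
```

The same check explains the first example: [26, 26, 14, 13] matches [22, 16, 34, 24, 26, 24, 26, 14, 13] because the two occurrences of 26 are separated by 24, which the pattern is allowed to skip.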