Algorithm outputs a series of repeated items but there are none in the training data

Question

Closed this issue 6 years ago · 5 comments

Hello,

I have noticed a behaviour that, to me, is a bit strange. I trained the algorithm with a series of sequences that had no repeated items, i.e. it's not possible that an item appears again immediately after itself, like 1 in the sequence [3, 2, 1, 1, 5, 7, 2].

When I generated the most frequent sequences, though, I obtained repeated items. Is it possible?

For example, given the code:
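import numpy as np
from prefixspan import PrefixSpan

# Ten input sequences; no item is ever immediately followed by itself
seqs = [[22, 16],
        [22, 21],
        [22, 16, 14, 20],
        [22, 16],
        [22, 16, 34, 24, 26, 24, 26, 14, 13],
        [22, 16],
        [22, 26],
        [22, 13, 34],
        [22, 16],
        [22, 21, 16]]

ps = PrefixSpan(seqs)
ps.minlen = 2   # shortest pattern length to report
ps.maxlen = 10  # longest pattern length to report

# Convert the relative support (10%) into an absolute count
freq_ratio = 0.1
freq = int(np.ceil(freq_ratio * len(seqs)))

res = ps.frequent(freq)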

The output contains [26, 26, 14, 13].

This is just a small reproducible example; my actual dataset has ~1000 sequences, and the problem occurs there too.

Thanks

Answer 1 · 2018-11-21T16:06:30.000Z

Hi, your relative support threshold is 0.1, so for your input of 10 sequences the absolute support threshold is 1. With a threshold of 1, every subsequence of every input sequence is reported as frequent, including subsequences that skip items in between.
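To make the arithmetic concrete, here is a minimal sketch in plain Python (it does not call the library). The helpers `is_subsequence` and `support` are hypothetical names introduced for illustration; they assume that support means the number of input sequences containing the pattern as a subsequence with gaps allowed, as described above.

```python
import math

seqs = [[22, 16], [22, 21], [22, 16, 14, 20], [22, 16],
        [22, 16, 34, 24, 26, 24, 26, 14, 13], [22, 16], [22, 26],
        [22, 13, 34], [22, 16], [22, 21, 16]]


def is_subsequence(pattern, seq):
    """True if the items of pattern occur in seq in order, gaps allowed."""
    it = iter(seq)
    # `item in it` advances the iterator until the item is found,
    # so each pattern item must appear after the previous match.
    return all(item in it for item in pattern)


def support(pattern, sequences):
    """Number of sequences that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences)


# Relative threshold 0.1 on 10 sequences gives an absolute threshold of 1
min_support = math.ceil(0.1 * len(seqs))
print(min_support)                       # 1

# A single occurrence is enough to be reported at this threshold
print(support([26, 26, 14, 13], seqs))   # 1
print(support([22, 16], seqs))           # 7
```

The pattern [26, 26, 14, 13] appears once, in the fifth sequence (26, skip 24, 26, 14, 13), which already meets a threshold of 1.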

Answer 2 · 2018-11-22T09:04:37.000Z

Thank you for your answer! Here I only used a small set of sequences so that it fits in a post, but it also happens on the set of ~1000 sequences I'm analysing, for example:

[22, 30, 30] with support 156 (13.3%)

Is this normal?

Answer 3 · 2018-11-24T17:47:33.000Z

I really can't tell from the description alone. Can you provide a tiny sample?

Answer 4 · 2018-11-26T12:10:50.000Z

I have attached a file with some example sequences. It does not contain sequences with repeated items (i.e. where the same number appears once and then immediately again) but in the output I obtain, for example:

(156, [22, 30, 30])

Thanks for your help

Attached file: seqs.txt

Answer 5 · 2018-12-06T07:37:32.000Z

Hi, you seem to misunderstand the concept of a pattern.

For example, for one of your provided sequences, [22, 1, 30, 1, 24, 30], the pattern [22, 30, 30] IS a subsequence of this sequence. Other items are allowed in between.
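The difference between the two readings of "pattern" can be sketched in plain Python. Both function names below are hypothetical and written only for this illustration; they are not part of the library.

```python
def matches_with_gaps(pattern, seq):
    """Pattern items appear in seq in order; other items may sit in between."""
    it = iter(seq)
    return all(item in it for item in pattern)


def matches_contiguously(pattern, seq):
    """Pattern items appear in seq back to back, with nothing in between."""
    n = len(pattern)
    return any(seq[i:i + n] == pattern for i in range(len(seq) - n + 1))


seq = [22, 1, 30, 1, 24, 30]
pattern = [22, 30, 30]

print(matches_with_gaps(pattern, seq))     # True: 22, (1), 30, (1, 24), 30
print(matches_contiguously(pattern, seq))  # False: 30 never follows 30 directly
```

The gapped reading is the one the answers in this thread describe, which is why [22, 30, 30] is counted even though 30 never immediately follows itself in the data.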