PrefixSpan can't find certain patterns
JSAbrahams opened this issue · 7 comments
Currently Prefix-Span won't generate all possible sub-sequences of a pattern, which means that it is not a correct implementation of the algorithm as put forward by Pei et al.
To illustrate, say we have as output sequence [node1, node2, node3, node4, node5]
. In addition to the sequence itself, Prefix-Span will find all sub-sequences except:
[node2, node3, node4, node5]
[node2, node3, node4]
[node3, node4, node5]
It will find entire sequences, but not all sub-sequences. So at the moment, it is sufficient to generate test cases. However this should eventually be fixed as it is not a proper implementation of the algorithm, and does not function as it should/how we expect it to.
I've experimented a bit further with longer and smaller paths, and this seems to be the pattern (no pun intended): The algorithm finds all subpatterns of lengths 1 and 2, without issues. For all subpattern-lengths greater than 2, it only finds the subpatterns starting on the first node of the frequent sequence (as opposed to those that start somewhere in the middle). This means that it finds exactly one of length 3, 4, and so on. I haven't found a solution yet, but this might help :)
I had found something along those lines as well, but thank you for sharing! I will look into it further soon.
It seems that the issue is that the condition pathContainsPrefix
checks for both the prefix plus the item and the last item of the prefix plus the item. However the body of the if makes no distinction between these cases meaning the new prefix is always the entire prefix plus the item and never the last item of the prefix plus the item.
If I modify parts of the PrefixSpan algorithm to read like this:
private fun runAlgorithm(prefix: List<Node> = emptyList(), projectedPaths: Collection<List<Node>> = allPaths) {
frequentItems.forEach { frequentItem ->
if (projectedPaths.any { pathContainsPrefix(it, prefix, frequentItem) || pathContainsLastOfPrefix(it, prefix, frequentItem) }) {
val newPrefix = prefix + frequentItem
frequentSequences += newPrefix
runAlgorithm(newPrefix, extractSuffixes(prefix, allPaths))
}
if (projectedPaths.any { pathContainsLastOfPrefix(it, prefix, frequentItem) }) {
val newPrefix = prefix.last() + frequentItem
frequentSequences += newPrefix
runAlgorithm(newPrefix, extractSuffixes(prefix, allPaths))
}
}
}
private fun pathContainsPrefix(path: List<Node>, prefix: List<Node>, frequentItem: Node) =
pathContainsSequence(path, prefix + frequentItem, comparator)
private fun pathContainsLastOfPrefix(path: List<Node>, prefix: List<Node>, frequentItem: Node) =
prefix.isNotEmpty() && pathContainsSequence(path, listOf(prefix.last(), frequentItem), comparator)
... we get 14 out of 15 patterns. The pattern we're missing for length 5 is (2, 3, 4, 5). These are more patterns but unfortunately the change also produces a StackOverflow exception elsewhere. I haven't found the cause yet...
Hmm this does seem weird. It seems to be that the second if
will never be true if the first if
wasn't true. And the condition of the first if
is functionally identical to the current one, with the only difference being that the function has been split in two (pathContainsPrefix
and pathContainsLastofPrefix
) and that the or
operator has been lifted out.
I can't really figure out why more patterns are found here.
No, there is another difference. Read this line in the second if
carefully: val newPrefix = prefix.last() + frequentItem
As PrefixSpan is no longer part of this repository, I have created an issue there and I'll this one.