chuanconggao/PrefixSpan-py

wrong results

MahmoudAdel-hub opened this issue · 3 comments

from prefixspan import PrefixSpan

prefix = PrefixSpan(basket)
prefix.topk(10)

[(47, [(3657,)]), (42, [(3655,)]), (23, [(1915,)]), (13, [(1284,)]), (12, [(2098,)]), (11, [(372,)]), (10, [(3655,), (3655,)]), (9, [(395,)]), (9, [(660,)]), (9, [(1566,)])]

3657 appears 47 times!!

When I use the SPMF library, it gives me 242!!
I checked manually on the first 10 sequences: this library's algorithm reported 6, but counting by hand I saw it 3 times, and SPMF gave me that same result.

There are so many details missing, like your dataset. Without that, it is impossible to know whether we have correct output or not.

Also, this library is for sequence mining, where order matters. Judging by your variable name, basket, you may be using it for itemset mining?
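
To illustrate why the distinction matters, here is a minimal sketch in plain Python. It does not use this library; the helper name `contains_in_order` and the toy lists are made up for illustration. The same items can match as an itemset and still fail to match as a sequence:

```python
def contains_in_order(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence` (gaps allowed, order kept)."""
    it = iter(sequence)
    # `item in it` consumes the iterator up to the match, so each later
    # pattern item must be found after the previous one.
    return all(item in it for item in pattern)

# Sequence view: order matters.
print(contains_in_order([1, 5, 2, 3], [1, 2]))  # True
print(contains_in_order([2, 5, 1, 3], [1, 2]))  # False

# Itemset view: order is ignored, so the second case matches too.
print(set([1, 2]) <= set([2, 5, 1, 3]))  # True
```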

[[(3224, 1242, 2648, 1348, 1616, 1617, 1618, 1619, 1620, 1647, 1570, 1571, 1572, 1573, 1574, 1097, 3263, 390, 1082, 3107, 3129, 3128, 3130, 1350, 1353, 3328, 328, 748, 2887, 202, 204), (2921, 2922, 3224, 163, 1242, 1243, 270, 1570, 1571, 1572, 1573, 1574, 1083, 3124, 3123, 905, 1278, 2587, 376, 315, 311, 1284, 3107, 1027, 815, 1283, 1353, 748, 2887), (1243, 1241, 1239, 1242, 126, 1907, 784, 1402, 2004, 1293, 1619, 1082, 1083, 904, 315, 1284, 1545, 748, 1348, 2888, 2889, 25, 1663, 1353), (1915, 1989, 1998, 1999, 1997, 2000, 2001, 2002, 1239, 1242, 3279, 84, 1973, 1083, 2887, 2623, 1572, 1570), (2122, 2111, 1242, 1241, 1243, 1239, 1240, 641, 163, 1572, 1973, 1974, 2587, 3124, 3123, 904, 1284, 2002, 2887, 1829, 784, 2130), (2283, 455, 675, 1240, 1242, 1239, 1241, 1128, 1783, 1284, 2000, 2597, 2591, 2587, 1544, 1543, 2887, 1915, 748, 2004, 784, 2310, 2308, 2305, 1829, 1412, 1348, 1470, 1973, 2226, 2227, 2225, 1545, 1570, 1573, 1571, 1572, 905, 1278, 2122, 3123, 3124, 1083, 1082, 163, 1989, 1024), (2299, 2354, 328, 2921, 748, 1915, 163, 455, 2089, 2308, 2310)], [(3123, 420, 418, 3120, 1790, 906, 3657), (746, 3657), (1908, 1909, 1907, 1298, 3657), (1908, 1909, 3657)], [(527, 959, 433, 1495, 1624, 1284, 1502, 1015, 764, 765, 763, 1693, 723, 1479, 1625), (3657, 1502, 2797, 1031, 168 ,,,,]

As the dataset is still incomplete, it is impossible to verify.

As no one else has reported the same issue, I am closing it for now.

You mentioned that 3657 appears x times with this algorithm vs. y times with a different one. You can simply use a command-line tool like grep to count the lines containing 3657; that count should match this algorithm's result, assuming your dataset has one sequence per line. Remember, this algorithm gives you the number of sequences containing the pattern, not the number of times the pattern appears (which can be more than once in the same sequence).
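
The difference between the two counts can be sketched in plain Python, without this library. The helper names `support` and `occurrences` are made up for illustration. The first toy sequence is copied from the dataset excerpt above; the other two are hypothetical:

```python
# Same shape as the input in this issue: a list of sequences,
# each sequence a list of tuples of item IDs.
sequences = [
    # Copied from the dataset excerpt in this thread.
    [(3123, 420, 418, 3120, 1790, 906, 3657), (746, 3657),
     (1908, 1909, 1907, 1298, 3657), (1908, 1909, 3657)],
    [(1, 2), (3657, 4)],  # hypothetical
    [(5, 6), (7,)],       # hypothetical
]

def support(seqs, item):
    """Number of sequences that contain `item` at least once."""
    return sum(1 for seq in seqs if any(item in element for element in seq))

def occurrences(seqs, item):
    """Total number of times `item` appears across all sequences."""
    return sum(element.count(item) for seq in seqs for element in seq)

print(support(sequences, 3657))      # 2
print(occurrences(sequences, 3657))  # 5
```

Here 3657 shows up four times inside the first sequence, but that sequence contributes only 1 to the support count.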