tommyod/Efficient-Apriori

H_1 is not reduced, unnecessary rule candidate generation/checking

Closed this issue · 3 comments

Ok, let's do this the proper way. My PR #44 was not working because I prematurely tried to fix something that I did not fully understand.

Description

When generating rule candidates with 1-item consequents, no pruning is applied. This leads to unnecessary confidence computations for candidates that are below the confidence threshold.

Minimum Working Example

Going from the docstring for generate_rules_apriori, the following example will generate unnecessary candidates

transactions = [
    ('a', 'b', 'c', 'd'),
    ('a', 'b', 'c'),
]
n = len(transactions)
min_sup = 0.5
min_conf = 1.0

itemsets, _ = itemsets_from_transactions(transactions, min_sup, max_length=4, verbosity=1)
rules = list(generate_rules_apriori(itemsets, min_conf, n, verbosity=1))

When deriving rules of size 4, all of these candidates are being checked:

1) 	{b, c, d} -> {a} (conf: 1.000, supp: 0.500, lift: 1.000, conv: 0.000)
2) 	{a, c, d} -> {b} (conf: 1.000, supp: 0.500, lift: 1.000, conv: 0.000)
3) 	{a, b, d} -> {c} (conf: 1.000, supp: 0.500, lift: 1.000, conv: 0.000)
4) 	{a, b, c} -> {d} (conf: 0.500, supp: 0.500, lift: 1.000, conv: 1.000)
5) 	{c, d} -> {a, b} (conf: 1.000, supp: 0.500, lift: 1.000, conv: 0.000)
6) 	{b, d} -> {a, c} (conf: 1.000, supp: 0.500, lift: 1.000, conv: 0.000)
7) 	{b, c} -> {a, d} (conf: 0.500, supp: 0.500, lift: 1.000, conv: 1.000)
8) 	{a, d} -> {b, c} (conf: 1.000, supp: 0.500, lift: 1.000, conv: 0.000)
9) 	{a, c} -> {b, d} (conf: 0.500, supp: 0.500, lift: 1.000, conv: 1.000)
10)	{a, b} -> {c, d} (conf: 0.500, supp: 0.500, lift: 1.000, conv: 1.000)
11)	{d} -> {a, b, c} (conf: 1.000, supp: 0.500, lift: 1.000, conv: 0.000)

In 10) we check {a, b} -> {c, d} although 4) {a, b, c} -> {d} (conf: 0.500) is already below the confidence threshold of 1.0.
Even the docstring says that it is not necessary to check 10) due to confidence-based pruning.

(In my fork, I added a few print statements to easily track what's being done for this very example in h_1_mwe.py)

Tasks

  • confirm that this behavior is not intended
  • find a way to prune H_1 properly
  • test whether it brings any speed improvement

Great job writing this clear issue @selting ! Thanks a lot! Do you have time to investigate further and propose a solution? If not I will try to take a look at it when I have the time, which might not be any time soon.

If I find the time, I will happily give it a try. But similarly to you, this might take a while. The issue is just performance-related, so I guess it's not crucial to fix it immediately.

Closed in #53 by @hprshayan