H_1 is not reduced, unnecessary rule candidate generation/checking
Closed this issue · 3 comments
Ok, let's do this the proper way. My PR #44 was not working because I prematurely tried to fix something that I did not fully understand.
Description
When generating rule candidates with 1-item consequents, no pruning is applied. This leads to unnecessary confidence computations for candidates that are below the confidence threshold.
Minimum Working Example
Going from the docstring for generate_rules_apriori, the following example will generate unnecessary candidates
transactions = [
('a', 'b', 'c', 'd'),
('a', 'b', 'c'),
]
n = len(transactions)
min_sup = 0.5
min_conf = 1.0
itemsets, _ = itemsets_from_transactions(transactions, min_sup, max_length=4, verbosity=1)
rules = list(generate_rules_apriori(itemsets, min_conf, n, verbosity=1))
When deriving rules of size 4, all of these candidates are being checked:
1) {b, c, d} -> {a} (conf: 1.000, supp: 0.500, lift: 1.000, conv: 0.000)
2) {a, c, d} -> {b} (conf: 1.000, supp: 0.500, lift: 1.000, conv: 0.000)
3) {a, b, d} -> {c} (conf: 1.000, supp: 0.500, lift: 1.000, conv: 0.000)
4) {a, b, c} -> {d} (conf: 0.500, supp: 0.500, lift: 1.000, conv: 1.000)
5) {c, d} -> {a, b} (conf: 1.000, supp: 0.500, lift: 1.000, conv: 0.000)
6) {b, d} -> {a, c} (conf: 1.000, supp: 0.500, lift: 1.000, conv: 0.000)
7) {b, c} -> {a, d} (conf: 0.500, supp: 0.500, lift: 1.000, conv: 1.000)
8) {a, d} -> {b, c} (conf: 1.000, supp: 0.500, lift: 1.000, conv: 0.000)
9) {a, c} -> {b, d} (conf: 0.500, supp: 0.500, lift: 1.000, conv: 1.000)
10) {a, b} -> {c, d} (conf: 0.500, supp: 0.500, lift: 1.000, conv: 1.000)
11) {d} -> {a, b, c} (conf: 1.000, supp: 0.500, lift: 1.000, conv: 0.000)
In 10) we check {a, b} -> {c, d}
although 4) {a, b, c} -> {d} (conf: 0.500)
is already below the confidence threshold of 1.0
.
Even the docstring says that it is not necessary to check 10) due to confidence-based pruning.
(In my fork, I added a few print statements to easily track what's being done for this very example in h_1_mwe.py
)
Tasks
- confirm that this behavior is not intended
- find a way to prune H_1 properly
- test whether it brings any speed improvement
Great job writing this clear issue @selting ! Thanks a lot! Do you have time to investigate further and propose a solution? If not I will try to take a look at it when I have the time, which might not be any time soon.
If I find the time, I will happily give it a try. But similarly to you, this might take a while. The issue is just performance-related, so I guess it's not crucial to fix it immediately.
Closed in #53 by @hprshayan