mhahsler/arules

Negative Odds Ratios

zerweck opened this issue · 5 comments

I recently came across a rule with a huge negative Odds Ratio:

> rules[69804] %>% interestMeasure(transactions, "oddsRatio") 
[1] -5.954607e+20

I guess it's hard to replicate without supplying my whole dataset, but I tried to track down the cause.
In the function .getCounts, the variable f01 becomes negative.
It is computed as fx1 - f11, i.e.

.rhsSupport(x, transactions, reuse) * N - interestMeasure(x, "support", transactions, reuse) * N

or, as I (hopefully correctly) interpret it, supp(Y) - supp(X=>Y) scaled by N, which should never be negative.

Here are the raw numbers printed in full. You can see that the rule support behind f11 is in fact stored as a slightly larger value than the RHS support behind fx1.

> sprintf("%.66f",.rhsSupport(x, transactions, reuse))
[1] "0.000092340537126436346296656787480117145605618134140968322753906250"
> sprintf("%.66f",interestMeasure(x, "support", transactions, reuse))
[1] "0.000092340537126436359853520752238864588434807956218719482421875000"
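
The effect can be reproduced in isolation. The following is a minimal sketch, not taken from arules: x and y are made-up values (y is x nudged upward by a relative change of about one machine epsilon), and N is the transaction count of this dataset.

```r
# Illustrative values only; they are not the supports printed above.
N <- 270737L
x <- 25 / N                          # stands in for supp(Y)
y <- x * (1 + .Machine$double.eps)   # stands in for supp(X=>Y), a hair larger

print(c(x, y))               # both print the same at default precision
isTRUE(all.equal(x, y))      # TRUE: equal within numerical tolerance
x == y                       # FALSE: the stored doubles differ
x - y                        # a tiny negative number instead of 0
```

Any quantity derived from such a difference, like f01, inherits the wrong sign.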

I guess the error originates at a lower level, such as arules::support() or even the C level, but I didn't track it down any further. It is probably a floating-point rounding error somewhere.
I just wanted to draw attention to this. Maybe a simple check could be added here that replaces negative numbers with zero and issues a warning.

Clarification edit: Of course, only the result of this subtraction should be set to zero, so that the division yields Inf, which is a valid value for an odds ratio.
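
A minimal sketch of such a check, under the assumption that it would be applied to the counts before the odds ratio is computed. The helper clamp_counts is hypothetical and not part of arules, and the counts are made up for illustration.

```r
# Hypothetical helper, not part of arules: replace negative counts by zero
# and warn, as proposed above.
clamp_counts <- function(counts) {
  neg <- !is.na(counts) & counts < 0
  if (any(neg)) {
    warning(sum(neg), " negative count(s) replaced by 0")
    counts[neg] <- 0
  }
  counts
}

# Counts chosen for illustration; f01 carries a tiny negative rounding error.
f11 <- 25; f10 <- 3; f00 <- 1000
f01 <- -1.776357e-15

f11 * f00 / (f10 * f01)                           # large negative "odds ratio"
f01_fixed <- suppressWarnings(clamp_counts(f01))  # warning silenced for the demo
f11 * f00 / (f10 * f01_fixed)                     # Inf
```

A clamp like this would also hide genuinely wrong input, so a tolerance on how negative a count may be could be a sensible refinement.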

Please check whether the transactions supplied to interestMeasure() are exactly the same as those used to mine the rules. If they are not, all kinds of bad things can happen, and you should use rules %>% interestMeasure("oddsRatio", transactions, reuse = FALSE)
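
For readers unfamiliar with the reuse argument, here is a minimal sketch of the comparison, assuming the arules package is installed. It uses the Groceries data set that ships with arules rather than the data from this issue, and the support and confidence thresholds are arbitrary.

```r
library(arules)

data("Groceries")
# Thresholds are arbitrary and only serve to produce a handful of rules.
rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.5),
                 control = list(verbose = FALSE))

# reuse = TRUE (the default) reuses the quality measures stored with the
# rules; reuse = FALSE recounts everything from the supplied transactions.
or_reuse <- interestMeasure(rules, "oddsRatio",
                            transactions = Groceries, reuse = TRUE)
or_fresh <- interestMeasure(rules, "oddsRatio",
                            transactions = Groceries, reuse = FALSE)

# With matching transactions the two should agree up to rounding.
all.equal(or_reuse, or_fresh)
```

If the two vectors disagree noticeably, the stored quality measures probably do not belong to the transactions that were passed in.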

They are; transactions is the exact same R object that was passed to the earlier apriori call.
However, I will try to reproduce the error with reuse=FALSE and report back with the results.

Update: I retried with reuse=FALSE and it did not change the result.

Here is the count parameter list as seen by .basicRuleMeasure in both cases:

count_reuse_TRUE <- list(f11 = 9.59736968042681, f1x = 10.397150487129, 
     fx1 = 9.59736968042681, f0x = 270726.602849513, fx0 = 270727.40263032, 
     f10 = 0.799780806702232, f01 = -1.77635683940025e-15, f00 = 270726.602849513, 
     N = 270737L)
count_reuse_FALSE <- list(f11 = 9.59736968042681, f1x = 10.397150487129, 
     fx1 = 9.59736968042681, f0x = 270726.602849513, fx0 = 270727.40263032, 
     f10 = 0.799780806702232, f01 = -1.77635683940025e-15, f00 = 270726.602849513, 
     N = 270737L)
> identical(count_reuse_TRUE, count_reuse_FALSE)
[1] TRUE

This is really strange! The numbers you have should all be counts and thus integers or, even with rounding problems, very close to integers... I need code and data to reproduce the problem so I can fix it.

Strange, I must have made a mistake and can't reproduce my own results. I now get integer values, and the result is a 0/0 odds ratio, so NaN.

count_reuse_FALSE <- list(f11 = 0, f1x = 8, fx1 = 0, f0x = 270729, fx0 = 270737, 
    f10 = 8, f01 = 0, f00 = 270729, N = 270737L)
# oddsRatio, evaluated on the counts in the list above:
> with(count_reuse_FALSE, f11 * f00/(f10 * f01))
[1] NaN

So you were completely right in your initial assumption that only the reuse parameter needed to be set to FALSE. I am still not sure why this gives me different results than calling it with reuse=TRUE, since my transactions should be the same. Maybe I need to take a closer look at this as well. Sorry for taking up your time.