Beta-learning adds statistical insignificance to irrelevant parameters
Kirluu opened this issue · 3 comments
This issue is partly a red flag and partly an attempt to understand why Figaro's learning logic works the way it does. Its behaviour appears to hinder our machine learning efforts by making observations statistically insignificant for cases in our model that are only rarely "relevant" or "active".
We'll try to provide a specific, boiled-down example:
Consider the following Figaro model:
    class Model(parameters: ParameterCollection, modelUniverse: Universe) {
      val flip = Flip(0.1)("myFlip", modelUniverse)
      val weightTrue = Flip(parameters.get("weightTrue"))("elem_weightTrue", modelUniverse)
      val weightFalse = Flip(parameters.get("weightFalse"))("elem_weightFalse", modelUniverse)
      val cpd = RichCPD(flip,
        (OneOf(true)) -> weightTrue,
        (*) -> weightFalse)("elem_cpd", modelUniverse)
    }
Note that the model contains only two regular elements, "cpd" and "flip", and that the CPD is constructed using two different weights: one for the case when "flip" is true, and one for when it is false.
The parameters are both initialized as Beta(1,1) elements, expressing that we have no prior beliefs regarding the parameters, and we place these Beta elements inside the ModelParameters construct.
Next, consider the following observations taking place:
    for (i <- 1 to 100) {
      val model = new Model(modelParameters.priorParameters, universe)
      model.flip.observe(true)
      model.cpd.observe(true)
    }
At this point, we've learned that the "cpd" element very often turns out true, given that "flip" is also true. But what about "cpd"'s probability of being true when "flip" is false? Indeed, we have had no observations covering this case.
Accordingly, we get the following outcomes for the parameters' MAP values:
weightTrue -> 1.0 (101.0, 1.0)
weightFalse -> 0.5 (51.0, 51.0)
First question: What is up with the counting of the alpha and beta values of "weightFalse" here? As mentioned, we've not seen a single sample of what probability "cpd" should have when "flip" is false. Even so, it would appear that Figaro records 50 cases on either side. This gives us a probability of 50 %, which is actually fine - for now - but what happens when we also start observing things about the case where "flip" is observed to be false? Let's take a look:
We add a single additional Model-instance with observations:
    val model = new Model(modelParameters.priorParameters, universe)
    model.flip.observe(false)
    model.cpd.observe(true)
After this additional evidence is supplied, we expect the following:
- We will maintain the same probability of "cpd" being true when "flip" is known to be true, namely 100 %. This is what we expect since the new observation does not fit into the bucket of cases where "flip" is true.
- We will expect that the probability of "cpd" being true when "flip" is known to be false to change from 50 % (unknown, Beta(1,1), no prior beliefs) to 100 % (having seen exactly one case to support the claim, and none for the opposite case ("cpd" being false when "flip" is false)).
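For reference, the expectation in the two points above follows the standard conjugate Beta-Bernoulli update. Here s and f, the numbers of relevant positive and negative observations, are notation introduced for this note only:

```latex
\mathrm{Beta}(\alpha, \beta)
\;\longrightarrow\;
\mathrm{Beta}(\alpha + s,\ \beta + f),
\qquad
\text{mode} = \frac{\alpha + s - 1}{\alpha + \beta + s + f - 2}
```

For weightFalse, α = β = 1 with s = 1 and f = 0 gives Beta(2, 1), whose mode is 1/1 = 1.0, i.e. 100 % with counts (2.0, 1.0).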
However, here are the actual results:
weightTrue -> 0.9999999951950983 (101.99999951470492, 1.000000485295074)
weightFalse -> 0.5195098277585943 (53.470492603618055, 49.529507396382)
Second question: Why is our belief about weightFalse not "1.0 (2.0, 1.0)" here? We've seen exactly one instance where "flip" was false, thus "activating" or "making relevant" this weight. Why is all the noise from the other 100 "irrelevant" instances present?
Third question: Similarly for weightTrue: why does this latest case affect weightTrue at all, and in such an odd manner, adding "almost 1" to its learnedAlpha and nearly nothing to its learnedBeta?
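One unverified reading reproduces all of the figures quoted above. It assumes that in every EM iteration, each instance in which a weight's Flip is not pinned down by evidence contributes the current estimate p as a fractional "true" count and 1 - p as a fractional "false" count, and that the run performed 4 iterations starting from p = 0.5. Neither assumption has been checked against Figaro's source. The plain-Scala sketch below uses no Figaro calls; `step` and `run` are hypothetical helpers written for this illustration.

```scala
object EmTrace {
  // One assumed EM step for a Beta-distributed Flip weight (not Figaro API).
  // observedTrue / observedFalse: instances where evidence pins the Flip's value.
  // unobserved: instances where the Flip is irrelevant and contributes its
  // expected value under the current estimate p.
  // Returns (learnedAlpha, learnedBeta, new MAP estimate).
  def step(priorAlpha: Double, priorBeta: Double,
           observedTrue: Int, observedFalse: Int, unobserved: Int,
           p: Double): (Double, Double, Double) = {
    val alpha = priorAlpha + observedTrue + unobserved * p
    val beta  = priorBeta + observedFalse + unobserved * (1.0 - p)
    (alpha, beta, (alpha - 1.0) / (alpha + beta - 2.0))
  }

  // Run `iterations` steps from the starting estimate p0 with a Beta(1,1) prior.
  def run(observedTrue: Int, unobserved: Int, p0: Double, iterations: Int): (Double, Double, Double) =
    (1 to iterations).foldLeft((1.0, 1.0, p0)) { case ((_, _, p), _) =>
      step(1.0, 1.0, observedTrue, 0, unobserved, p)
    }

  def main(args: Array[String]): Unit = {
    // weightFalse before the extra instance: 100 instances, all unobserved.
    println(run(observedTrue = 0, unobserved = 100, p0 = 0.5, iterations = 4))
    // weightFalse afterwards: 1 instance pins it to true, 100 leave it unobserved.
    println(run(observedTrue = 1, unobserved = 100, p0 = 0.5, iterations = 4))
    // weightTrue afterwards: 100 instances pin it to true, 1 leaves it unobserved.
    println(run(observedTrue = 100, unobserved = 1, p0 = 0.5, iterations = 4))
  }
}
```

Under these assumptions the three lines come out at (51.0, 51.0) with estimate 0.5, roughly (53.4705, 49.5295) with estimate 0.5195, and roughly (101.9999995, 1.0000005) with an estimate just below 1. If the reading is right, the weightFalse estimate is still moving after 4 iterations, so the iteration count would be worth varying when reproducing this.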
Notes:
We used regular EMWithVE training for this test setup. We can provide the specific source code if desired.
Using Online EM makes no difference.
Building the model without RichCPD (e.g. with If-constructs) makes no difference, regardless of whether elements are declared before the If or inside it (we tested whether the laziness of If mattered).
Taking a broader perspective:
In our actual project, we are dealing with a larger model, which contains more CPDs built in this fashion. This odd behaviour affects our training, and thus our evaluations, in several ways:
- All of our learned parameters are "flooded" by the statistical insignificance introduced in the same way as in the example above.
- All of our parameters are heavily "guided" by whatever prior beliefs we supply to the Beta-elements when initializing them.
Basically, the problem is not specific to Beta(1,1). Say we initialize a parameter with Beta(2,5) (namely 1 positive case and 4 negative). With no relevant cases observed, we get 20 % as expected, but instead of the expected learnedAlpha = 2 and learnedBeta = 5 we get (with 100 model instances with observations) learnedAlpha = 22.2 and learnedBeta = 85.8.
This is concerning, since the problem exemplified above is reproduced no matter what prior beliefs are passed in: the significance of one actually relevant data instance with observations is naturally extremely low compared to the total number of data instances with observations.
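The Beta(2,5) counts fit a simple reading that has not been verified against Figaro's source: each instance in which the weight's Flip is unobserved adds the current estimate p = 0.2 as a fractional positive count and 1 - p = 0.8 as a fractional negative count. With n such instances (n is notation introduced here):

```latex
\text{learnedAlpha} = 2 + 0.2\,n,
\qquad
\text{learnedBeta} = 5 + 0.8\,n
```

Under this reading, n = 101 gives exactly 22.2 and 85.8, whereas n = 100 would give 22 and 85, so the run behind these figures may have contained 101 instances.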
In general, we fear that this voids our usage of the built-in machine learning setup via the EM-algorithm in Figaro.
In case our concerns are actually what is void, we would very much like an explanation that either (a) defends the results observed, with backing in some theoretical foundation that we appear to be missing, or (b) suggests an error in our usage of Figaro, which makes it act strangely, along with a concrete modification to the given example, which will make Figaro behave as we expect.
We highly appreciate any input anyone may have toward a resolution for this problem.
Thank you very much,
Best regards,
Hopeful students of the IT University of Copenhagen,
Christian and Kenneth,
chnr@itu.dk and kulr@itu.dk
Since you helped us before, I am going to go ahead and add a tag for you, @apfeffer :)
Hi @bruttenberg ,
Thank you very much for your informative reply - it certainly helps us understand part of the reasoning behind what seems to be going on.
I do have a follow-up question, considering not only the example given, but also a scenario where the example is expanded upon.
Imagine a RichCPD like the one in the example, but instead of depending on one Flip, it depends on 5 Flip variables. Let's also say that the RichCPD has a match for all possible cases, namely 32 different cases ( |{true, false}|^5 ). For each case, its body maps to a different, unique, parametrized Flip with its own learnable Beta-parameter (still shared across model instances as before).
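To make the size of that case table concrete, here is a small plain-Scala sketch that enumerates the 32 assignments. It uses no Figaro calls, and the `weight_...` naming scheme is a hypothetical convention chosen only for this illustration.

```scala
object CaseTable {
  // All 2^5 = 32 assignments of the five parent Flips.
  val cases: Seq[Vector[Boolean]] =
    (0 until 32).map(i => Vector.tabulate(5)(bit => ((i >> bit) & 1) == 1))

  // Hypothetical naming scheme for the per-case parameter, e.g. "weight_TFFTF".
  def parameterName(c: Vector[Boolean]): String =
    "weight_" + c.map(b => if (b) 'T' else 'F').mkString

  def main(args: Array[String]): Unit =
    cases.map(parameterName).foreach(println)
}
```

Each fully observed model instance matches exactly one of these 32 names, which is what leaves the other 31 weights unobserved for that instance.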
Next, consider a randomized sequence of model instances with alternating observations for these 5 variables, where every variable is always observed to be either true or false, thus per definition matching exactly one case in the aforementioned RichCPD, for each model instance.
The way that Figaro works (supported by your explanation), we will - for each model instance - get a more accurate learned estimate for the weight for the specific case that this model instance matches, while all other cases simultaneously are altered to increase the confidence in all of their prior beliefs.
Next, imagine that the Beta-elements are initialized as such: Beta(2,10), namely we imagine having seen 1 positive case and 9 negative cases.
Now, considering the randomly generated model instances, no single case C1 will ever be able to "escape" its prior beliefs, even if C1 were only ever matched by positive model instances (where the CPD is observed to be true). This is because the few instances (1 out of 32 at random) where the 5 variables' values match exactly C1 will be statistically insignificant: the learning from those few instances is drowned out by the 31/32 instances where C1 is not hit (each of which reinforces the prior belief of 1 positive vs. 9 negative).
Does this make sense? In our case, for instance, we have an expectation that one of our cases should provide a low probability when looking at the data. If we supply a prior belief that matches this, then it simply gets enforced further and looks reasonable.
But if we flip the prior beliefs (initial alpha and beta values), then it can never escape the doom that is the poorly chosen prior beliefs. Shouldn't the data, in a learning scenario, always be able to trump the prior beliefs for any and all cases?
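A rough way to probe this numerically is sketched below in plain Scala with no Figaro calls. It rests on an assumption that has not been checked against Figaro's source: per EM iteration, every instance that does not match the case adds the current estimate p as a fractional positive count and 1 - p as a fractional negative count. The helper names and the 320/10 instance split are hypothetical.

```scala
object PriorEscape {
  // Assumed per-iteration update for one case's weight (not Figaro API).
  // `matches` instances hit the case and are all positive; the remaining
  // `total - matches` instances leave the weight unobserved and contribute
  // the current estimate p as a fractional count.
  def step(p: Double, total: Int, matches: Int,
           priorAlpha: Double = 2.0, priorBeta: Double = 10.0): Double = {
    val misses = total - matches
    val alpha = priorAlpha + matches + misses * p
    val beta  = priorBeta + misses * (1.0 - p)
    (alpha - 1.0) / (alpha + beta - 2.0)
  }

  // Estimate after `iterations` steps, starting from the Beta(2,10) mode 1/10.
  def estimate(total: Int, matches: Int, iterations: Int): Double =
    (1 to iterations).foldLeft(0.1)((p, _) => step(p, total, matches))

  def main(args: Array[String]): Unit = {
    // 320 instances, 10 of which (about 1 in 32) match the case, all positive.
    for (n <- Seq(1, 10, 100, 1000))
      println(s"after $n iterations: ${estimate(320, 10, n)}")
  }
}
```

Under this assumed rule, with N instances of which m match, the gap to (1 + m) / (10 + m) shrinks by a factor of (N - m) / (N + 10) per iteration. How strongly the prior dominates would then depend on how many EM iterations are run, which is worth checking on the real model.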
I hope the above made some sense to you, and if not, feel free to have me elaborate or make a concrete example.
In short, assuming that Figaro's default manner of training - as you described it - simply does not work in our case; is there some other approach that we may want to consider taking to training?
For instance, we would imagine that something like a plain MLE training would do the trick for us, provided that the same kind of prior-belief learning does not take place. Is there any access point to something like MLE in Figaro? (We have been unable to find anything like it thus far)
Finally, considering your additional suggestion/hint regarding using constraints to nullify the probability of some variables:
Are you suggesting that we could, considering the example I gave in this comment, do this for each of the cases that do not match the concrete data instance in scope at a given moment?
That is my interpretation at least, but sadly it appears that (following the original example in my post) adding a constraint to "weightFalse" when iterating over the 100 model instances where both "flip" and "cpd" are observed to be true results in no learning for either "weightFalse" or "weightTrue", which suggests that this approach doesn't quite work out either.
It would've otherwise been a decent approach to temporarily get us back on track at least.
More importantly: Is there some theoretical backing behind (a) how Figaro does things and (b) the approach you suggested which will "zero out the sufficient statistics"?
- References would be much appreciated :)
Best regards,
Kenneth