mgormley/pacaya

Wrong configuration index passed to ExpFamFactors with observed variables

Closed this issue · 1 comments

Producing the Issue

The issue can be produced by running the following simple model. Please find the complete code at this gist.

Suppose we have the following definition of a Boolean variable:

public class BooleanVar extends Var {
    static ArrayList<String> getBooleanStates() {
        ArrayList<String> booleanStates = new ArrayList<String>();
        booleanStates.add("F");
        booleanStates.add("T");
        return booleanStates;
    }

    public BooleanVar(String name, VarType varType) {
        super(varType, 2, name, getBooleanStates());
    }
}

And we have a factor that evaluates 4 BooleanVars:

public class MyFactor extends ExpFamFactor {
    BooleanVar v0;
    BooleanVar v1;
    BooleanVar v2;
    BooleanVar v3;

    public MyFactor(BooleanVar v0, BooleanVar v1, BooleanVar v2, BooleanVar v3) {
        super(new VarSet(v0, v1, v2, v3));
        this.v0 = v0;
        this.v1 = v1;
        this.v2 = v2;
        this.v3 = v3;
    }

    /**
     * A simple feature extractor that returns the feature vector, [1.0], for the case where all 
     * variables are "T", and returns [0.0] for other cases. 
     */
    public FeatureVector getFeatureVectorOf(int value0, int value1, int value2, int value3) {
        if (value0 == 1 && value1 == 1 && value2 == 1 && value3 == 1) {
            double[] values = new double[1];
            values[0] = 1.0;
            return new FeatureVector(values);
        }
        else {
            double[] values = new double[1];
            values[0] = 0.0;
            return new FeatureVector(values);
        }
    }

    @Override
    public FeatureVector getFeatures(int configIdx) {
        System.out.println(configIdx);
        VarConfig curConfig = getVars().getVarConfig(configIdx);
        int value0 = curConfig.getState(this.v0);
        int value1 = curConfig.getState(this.v1);
        int value2 = curConfig.getState(this.v2);
        int value3 = curConfig.getState(this.v3);
        return getFeatureVectorOf(value0, value1, value2, value3);
    }
}

Notice that the first line in the getFeatures method prints out the configIdx, making it possible for us to see what values of configIdx are ever passed to getFeatures. This is how I spotted the issue (explained in detail later).

Construct a simple factor graph using the following code (notice that v0 and v1 are observed variables):

FactorGraph fg = new FactorGraph();
BooleanVar v0 = new BooleanVar("v0", Var.VarType.OBSERVED);
BooleanVar v1 = new BooleanVar("v1", Var.VarType.OBSERVED);
BooleanVar v2 = new BooleanVar("v2", Var.VarType.PREDICTED);
BooleanVar v3 = new BooleanVar("v3", Var.VarType.PREDICTED);

MyFactor f = new MyFactor(v0, v1, v2, v3);
fg.addFactor(f);

And make up an example list with only one example:

VarConfig exampleConf = new VarConfig();
exampleConf.put(v0, "T");
exampleConf.put(v1, "F");
exampleConf.put(v2, "F");
exampleConf.put(v3, "T");

LabeledFgExample example = new LabeledFgExample(fg, exampleConf);
FgExampleMemoryStore examples = new FgExampleMemoryStore();
examples.add(example);

The problem occurs when the following training code is executed:

FgModel model = new FgModel(1);
CrfTrainerPrm prm = new CrfTrainerPrm();
CrfTrainer trainer = new CrfTrainer(prm);
trainer.train(model, examples);

What Is The Issue

From the terminal print out, it is easy to see that the configIdx passed to the getFeatures function in the class MyFactor ranges from 0 to 3, indicating that there are only 4 possible configurations. However, since this factor connects to 4 variables, there are in fact 16 configuration. So a configIdx ranging from 0 to 15 is expected here, instead of one ranging from 0 to 3.

I understand that the range, 0 to 3, is calculated from only the unobserved variables using the two variables whose types are VarType.PREDICTED (v2 and v3):

(new VarSet(v2, v3)).calcNumConfigs()

And I know I could get the correct states with the following code

(new VarSet(v2, v3)).getVarConfig(configIdx)

But what about the current states of v0 and v1, the observed variables?

Possible Cause

The function call that triggered the getFeatures method in MyFactor can be traced back to lines 164 to 166 in ExpFamFactor.java, using the call stack.

164   public double getDotProd(int config, FgModel model, boolean logDomain) {
165       return unclmpFactor.getDotProd(config, model, logDomain);
166   }

This function uses the unclamped factor's getFeatures function. However, the config is from line 68 in the same file, which is a configuration index of the non-observed variables.

Great description of the problem!

In Pacaya v2, OBSERVED variables were intentionally left out of the config index. Since they are observed, they always have the same value -- further, this design decision ensures that factors only need to represent the scores for the possible configurations of the PREDICTED and LATENT variables.

In Pacaya v3 (the current version), OBSERVED variables were removed entirely. This encourages CRF-style use of observed variables. That is, we don't have a variable representing a sentence in a linear-chain CRF POS tagger. Instead, the features can refer to the sentence for free with no adverse performance impact. If the behavior of OBSERVED variables is still desired, one can still "clamp" a factor graph to a particular configuration.