CogComp/cogcomp-nlp

View doesn't know about identical constituents

mayhewsw opened this issue · 7 comments

If a view has identical constituents (label, span, score) on token i, then getConstituentsCoveringToken(i) always returns a single constituent even though multiple constituents are there.

This is because View.java:453 uses a Set to aggregate constituents.

I think the initial thinking was that there won't be any overlapping constituents (with exactly the same boundaries) in each view.
What is your use case?

There is no restriction on adding identical (or overlapping) constituents. I guess if that's allowed (which it should be) then it should report the correct number that cover a token.

Just to make sure I understand your point I summarize what you just said,

  • It should be allowed to ad identical constituents (maybe with a warning?)
  • when using the getConstituentsCoveringToken we should be able to retrieve all the identical constituents (which according to Stephen, it is not the case)

Yes, although I'm rethinking the idea of allowing identical constituents now... maybe that shouldn't be allowed. Alternatively, it could just keep constituents in a set, so identical constituents are folded together.

Either way, getConstituentsCoveringToken should reflect the number, whether or not we allow identical.

Makes sense.

@mssammon do you have any thoughts on this?

Looking at the Constituent hashcode() and equals methods, the observed behavior only happens when the relevant constituents are identical. SpanLabelView, in theory, offers control over whether duplicate constituents can be added (I think actually this is not sufficiently strictly enforced) but that's not true for most other View types.
Personally, I don't see the point of allowing identical constituents, though maybe there's a use case for this somewhere. We could fix this by making the Constituents field in View a QueryableSet instead of a QueryableList. If at some point in the future a compelling case for allowing duplicates is made, we can create a new type of View for it.

For what it's worth, here's a snippet that demonstrates the issue:

String[] sentence1 = {"John",  "ate", "breakfast", "gladly", "."};

List<String[]> tokenizedSentences = Collections.singletonList(sentence1);
TextAnnotation ta = BasicTextAnnotationBuilder.createTextAnnotationFromTokens("", "", tokenizedSentences);

SpanLabelView ner = new SpanLabelView("NER", ta);

// label token 0 as PER (not actually true)
Constituent c = new Constituent("PER", "NER", ta, 0,1);
ner.addConstituent(c);

// new object, but identical to c
Constituent c2 = new Constituent("PER", "NER", ta, 0,1);
ner.addConstituent(c2);

ta.addView("NER", ner);

// This returns just one constituent.
System.out.println(ner.getConstituentsCoveringToken(0));

// This returns two constituents.
System.out.println(ner.getConstituents());

// these constituents have equality, according to the measure implemented in Constituent.java
System.out.println(c.equals(c2));

// this fails (as Mark pointed out)
ner.addSpanLabel(0,1,"PER",1.);