Lazy instance parsing in Corpus
alexanderkoller opened this issue · 0 comments
Currently, an Instance is a bundle of algebra values. These values are provided either by parsing a line of the corpus in Corpus#readCorpusWrapper or by direct construction, e.g. in the CorpusConverter and its uses.
This is limiting: a corpus can only contain interpretations whose algebra implements Algebra#parseString, which is not necessarily the case if the interpretation is intended for outputs only. Without this restriction, one could easily implement an "exact match" evaluation in the ParsingEvaluator script, which parses some input interpretations into an output interpretation and then checks for string equality.
We should change the Corpus class to parse instances lazily: the interpretations of an instance are stored as strings until someone requests an algebra value for a specific interpretation, at which point the string is parsed into an algebra value and cached. Here are some thoughts on the ramifications.
- The Instance must be able to deliver algebra values from strings when requested, so it needs to know the associated IRTG (or at least, an "AlgebraBundle" interface which could be instantiated with either an IRTG or just a map of interpretation names to algebras).
- As far as I can tell, Instances are only ever created in contexts where IRTGs are known, so this should be fine.
- One could implement getInputObjects by parsing all strings to values, caching them in a map, and then returning the map, as it does now.
- One might rename getInputObjects to getAllObjects and add a more fine-grained method getObject(interpretation) to avoid parsing all interpretations. This should probably be accompanied by methods getInterpretations() and hasInterpretation().
- There need to be additional methods such as getString(interpretation). How should we deal with cases where we directly construct the value rather than the string, as e.g. in PTBConverter?
- We should check that Corpus/Instance are covered well by unit tests before making such a deep change.
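
To make the proposal above concrete, here is a minimal sketch of what a lazily parsed instance could look like. It is not Alto's actual Instance class: the name LazyInstance, the setString/setObject methods, and the use of a plain map of `Function<String, ?>` parsers as a stand-in for the "AlgebraBundle" (in place of Algebra#parseString) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

/** Hypothetical sketch of a lazily parsed instance; not Alto's real Instance. */
final class LazyInstance {
    // Raw corpus strings, keyed by interpretation name.
    private final Map<String, String> strings = new HashMap<>();
    // Algebra values that were parsed on demand or supplied directly.
    private final Map<String, Object> values = new HashMap<>();
    // Stand-in for the "AlgebraBundle": interpretation name -> string parser.
    // An output-only interpretation simply has no entry here.
    private final Map<String, Function<String, ?>> parsers;

    LazyInstance(Map<String, Function<String, ?>> parsers) {
        this.parsers = parsers;
    }

    /** Stores the unparsed string and drops any stale cached value. */
    void setString(String interpretation, String s) {
        strings.put(interpretation, s);
        values.remove(interpretation);
    }

    /** Direct construction of a value (the PTBConverter-style case). */
    void setObject(String interpretation, Object value) {
        values.put(interpretation, value);
    }

    boolean hasInterpretation(String interpretation) {
        return strings.containsKey(interpretation) || values.containsKey(interpretation);
    }

    Set<String> getInterpretations() {
        Set<String> names = new TreeSet<>(strings.keySet());
        names.addAll(values.keySet());
        return names;
    }

    /** Returns the raw string, or null if the value was constructed directly. */
    String getString(String interpretation) {
        return strings.get(interpretation);
    }

    /** Parses the string on first request and caches the result. */
    Object getObject(String interpretation) {
        if (values.containsKey(interpretation)) {
            return values.get(interpretation);
        }
        String s = strings.get(interpretation);
        if (s == null) {
            throw new IllegalArgumentException("Unknown interpretation: " + interpretation);
        }
        Function<String, ?> parser = parsers.get(interpretation);
        if (parser == null) {
            throw new UnsupportedOperationException("No string parser for: " + interpretation);
        }
        Object value = parser.apply(s);
        values.put(interpretation, value);
        return value;
    }

    /** Parses everything; fails if some interpretation has no parser. */
    Map<String, Object> getAllObjects() {
        Map<String, Object> all = new HashMap<>();
        for (String name : getInterpretations()) {
            all.put(name, getObject(name));
        }
        return all;
    }
}
```

In this sketch, an output-only interpretation can live in the corpus as a string and be compared via getString without ever being parsed. Returning null from getString for directly constructed values is only one possible answer to the PTBConverter question; the alternative would be to require a way of rendering values back to strings.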