Implementation of Scoring
Need to give each LU (lexical unit) a score which can be modified later by algorithms.
Still deciding which data structure to use for this.
Currently thinking a vector of (word, score) pairs might work.
Algorithm: Read the input left to right; at each word boundary, apply the relevant score additions/subtractions using CG, and keep doing this until we reach a pronoun. Then sort the structure and apply the highest-scored reference. (Sort a temporary copy, since we don't want to change the original structure and lose the order of the words.)
Note: Scores may be applied later as well.
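A minimal sketch of the scoring pass described above, in Python for brevity. Everything named here is an assumption for illustration: the tag names "np", "n" and "prn", the weights in TAG_WEIGHTS, and the use of the word's position as its id. It is not the project's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical weights; the real values would come from the scoring rules.
TAG_WEIGHTS = {"@subj": 2, "@obj": 1}


@dataclass
class Candidate:
    word_id: int   # position in the input, used here as the unique id
    word: str
    score: int = 0


def resolve(tokens):
    """Read (word, tags) pairs left to right and link each pronoun
    to the highest-scored candidate seen so far."""
    candidates = []   # kept in input order, never reordered
    links = {}        # pronoun id -> antecedent id
    for word_id, (word, tags) in enumerate(tokens):
        if "prn" in tags:
            if candidates:
                # sorted() builds a temporary list, so the word order
                # in `candidates` is preserved.
                ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
                links[word_id] = ranked[0].word_id
            continue
        if "np" in tags or "n" in tags:
            score = sum(TAG_WEIGHTS.get(tag, 0) for tag in tags)
            candidates.append(Candidate(word_id, word, score))
    return links, candidates


tokens = [
    ("John", {"np", "@subj"}),
    ("saw", {"vblex"}),
    ("Mary", {"np", "@obj"}),
    ("He", {"prn", "@subj"}),
]
links, candidates = resolve(tokens)
```

Because the scores stay on the Candidate objects, later passes can still adjust them, as the note above allows.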
using CG
Do you mean using the @subj etc. syntactic function annotations that the CG module from earlier in the pipeline gives?
yep, since the score will be given based on these syntactic annotations.
are these annotations available earlier in the pipeline? where exactly?
Also, I realised that we don't just have to give scores but also build coreference chains, so I have to decide on a data structure for that too.
Second issue: each word needs a unique id for reference. Obviously we can't give all the words ids, so maybe drop ids after a certain threshold, e.g. after 100 words. Also, when a word is referred to, the pronoun gets added to its coreference chain, which means the antecedent stays and doesn't get deleted. It gets refreshed, basically.
This ensures one can do anaphora resolution for stories which keep using a pronoun for the same subject, i.e. use the subject's name once and a pronoun for all further references.
E.g. John is a 6 year old boy. He likes to eat xyz. .... He is the son of abc. He plays basketball.
Creating coreference chains will be done after the first eval, but the scoring mechanism can be built before that.
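A rough sketch of the chain idea from this comment, assuming word ids are input positions and a hypothetical 100-word window. The class and method names (ChainStore, add_antecedent, link, prune) are made up for illustration. Linking a pronoun refreshes the antecedent, so it survives pruning as long as it keeps being referred to.

```python
class ChainStore:
    """Coreference chains keyed by antecedent id (illustrative only)."""

    def __init__(self, window=100):
        self.window = window   # hypothetical threshold, in words
        self.chains = {}       # antecedent id -> list of mention ids
        self.last_seen = {}    # antecedent id -> id of the latest mention

    def add_antecedent(self, word_id):
        self.chains[word_id] = [word_id]
        self.last_seen[word_id] = word_id

    def link(self, pronoun_id, antecedent_id):
        # Adding the pronoun to the chain refreshes the antecedent.
        self.chains[antecedent_id].append(pronoun_id)
        self.last_seen[antecedent_id] = pronoun_id

    def prune(self, position):
        # Drop antecedents not mentioned within the last `window` words.
        stale = [a for a, seen in self.last_seen.items()
                 if position - seen > self.window]
        for a in stale:
            del self.chains[a]
            del self.last_seen[a]
        return stale


store = ChainStore(window=100)
store.add_antecedent(0)      # "John"
store.add_antecedent(5)      # some other noun, never referred to again
store.link(90, 0)            # "He" at position 90 refers back to John
dropped = store.prune(150)   # John was refreshed at 90, so he stays
```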
are these annotations available earlier in the pipeline? where exactly?
In https://github.com/apertium/apertium-eng-spa/blob/master/modes.xml#L3 you'd typically put a cg-proc eng-spa.rlx.bin right after the first lt-proc …automorf.bin to do rule-based morphological disambiguation, and then cg-proc eng-spa.syn.rlx.bin after apertium-tagger (or before it, if the disambiguation rules are very good). I just added a stub syntax CG in apertium/apertium-eng@a4e721b. You can make a new <mode> in eng-spa with these steps.
Obviously we can't give all the words ids, so maybe drop ids after a certain threshold
I think CG actually never drops ids, at least they get quite high – @TinoDidriksen ?
CG can handle almost the full unsigned 32 bit range of tokens, so well over 4 billion, each with their own unique ID. I doubt you'll run into any CG limits.
Alright, perfect, so we won't drop ids. I'll study the cg-proc module.
Closed for now because I'm implementing the Mitkov algorithm first.