Implementation of Scoring

Question

Closed this issue 5 years ago · 9 comments

Each LU (lexical unit) needs a score that algorithms can modify later.
Which data structure should we use for this?

Currently I think a vector of (word, score) pairs might work.

Algorithm: Read the input left to right. At each word boundary, apply the relevant score additions and subtractions using CG, and keep doing this until we reach a pronoun. Then sort the structure and pick the highest-scored word as the referent. (Sort a temporary copy rather than the structure itself, so that the original word order is preserved.)

Note: Scores may be applied later as well.
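
A minimal sketch of the scoring loop described above, written in Python for brevity. The tag weights, the pronoun list, and the names `Candidate`, `best_antecedent`, and `resolve` are all assumptions made for illustration; the thread does not fix any of them.

```python
from dataclasses import dataclass

# Hypothetical weights per CG syntactic tag; the thread does not specify any.
TAG_WEIGHTS = {"@subj": 2, "@obj": 1}


@dataclass
class Candidate:
    word: str
    score: int = 0  # mutable, so later algorithms can still adjust it


def best_antecedent(candidates):
    """Return the highest-scored candidate without reordering the input list."""
    if not candidates:
        return None
    # sorted() builds a new list, so the original word order is preserved.
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return ranked[0]


def resolve(tokens, pronouns=frozenset({"he", "she", "it", "they"})):
    """Read (word, tags) pairs left to right and resolve each pronoun.

    Returns a list of (pronoun_index, antecedent_word) pairs.
    """
    candidates = []
    resolved = []
    for index, (word, tags) in enumerate(tokens):
        if word.lower() in pronouns:
            best = best_antecedent(candidates)
            if best is not None:
                resolved.append((index, best.word))
            continue
        # Score the word from its syntactic annotations (assumed weights).
        score = sum(TAG_WEIGHTS.get(tag, 0) for tag in tags)
        candidates.append(Candidate(word, score))
    return resolved
```

Because each `Candidate` keeps its own mutable `score`, a later pass can still raise or lower it, and sorting a copy leaves the left-to-right order of `candidates` intact.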

Answer 1 · 2019-06-21T11:50:00.000Z

Using CG? Do you mean using the @subj etc. syntactic function annotations that the CG module from earlier in the pipeline gives?

Answer 2 · 2019-06-21T17:26:18.000Z

Yep, since the score will be based on these syntactic annotations.
Are these annotations available earlier in the pipeline? Where exactly?

Also, I realised that we don't just have to assign scores but also build coreference chains, so we have to decide on a data structure for that too.

Answer 3 · 2019-06-21T21:39:21.000Z

Second issue: each word needs a unique id for reference. Obviously we can't give all the words ids, so maybe drop ids after a certain threshold, e.g. after 100 words. Also, when a word is referred to, the pronoun gets added to its coreference chain, which means the antecedent stays and doesn't get deleted. It gets refreshed, basically.

This ensures that anaphora resolution works for stories that keep using a pronoun for the same subject, i.e. that use the subject's name once and a pronoun for all further references.

E.g. John is a 6-year-old boy. He likes to eat xyz. .... He is the son of abc. He plays basketball.
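
The refresh idea proposed above could be sketched roughly as follows. This is only an illustration under assumptions: the class name `MentionWindow`, its methods, and the use of word positions as the threshold unit are hypothetical, and replies further down this thread conclude that dropping ids is not needed at all.

```python
class MentionWindow:
    """Keep antecedents active for `window` words after their last mention."""

    def __init__(self, window=100):
        self.window = window
        self.last_seen = {}  # word id -> position of the most recent mention
        self.chains = {}     # antecedent id -> ids of pronouns referring to it

    def add(self, word_id, position):
        """Register a new candidate antecedent at the given word position."""
        self.last_seen[word_id] = position
        self.chains.setdefault(word_id, [])

    def refer(self, pronoun_id, antecedent_id, position):
        """Link a pronoun to an active antecedent and refresh the antecedent."""
        if antecedent_id not in self.last_seen:
            raise KeyError(f"antecedent {antecedent_id} is no longer active")
        self.chains[antecedent_id].append(pronoun_id)
        # Refreshing the position keeps a frequently referenced word alive.
        self.last_seen[antecedent_id] = position

    def expire(self, position):
        """Drop candidates not mentioned within the last `window` words."""
        stale = [w for w, seen in self.last_seen.items()
                 if position - seen > self.window]
        for word_id in stale:
            del self.last_seen[word_id]  # chains are kept for later lookup
```

In the John example, every "He" would call `refer` on John's id, so John stays active no matter how long the story runs, while words that are never referred to again expire.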

Answer 4 · 2019-06-21T21:39:53.000Z

Creating coreference chains will be done after the first eval, but the scoring mechanism can be built before that.

Answer 5 · 2019-06-22T09:05:23.000Z

> are these annotations available earlier in the pipeline? where exactly?

In https://github.com/apertium/apertium-eng-spa/blob/master/modes.xml#L3 you'd typically put a cg-proc eng-spa.rlx.bin right after the first lt-proc …automorf.bin to do rule-based morphological disambiguation, and then cg-proc eng-spa.syn.rlx.bin after apertium-tagger (or before it, if the disambiguation rules are very good). I just added a stub syntax CG in apertium/apertium-eng@a4e721b . You can make a new <mode> in eng-spa with these steps.

Answer 6 · 2019-06-22T09:06:12.000Z

> Obviously we can't give all the words ids, so maybe drop ids after a certain threshold

I think CG actually never drops ids, at least they get quite high – @TinoDidriksen ?

Answer 7 · 2019-06-22T09:43:08.000Z

CG can handle almost the full unsigned 32-bit range of tokens, so well over 4 billion, each with its own unique ID. I doubt you'll run into any CG limits.

Answer 8 · 2019-06-22T10:00:46.000Z

Alright, perfect, so we won't drop ids. I'll study the cg-proc module.

Answer 9 · 2019-06-28T15:29:55.000Z

Closed for now because we are implementing the Mitkov algorithm first.