Different score using different compute f1 functions
ZhaofengWu opened this issue · 8 comments
Hi,
This code snippet (adapted from RunModel_CoNLL_Format.py) produces different outputs for the last two lines. However, shouldn't we expect the same output?
```python
#!/usr/bin/python
from __future__ import print_function
from util.preprocessing import readCoNLL, createMatrices, addCharInformation, addCasingInformation
from util.BIOF1Validation import compute_f1_token_basis
from neuralnets.BiLSTM import BiLSTM
import sys
import logging

if len(sys.argv) < 3:
    print("Usage: python RunModel.py modelPath inputPathToConllFile")
    exit()

modelPath = sys.argv[1]
inputPath = sys.argv[2]
inputColumns = {0: 'tokens', 1: 'NER_BIO'}

# :: Prepare the input ::
sentences = readCoNLL(inputPath, inputColumns)
addCharInformation(sentences)
addCasingInformation(sentences)

# :: Load the model ::
lstmModel = BiLSTM.loadModel(modelPath)
dataMatrix = createMatrices(sentences, lstmModel.mappings, True)

print(compute_f1_token_basis(list(lstmModel.tagSentences(dataMatrix).values())[0], [s['NER_BIO'] for s in sentences], 'O'))
print(lstmModel.computeF1(list(lstmModel.models.keys())[0], dataMatrix))
```
`compute_f1_token_basis` computes F1 on a token basis. For example, when the model predicts the tags `B-PER I-PER` but the gold labels are `B-PER O`, it gets a recall of 100% and a precision of 50%.

`model.computeF1` computes F1 based on chunks, like the CoNLL-2000 eval script: a chunk only counts as correct if its boundaries and type match exactly. For the above example, it would get 0% recall and 0% precision.

The token-basis F1 score is useful when you tag really long chunks and missing a token at the start or end is not so bad.
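For concreteness, here is a minimal sketch of the two counting schemes that reproduces the numbers above. The helper names (`token_f1`, `bio_chunks`, `chunk_f1`) are illustrative, not the repository's API, and the chunk extraction only handles the simple BIO case:

```python
def token_f1(pred, gold, outside='O'):
    # Token basis: every non-O token is scored individually.
    tp = sum(p == g and g != outside for p, g in zip(pred, gold))
    pred_pos = sum(p != outside for p in pred)
    gold_pos = sum(g != outside for g in gold)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    return precision, recall

def bio_chunks(tags):
    # Extract (start, end, type) spans from a simple BIO sequence.
    spans, start = [], None
    for i, tag in enumerate(tags + ['O']):
        if start is not None and not tag.startswith('I-'):
            spans.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith('B-'):
            start = i
    return spans

def chunk_f1(pred, gold):
    # Chunk basis (CoNLL-style): a chunk counts only if its
    # boundaries and type match exactly.
    p, g = set(bio_chunks(pred)), set(bio_chunks(gold))
    tp = len(p & g)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    return precision, recall

pred = ['B-PER', 'I-PER']
gold = ['B-PER', 'O']
print(token_f1(pred, gold))  # (0.5, 1.0): precision 50%, recall 100%
print(chunk_f1(pred, gold))  # (0.0, 0.0): precision 0%, recall 0%
```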
So `compute_f1_token_basis` is like the CoNLL-2003 eval script?
No. `model.computeF1` is like the CoNLL eval script.
Are the CoNLL-2003 and CoNLL-2000 eval scripts the same? Because you said `model.computeF1` is like the 2000 script.
CoNLL-2003 used the script from CoNLL-2000, so the same evaluation method was used for both shared tasks.
I see. What about `compute_f1_argument` and `compute_f1_argument_token_basis`? I assume they parallel `compute_f1` and `compute_f1_token_basis`. But what is the distinction between argument and non-argument?
I used the `_argument` F1 functions for event argument extraction, where a sentence can have multiple event triggers and each trigger can have multiple arguments. They are not used in the provided architecture, and I should remove them.
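For illustration, a toy sketch of what that setting might look like; the format below is hypothetical, not the one this repository uses:

```python
# Hypothetical illustration of event argument extraction: one sentence,
# two event triggers, and a separate BIO argument sequence per trigger.
tokens = ['Alice', 'founded', 'Acme', 'and', 'later', 'joined', 'Initech']

# Arguments of the trigger 'founded' (token 1):
args_for_founded = ['B-Agent', 'O', 'B-Org', 'O', 'O', 'O', 'O']

# Arguments of the trigger 'joined' (token 5):
args_for_joined = ['B-Agent', 'O', 'O', 'O', 'O', 'O', 'B-Org']

# An _argument F1 function would score each trigger's argument spans
# separately and aggregate the counts over all triggers.
```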
Thanks!