Different score using different compute f1 functions
ZhaofengWu opened this issue · 8 comments
Hi,
This code snippet (adapted from RunModel_CoNLL_Format.py) produces different outputs for the last two lines. However, shouldn't we expect the same output?
```python
#!/usr/bin/python
from __future__ import print_function
from util.preprocessing import readCoNLL, createMatrices, addCharInformation, addCasingInformation
from util.BIOF1Validation import compute_f1_token_basis
from neuralnets.BiLSTM import BiLSTM
import sys
import logging

if len(sys.argv) < 3:
    print("Usage: python RunModel.py modelPath inputPathToConllFile")
    exit()

modelPath = sys.argv[1]
inputPath = sys.argv[2]
inputColumns = {0: 'tokens', 1: 'NER_BIO'}

# :: Prepare the input ::
sentences = readCoNLL(inputPath, inputColumns)
addCharInformation(sentences)
addCasingInformation(sentences)

# :: Load the model ::
lstmModel = BiLSTM.loadModel(modelPath)
dataMatrix = createMatrices(sentences, lstmModel.mappings, True)

print(compute_f1_token_basis(list(lstmModel.tagSentences(dataMatrix).values())[0], [s['NER_BIO'] for s in sentences], 'O'))
print(lstmModel.computeF1(list(lstmModel.models.keys())[0], dataMatrix))
```
`compute_f1_token_basis` computes F1 on a token basis. For example, when the model predicts the tags `B-PER I-PER` but the gold labels are `B-PER O`, it gets a recall of 100% and a precision of 50%.

`model.computeF1` computes F1 based on chunks, like the CoNLL-2000 eval script: a chunk only counts as correct if its boundaries and type match exactly. For the above example, it would get 0% recall and 0% precision.

The token-basis F1 score is useful when you tag really long chunks and missing a token at the start or end is not so bad.
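For concreteness, here is a minimal sketch of the two counting schemes that reproduces the numbers above. The helper names (`token_f1`, `bio_chunks`, `chunk_f1`) are illustrative, not the repository's API, and the chunk extraction only handles the simple BIO case:

```python
def token_f1(pred, gold, outside='O'):
    # Token basis: every non-O token is scored individually.
    tp = sum(p == g and g != outside for p, g in zip(pred, gold))
    pred_pos = sum(p != outside for p in pred)
    gold_pos = sum(g != outside for g in gold)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    return precision, recall

def bio_chunks(tags):
    # Extract (start, end, type) spans from a simple BIO sequence.
    spans, start = [], None
    for i, tag in enumerate(tags + ['O']):
        if start is not None and not tag.startswith('I-'):
            spans.append((start, i, tags[start][2:]))
            start = None
        if tag.startswith('B-'):
            start = i
    return spans

def chunk_f1(pred, gold):
    # Chunk basis (CoNLL-style): a chunk counts only if its
    # boundaries and type match exactly.
    p, g = set(bio_chunks(pred)), set(bio_chunks(gold))
    tp = len(p & g)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    return precision, recall

pred = ['B-PER', 'I-PER']
gold = ['B-PER', 'O']
print(token_f1(pred, gold))  # (0.5, 1.0): precision 50%, recall 100%
print(chunk_f1(pred, gold))  # (0.0, 0.0): precision 0%, recall 0%
```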
So `compute_f1_token_basis` is like the CoNLL-2003 eval script?
No. `model.computeF1` is like the CoNLL eval script.
Are the CoNLL-2003 and CoNLL-2000 eval scripts the same? Because you said `model.computeF1` is like the 2000 script.
CoNLL-2003 used the script from CoNLL-2000, so the same evaluation method was used for both shared tasks.
I see. What about `compute_f1_argument` and `compute_f1_argument_token_basis`? I assume they parallel `compute_f1` and `compute_f1_token_basis`. But what is the distinction between argument and non-argument?
I used the `_argument` F1 functions for event argument extraction, where a sentence can have multiple event triggers and each trigger can have multiple arguments. They are not used in the provided architecture, and I should remove them.
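For illustration, a toy sketch of what that setting might look like; the format below is hypothetical, not the one this repository uses:

```python
# Hypothetical illustration of event argument extraction: one sentence,
# two event triggers, and a separate BIO argument sequence per trigger.
tokens = ['Alice', 'founded', 'Acme', 'and', 'later', 'joined', 'Initech']

# Arguments of the trigger 'founded' (token 1):
args_for_founded = ['B-Agent', 'O', 'B-Org', 'O', 'O', 'O', 'O']

# Arguments of the trigger 'joined' (token 5):
args_for_joined = ['B-Agent', 'O', 'O', 'O', 'O', 'O', 'B-Org']

# An _argument F1 function would score each trigger's argument spans
# separately and aggregate the counts over all triggers.
```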
Thanks!