UKPLab/emnlp2017-bilstm-cnn-crf

Different score using different compute f1 functions

ZhaofengWu opened this issue · 8 comments

Hi,

This code snippet (adapted from RunModel_CoNLL_Format.py) produces different outputs for the last two print lines. Shouldn't we expect the same output from both?

#!/usr/bin/python
from __future__ import print_function
from util.preprocessing import readCoNLL, createMatrices, addCharInformation, addCasingInformation
from neuralnets.BiLSTM import BiLSTM
import sys
import logging

if len(sys.argv) < 3:
    print("Usage: python RunModel.py modelPath inputPathToConllFile")
    exit()

modelPath = sys.argv[1]
inputPath = sys.argv[2]
inputColumns = {0: "tokens", 1: 'NER_BIO'}

# :: Prepare the input ::
sentences = readCoNLL(inputPath, inputColumns)
addCharInformation(sentences)
addCasingInformation(sentences)

# :: Load the model ::
lstmModel = BiLSTM.loadModel(modelPath)
dataMatrix = createMatrices(sentences, lstmModel.mappings, True)

# :: Compare the two F1 computations ::
from util.BIOF1Validation import compute_f1_token_basis

predictions = list(lstmModel.tagSentences(dataMatrix).values())[0]
goldLabels = [sentence['NER_BIO'] for sentence in sentences]
print(compute_f1_token_basis(predictions, goldLabels, 'O'))  # token-basis F1
print(lstmModel.computeF1(list(lstmModel.models.keys())[0], dataMatrix))  # chunk-based F1

compute_f1_token_basis computes F1 on a token basis. For example, if the model predicts the tags B-PER I-PER and the gold labels are B-PER O, it gets a recall of 100% and a precision of 50%.
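To make the token arithmetic concrete, here is a minimal sketch (token_basis_prf is a hypothetical helper for illustration; compute_f1_token_basis in util/BIOF1Validation.py is the authoritative implementation). A token counts as correct when its predicted tag equals the gold tag, and O-tagged tokens are excluded from the respective denominators:

def token_basis_prf(predicted, gold, o_label='O'):
    # Tokens whose predicted tag matches the gold tag (O tokens ignored)
    correct = sum(1 for p, g in zip(predicted, gold) if p == g and p != o_label)
    pred_count = sum(1 for p in predicted if p != o_label)  # predicted non-O tokens
    gold_count = sum(1 for g in gold if g != o_label)       # gold non-O tokens
    precision = correct / float(pred_count) if pred_count else 0.0
    recall = correct / float(gold_count) if gold_count else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(token_basis_prf(['B-PER', 'I-PER'], ['B-PER', 'O']))  # (0.5, 1.0, 0.666...)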

model.computeF1 computes the F1 based on chunks, like the CoNLL 2000 eval script. For the above example, it would get 0% recall and 0% precision: the predicted PER chunk spans two tokens but the gold PER chunk only one, so the chunks do not match.
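The chunk-based counterpart, again only a sketch: it extracts (type, start, end) spans from the BIO tags and counts a chunk as correct only when the type and both boundaries match exactly. The repo's compute_f1 follows the official CoNLL eval logic, which this simplified version approximates.

def bio_chunks(tags):
    # Collect (type, start, end) spans; a chunk ends at an O tag, at a new
    # B- tag, or at an I- tag of a different type.
    chunks, ctype, start = [], None, None
    for i, tag in enumerate(list(tags) + ['O']):  # 'O' sentinel flushes the last chunk
        if tag.startswith('B-') or tag == 'O' or (ctype is not None and tag[2:] != ctype):
            if ctype is not None:
                chunks.append((ctype, start, i))
            ctype, start = (tag[2:], i) if tag.startswith('B-') else (None, None)
    return set(chunks)

def chunk_prf(predicted, gold):
    pred, corr = bio_chunks(predicted), bio_chunks(gold)
    tp = len(pred & corr)  # exact span and type matches only
    precision = tp / float(len(pred)) if pred else 0.0
    recall = tp / float(len(corr)) if corr else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(chunk_prf(['B-PER', 'I-PER'], ['B-PER', 'O']))  # (0.0, 0.0, 0.0)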

The token-basis F1 score is useful when you tag really long chunks and missing a token at the start or end is not a big problem.
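A hypothetical long-chunk case, run through the two sketches above, shows why: drop only the final token of a ten-token chunk and the token-basis score barely moves, while the chunk-based score collapses to zero because the span boundaries no longer match.

gold = ['B-LOC'] + ['I-LOC'] * 9
pred = ['B-LOC'] + ['I-LOC'] * 8 + ['O']  # final token of the chunk missed
print(token_basis_prf(pred, gold))  # precision 1.0, recall 0.9
print(chunk_prf(pred, gold))        # 0.0 everywhere: the spans differ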

So compute_f1_token_basis is like the CoNLL 2003 eval script?

No. model.computeF1 is like the CoNLL eval script.

Are CoNLL 2003 and 2000 the same eval script? You said model.computeF1 is like the 2000 script.

CoNLL 2003 reused the script from 2000, so the same eval method was used for both shared tasks.

I see. What about compute_f1_argument and compute_f1_argument_token_basis? I assume they parallel compute_f1 and compute_f1_token_basis, but what is the distinction between the argument and non-argument versions?

I used the _argument F1 functions for event argument extraction, where a sentence can have multiple event triggers and each trigger can have multiple arguments.

In the provided architecture, they are not used and I should remove them.

Thanks!