Evaluate function not right
airkid opened this issue · 8 comments
https://github.com/lx865712528/JMEE/blob/494451d5852ba724d273ee6f97602c60a5517446/enet/testing.py#L72
If I add the following line of code just before this line:
assert len(arguments) == len(arguments_)
an AssertionError is raised.
I believe this is because arguments contains the golden arguments, while arguments_ contains only the predicted arguments, whose length changes dynamically during training.
This computes the score incorrectly: if the model predicts a wrong entity before all the correct ones, the predictions are no longer aligned with the gold annotations and the score is 0, as shown in this example:
gold roles are [(3,5,11),(7,9,9)]
predicted roles are [(0,2,2),(3,5,11),(7,9,9)]
first iteration: compare (3,5,11) and (0,2,2) -> fail
second iteration: compare (7,9,9) and (3,5,11) -> fail, even though (3,5,11) is in the gold annotations.
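To make the misalignment concrete, here is a minimal sketch (not the repository's code) that reproduces the positional comparison on the example above. The helper name `positional_matches` is hypothetical, and the tuples are assumed to be (start, end, role), so that index 2 is the role, as in the `item[2] == item_[2]` check.

```python
def positional_matches(gold, pred):
    """Count matches by pairing items position by position and comparing roles.

    This mirrors a zip-based loop; it is an illustration, not the repo's function.
    """
    return sum(1 for g, p in zip(gold, pred) if g[2] == p[2])


# Example data from this thread; assumed layout is (start, end, role).
gold = [(3, 5, 11), (7, 9, 9)]
pred = [(0, 2, 2), (3, 5, 11), (7, 9, 9)]

print(positional_matches(gold, pred))  # 0: no pair lines up by position
```

Note that `zip` also stops at the shorter list, so the third prediction `(7, 9, 9)` is never examined at all.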
Here is a functioning version that also generates a per-class report (it requires tabulate)
Hi @airkid @DorianKodelja, I came to the same conclusion as you. According to the DMCNN paper:
An argument is correctly classified if its event subtype, offsets and argument role match those of any of the reference argument mentions
for item, item_ in zip(arguments, arguments_):
The above code in this repo does not match that idea, so I replaced that line with:
ct += len(set(arguments) & set(arguments_))  # count every predicted argument that is also in the golden set
# for item, item_ in zip(arguments, arguments_):
#     if item[2] == item_[2]:
#         ct += 1
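For reference, here is a minimal sketch of how the set-based count could feed precision, recall and F1 for a single event. The function name `argument_prf` is hypothetical, and the tuples are assumed to be hashable (start, end, role) triples. The repository accumulates counts over the whole dataset rather than per event, so this only illustrates the matching rule.

```python
def argument_prf(gold, pred):
    """Return (correct, precision, recall, f1) using order-independent matching.

    A prediction counts as correct if the identical tuple appears anywhere in gold.
    """
    gold_set, pred_set = set(gold), set(pred)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return correct, precision, recall, f1


# Example data from earlier in this thread; assumed layout is (start, end, role).
gold = [(3, 5, 11), (7, 9, 9)]
pred = [(0, 2, 2), (3, 5, 11), (7, 9, 9)]

correct, precision, recall, f1 = argument_prf(gold, pred)
print(correct, round(precision, 3), round(recall, 3), round(f1, 3))
```

Unlike the role-only check `item[2] == item_[2]`, the set intersection requires the whole tuple to match, offsets included, which is closer to the criterion quoted from the DMCNN paper.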
Hi @mikelkl, I believe this is a correct way to calculate the F1 score for this task.
Have you reproduced the experiment? I can only reach an F1 score below 0.4 on the test data.
Hi @airkid, I got a slightly higher result, but it is on my own randomly split test set, so I have no idea whether it can be compared with the paper's result.
Hi @mikelkl, can you try the data split updated by the author?
My result is still far from the paper's.
Hi @airkid, would you please tell me the result you got? I only got F1 = 0.64 on Trigger Classification.
Hi,
If you've tried their code, would you tell me your reproduced results on trigger detection and argument detection?