lx865712528/EMNLP2018-JMEE

Evaluate function not right

airkid opened this issue · 8 comments

https://github.com/lx865712528/JMEE/blob/494451d5852ba724d273ee6f97602c60a5517446/enet/testing.py#L72
If I add the following line of code just before this line:
assert len(arguments) == len(arguments_)
an AssertionError is raised.
I believe this is because arguments holds the golden arguments while arguments_ holds only the predicted arguments, whose number changes dynamically during training.
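A minimal sketch of why the length mismatch goes unnoticed without the assert: Python's zip stops at the shorter input, so surplus gold or predicted arguments are never compared. The tuple values below are made up for illustration and are not taken from the repository.

```python
from itertools import zip_longest

# Hypothetical data: three gold arguments, one predicted argument.
arguments = [(3, 5, 11), (7, 9, 9), (12, 13, 4)]  # gold
arguments_ = [(3, 5, 11)]                         # predicted

# zip silently truncates to the shorter list, so only one pair is compared.
pairs = list(zip(arguments, arguments_))
print(len(pairs))

# zip_longest pads the shorter list with None, exposing what zip dropped.
unmatched_gold = [g for g, p in zip_longest(arguments, arguments_) if p is None]
print(unmatched_gold)
```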

This computes the score incorrectly: if the model predicts a wrong entity before all the correct ones, the predictions are no longer aligned with the gold annotations and the score is 0, as shown in this example:
gold roles are [(3,5,11),(7,9,9)]
predicted roles are [(0,2,2),(3,5,11),(7,9,9)]
first iteration: compare (3,5,11) and (0,2,2) -> fail
second iteration: compare (7,9,9) and (3,5,11) -> fail, even though (3,5,11) is in the gold annotations.
Here is a functioning version that also generates a per-class report (it requires tabulate):

calculate_sets_1.txt
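The example above can be reproduced with a short sketch. It is not the attached file; the function names are hypothetical, and it assumes each argument is a hashable (start, end, role) tuple, with the positional variant comparing only the third field as the loop in the repository does.

```python
gold = [(3, 5, 11), (7, 9, 9)]
pred = [(0, 2, 2), (3, 5, 11), (7, 9, 9)]

def count_positional(gold_args, pred_args):
    """Compare the i-th gold argument with the i-th prediction only."""
    return sum(1 for g, p in zip(gold_args, pred_args) if g[2] == p[2])

def count_any_match(gold_args, pred_args):
    """Count predictions that equal any gold argument, regardless of order."""
    return len(set(gold_args) & set(pred_args))

print(count_positional(gold, pred))  # 0: one spurious prediction shifts every pair
print(count_any_match(gold, pred))   # 2: both gold arguments were in fact predicted
```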

Hi @airkid @DorianKodelja, I reached the same conclusion as you. According to the DMCNN paper:

An argument is correctly classified if its event subtype, offsets and argument role match those of any of the reference argument mentions

for item, item_ in zip(arguments, arguments_): 

The code above in this repo does not match that idea, so I replaced that line with:

ct += len(set(arguments) & set(arguments_))  # count any argument in golden
# for item, item_ in zip(arguments, arguments_):
#     if item[2] == item_[2]:
#         ct += 1
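For context, here is a rough sketch of how a set-based count like this could feed into precision, recall and F1. The function name argument_prf and the per-sentence list layout are assumptions for illustration, not the repository's API. Note that sets collapse duplicate tuples, and that each tuple must carry every field that has to match (the DMCNN criterion also includes the event subtype).

```python
def argument_prf(gold_batches, pred_batches):
    """Micro-averaged precision, recall and F1 over per-sentence argument lists."""
    ct = n_gold = n_pred = 0
    for gold, pred in zip(gold_batches, pred_batches):
        gold_set, pred_set = set(gold), set(pred)
        ct += len(gold_set & pred_set)  # predictions matching any gold argument
        n_gold += len(gold_set)
        n_pred += len(pred_set)
    precision = ct / n_pred if n_pred else 0.0
    recall = ct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# One sentence, reusing the example from earlier in this thread.
p, r, f1 = argument_prf(
    [[(3, 5, 11), (7, 9, 9)]],
    [[(0, 2, 2), (3, 5, 11), (7, 9, 9)]],
)
print(p, r, f1)
```

Here the outer zip pairs sentences, which is safe as long as both lists have one entry per sentence; only the arguments inside each sentence are compared as sets.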

Hi @mikelkl, I believe this is a correct way to calculate the F1 score for this task.
Have you reproduced the experiment? I can only reach an F1 score < 0.4 on the test data.

Hi @airkid, I got a slightly higher result, but it's on my own randomly split test set, so I have no idea whether it effectively represents the paper's result.

Hi @mikelkl, can you try the data split updated by the author?
My result is still far from the paper's.

Hi @airkid, I'm afraid I can't do that because I don't have the ACE2005 English data.

Hi @airkid, would you please tell me the result you got? I got only F1 = 0.64 in trigger classification.

Hi,

If you've tried their code, would you tell me your reproduced results on trigger detection and argument detection?