jeniyat/WNUT_2020_NER

Is relation extraction still a part of this Shared Task?

stgrue opened this issue · 7 comments

Hi! I was looking at the released data and code and noticed some things that led me to wonder: Is relation extraction still part of this Shared Task?

  1. There are inconsistencies between the Standoff and the CoNLL data, making combining the data difficult. (A similar issue was already raised in #3.) These are related to tokenization. Two randomly chosen examples:

    • In protocol_7.ann, line 268, the token untransformed actually contains a space at the end, which messes up token boundaries. This seems to be because it is a non-breaking space and happens multiple times in the data.
    • protocol_102_conll.txt, line 149, contains the token FLowCamMake, labeled as B-Device. In contrast, protocol_102.ann correctly identifies that there is actually an entity boundary within this sequence of characters, splitting it up into FLowCam and Make sure.
  2. Unless I am missing something, there does not seem to be an official evaluation script for relations/events. The Standoff-to-CoNLL conversion script also seems to completely ignore relations. The baseline system also seems to only predict entities.

  3. The Readme.md is titled "WNUT 2020: Named Entity Extraction", not mentioning relations.
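For anyone who wants to locate the whitespace problem from point 1 in their own copy of the data, here is a minimal sketch. It assumes the usual brat Standoff layout (ID, then "Type start end", then the text, separated by tabs); the function name and the sample annotations are made up for illustration and are not part of the repository.

```python
def find_whitespace_padded_entities(ann_text):
    """Return (id, label, text) for entity annotations whose text
    starts or ends with whitespace (including non-breaking spaces)."""
    flagged = []
    for line in ann_text.split("\n"):
        # Only text-bound annotations (IDs starting with "T") carry entity text.
        if not line.startswith("T"):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        ann_id, type_and_span, text = parts
        label = type_and_span.split(" ", 1)[0]
        # str.strip() also removes U+00A0, so a mismatch reveals padding.
        if text != text.strip():
            flagged.append((ann_id, label, text))
    return flagged


# Hypothetical sample, not copied from protocol_7.ann.
sample = (
    "T1\tAction 0 3\tAdd\n"
    "T2\tReagent 4 18\tuntransformed\u00a0\n"
    "R1\tActs-on Arg1:T1 Arg2:T2\n"
)
for ann_id, label, text in find_whitespace_padded_entities(sample):
    print(ann_id, label, repr(text))
```

Printing with repr() makes the non-breaking space visible as \xa0, which is otherwise easy to miss in an editor.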

So my question is: Are relations still part of the Shared Task? If so, will an official evaluation script be released for them?

Sorry for the confusion. Jeniya (co-organizer) is working on the evaluation script and a baseline model for relation extraction. We hope to be able to release them in a week.

We plan to use only the Standoff format (not CoNLL) for the relation extraction task, and to hold a separate evaluation period for it (about a week after the named entity task evaluation). We are currently planning to provide the gold named entities to participants for use in the relation extraction evaluation.

  1. There are inconsistencies between the Standoff and the CoNLL data, making combining the data difficult. (A similar issue was already raised in #3.)

I have also observed a similar discrepancy in several files.

e.g. protocol_101 (2nd line)

According to Conll_Format/protocol_101_conll.txt, "to" is the Action, not "Go":

Go	O
to	B-Action
the	O

Whereas according to Standoff_Format/protocol_101.ann:
T3 Action 135 137 Go

Standoff_Format seems to be the correct one.
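To make the expected behaviour concrete, here is a rough sketch of how character spans from the Standoff file could be mapped to BIO tags. This is not the repository's anntoconll_wlp.py; the function name, the whitespace-only tokenization, and the offsets in the example (0 and 2, relative to a made-up three-word string) are assumptions for illustration.

```python
import re


def bio_tags(text, entities):
    """Tag whitespace-separated tokens with BIO labels.

    entities: list of (label, start, end) character spans, end exclusive.
    """
    rows = []
    # \S+ treats non-breaking spaces as whitespace too (Unicode-aware in Python 3).
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for label, start, end in entities:
            # A token is tagged only if it lies fully inside the span.
            if match.start() >= start and match.end() <= end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        rows.append((match.group(), tag))
    return rows


for token, tag in bio_tags("Go to the", [("Action", 0, 2)]):
    print(f"{token}\t{tag}")
```

With these inputs, "Go" receives B-Action and the other two tokens receive O. Note that this sketch does not split a token when an entity boundary falls inside it, so a case like FLowCamMake would need additional handling.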

Thanks for pointing out these issues. I will look into the conversion script and update both the script and the CoNLL files. The CoNLL files are generated from the Standoff format, so please use the Standoff files if you prefer.

@kaushikacharya & @stgrue Please check out the updated anntoconll_wlp.py script and the updated Conll data. We have corrected the empty space issues in the Conll data.

@stgrue, please check this repository for the RE evaluation script: https://github.com/jeniyat/WNUT_2020_RE

@jeniyat
Your updated script has corrected the previously mentioned errors, but it has introduced errors in new places.


train_data/Conll_Format/protocol_101_conll.txt

with your user	B-Reagent
name	I-Reagent

train_data/Standoff_Format/protocol_101.ann
T30 Reagent 89 98 user name

Two issues:

user name

The CoNLL file should have:

user	B-Reagent
name	I-Reagent

Anyway, as you suggested, it is better to use the Standoff format.
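In case it helps with checking the regenerated files, here is a small sketch that flags CoNLL rows whose token column contains whitespace, like the "with your user" row above. It assumes one tab-separated token/tag pair per line; the function name is made up.

```python
def tokens_with_internal_whitespace(conll_text):
    """Return (line_number, token, tag) for rows whose token contains whitespace."""
    flagged = []
    for line_no, line in enumerate(conll_text.split("\n"), start=1):
        if not line.strip():
            continue  # blank lines separate sentences
        # Split on the last tab so spaces inside the token are preserved.
        token, _, tag = line.rpartition("\t")
        if any(ch.isspace() for ch in token):
            flagged.append((line_no, token, tag))
    return flagged


sample_conll = "with your user\tB-Reagent\nname\tI-Reagent\n\nGo\tO\n"
for line_no, token, tag in tokens_with_internal_whitespace(sample_conll):
    print(line_no, repr(token), tag)
```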

@kaushikacharya please check the updated CoNLL conversion script and data if you want to use the CoNLL format.