Is relation extraction still a part of this Shared Task?
stgrue opened this issue · 7 comments
Hi! I was looking at the released data and code and noticed some things that led me to wonder: Is relation extraction still part of this Shared Task?
- There are inconsistencies between the Standoff and the CoNLL data, making combining the data difficult. (A similar issue was already raised in #3.) These are related to tokenization. Two randomly chosen examples:
  - In protocol_7.ann, line 268, the token "untransformed" actually contains a space at the end, which messes up token boundaries. This seems to be because it is a non-breaking space, and it happens multiple times in the data.
  - protocol_102_conll.txt, line 149, contains the token "FLowCamMake", labeled as B-Device. In contrast, protocol_102.ann correctly identifies that there is actually an entity boundary within this sequence of characters, splitting it up into "FLowCam" and "Make sure".
- Unless I am missing something, there does not seem to be an official evaluation script for relations/events. The Standoff-to-CoNLL conversion script also seems to completely ignore relations. The baseline system also seems to only predict entities.
- The Readme.md is titled "WNUT 2020: Named Entity Extraction", not mentioning relations.
So my question is: Are relations still part of the Shared Task? If so, will an official evaluation script be released for them?
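For anyone who wants to locate the trailing-space entities mentioned above, here is a minimal sketch. It assumes the .ann files follow the usual brat layout for text-bound annotations (ID, tab, type with start and end offsets, tab, text). The function name `find_padded_entities`, the sample lines, and the Reagent label on the first sample are made up for illustration and are not taken from the released data.

```python
import re

# Brat-style text-bound annotation: ID <TAB> Type Start End <TAB> Text
# (discontinuous spans written with ';' are not handled in this sketch)
ENTITY_LINE = re.compile(r"^(T\d+)\t(\S+) (\d+) (\d+)\t(.*)$")


def find_padded_entities(ann_lines):
    """Return (line_no, entity_id, repr(text)) for entities with edge whitespace."""
    hits = []
    for line_no, line in enumerate(ann_lines, start=1):
        # Strip only the newline so that a trailing non-breaking space survives
        match = ENTITY_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # relation, event, or comment lines are skipped
        entity_id, _label, _start, _end, text = match.groups()
        # str.strip() without arguments also removes U+00A0 (non-breaking space)
        if text != text.strip():
            hits.append((line_no, entity_id, repr(text)))
    return hits


# Hypothetical sample lines, not copied from protocol_7.ann
sample = [
    "T1\tReagent 10 24\tuntransformed\u00a0\n",
    "T2\tAction 30 32\tGo\n",
]
hits = find_padded_entities(sample)
print(hits)
```

Printing `repr(text)` makes the otherwise invisible non-breaking space show up in the report.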
Sorry for the confusion. Jeniya (co-organizer) is working on the evaluation script and a baseline model for relation extraction. We hope to be able to release them in a week.
We plan to use only the Standoff format (not CoNLL) for the relation extraction task, and to hold a separate evaluation period for it (about a week after the named entity task evaluation). We are currently considering providing the gold named entities to participants for use in the relation extraction evaluation.
- There are inconsistencies between the Standoff and the CoNLL data, making combining the data difficult. (A similar issue was already raised in #3.)
I have also observed a similar discrepancy in several files.
In Conll_Format/protocol_101_conll.txt, "to" is tagged as the Action, not "Go":
Go O
to B-Action
the O
In contrast, Standoff_Format/protocol_101.ann has:
T3 Action 135 137 Go
Standoff_Format seems to be the correct one.
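To cross-check cases like this, one option is to derive BIO tags directly from the standoff spans and compare them with the CoNLL file. The following is only a rough sketch under stated assumptions: it tokenizes on whitespace (the official script may tokenize differently), treats the end offset as exclusive as in brat standoff, and uses an invented sentence rather than the real protocol text. `spans_to_bio` is a hypothetical helper, not part of the released code.

```python
import re


def spans_to_bio(text, entities):
    """Tag whitespace-delimited tokens with BIO labels from character spans.

    entities: list of (label, start, end) tuples, end exclusive.
    """
    rows = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for label, start, end in entities:
            # A token is assigned to an entity only if it lies fully inside the span
            if start <= match.start() and match.end() <= end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        rows.append((match.group(), tag))
    return rows


# Invented text; in protocol_101.ann the real offsets for "Go" are 135-137
text = "Go to the bench"
entities = [("Action", 0, 2)]
for token, tag in spans_to_bio(text, entities):
    print(f"{token}\t{tag}")
```

A token that straddles a span boundary (such as the fused "FLowCamMake" case reported earlier in this thread) is left as O by this sketch, which makes such tokenization problems easy to spot when the counts of tagged entities differ.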
Thanks for pointing out these issues. I will look into the conversion script and update the script and the Conll files. The Conll files are generated from the standoff format, so please use the standoff format files if you prefer.
@kaushikacharya & @stgrue Please check out the updated anntoconll_wlp.py script and the updated Conll data. We have corrected the empty space issues in the Conll data.
@stgrue, please check this repository for the RE evaluation script: https://github.com/jeniyat/WNUT_2020_RE
@jeniyat Your updated script has corrected the previously mentioned errors, but it has introduced errors in new places.
train_data/Conll_Format/protocol_101_conll.txt
with your user B-Reagent
name I-Reagent
train_data/Standoff_Format/protocol_101.ann
T30 Reagent 89 98 user name
Two issues:
- As per David McClosky's answer (username: dmcc) at https://stackoverflow.com/questions/27416164/what-is-conll-data-format, "Each line represents a single word with a series of tab-separated fields."
- As the Reagent entity is "user name", the conll file should have
  user B-Reagent
  name I-Reagent
Anyway, as suggested by you, better to use standoff format.
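A quick way to find lines like the one above is to validate each CoNLL line against the one-word-per-line layout. This is a minimal sketch that assumes two tab-separated columns (token, then BIO tag) and blank lines between sentences. The actual column layout of the released files may differ, and `check_conll_lines` is a hypothetical helper.

```python
def check_conll_lines(lines):
    """Yield (line_no, problem) for lines that break the one-token-per-line layout."""
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank lines separate sentences
        fields = line.split("\t")
        if len(fields) != 2:
            yield line_no, f"expected 2 tab-separated fields, got {len(fields)}"
            continue
        token, tag = fields
        # isspace() is also True for non-breaking spaces
        if any(ch.isspace() for ch in token):
            yield line_no, f"token contains whitespace: {token!r}"
        if tag != "O" and not tag.startswith(("B-", "I-")):
            yield line_no, f"unexpected tag: {tag!r}"


# Sample modelled on the lines quoted above, assuming a tab before the tag
sample = ["with your user\tB-Reagent\n", "name\tI-Reagent\n", "\n", "Go\tO\n"]
problems = list(check_conll_lines(sample))
for line_no, problem in problems:
    print(line_no, problem)
```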
@kaushikacharya please check the updated conll conversion script and data if you want to use the conll format.