WNUT 2020 shared task is designed on the wet lab protocol data. Wet Lab protocols are bascially a collection of steps from different lab procedures. They are noisy, dense, and domain-specific. Automatic or semi-automatic conversion of protocols into machine-readable format benefits medical and biological research. In this task, participants are invited for event recognition and relation extraction over these lab protocols. Note that these protocols are written by researchers and lab technicians from all over the world, some of which may contain non-standard language or spelling errors.
All of the protocols were collected from protocols.io using their public APIs. The full protocol-dump is also available as json format in their github repository. For this shared task, we provide the annotation of 615 protocols. The BRAT styled annotated protocols can be visulalized at: http://bit.ly/WNUT2020. Below is a sample of the input data:
- Jeniya Tabassum (Ohio State University)
- Wei Xu (Ohio State University → Georgia Tech)
- Alan Ritter (Ohio State University → Georgia Tech)
Our Wet Lab Protocol (WLP) dataset consists of 615 unique protocols from the 623 protocols of Kulkarni et al. (2018). It excludes the following 8 duplicated protocols:
- protocol 45 (duplicate of protocol 441)
- protocol 459 (duplicate of protocol 310)
- protocol 464 (duplicate of protocol 46)
- protocol 480 (duplicate of protocol 473)
- protocol 482 (duplicate of protocol 474)
- protocol 483 (duplicate of protocol 475)
- protocol 484 (duplicate of protocol 476)
- protocol 621 (duplicate of protocol 570)
After discarding the duplicate protocols, the remaining 615 unique protocols are re-annoated in brat with 3 annotators with 0.75 inter-annotator agreement, measured by span-level Cohen’s Kappa. Our annotators added the missing entity-relations and also corrected the incosistencies. The updated dataset is provided in the data directory in both StandOff and CoNLL format. The data is divided in 3 sub-directories as below:
- train_data: 370 protocols with 8444 sentences
- dev_data: 122 protocols with 2839 sentences
- test_data: 123 protocols with 2813 sentences
We will release a new test set on August 31, 2020 for the offical evaluation. Please sign up here to receive the data in email. We will use a mailing list for annoucements and discussions.
- Data available: June 8
- Evaluation window: Aug 31 - Sep 4
- System Description Papers Submitted: Sep 18
- Papers Reviewed: Sep 28
- Papers Camera-ready: Oct 8
- Workshop Day: Nov 11
We have provided a feature based Linear CRF tagger for the Named Entity Recognition Task.
The participants are required to produce predictions on the protocols as StandOff format or CoNLL format, which will be compared with the gold data using the evalution script.
.
├── code
│ ├── baseline_CRF
│ │ ├── Conll_Outputs
│ │ ├── Standoff_Outputs
│ │ └── gazetters
│ ├── eval
│ └── scripts
│ ├── convert_conll_to_standoff
│ └── convert_standoff_conll_ner
└── data
├── dev_data
│ ├── Conll_Format
│ └── Standoff_Format
├── test_data
│ ├── Conll_Format
│ └── Standoff_Format
└── train_data
├── Conll_Format
└── Standoff_Format
Paper about the shared task:
@inproceedings{tabassum2020,
author={Tabassum, Jeniya and Lee, Sydney and Xu, Wei and Ritter, Alan },
title={{WNUT-2020 Task 1 Overview: Extracting Entities and Relations from Wet Lab Protocols}},
booktitle={Proceedings of EMNLP 2020 Workshop on Noisy User-generated Text {(WNUT)}},
year = {2020}
}
Paper about the dataset:
@inproceedings{kulkarni2018wetlab,
author = {Kulkarni, Chaitanya and Xu, Wei and Ritter, Alan and Machiraju, Raghu},
title = {An Annotated Corpus for Machine Reading of Instructions in Wet Lab Protocols},
booktitle = {Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL)},
year = {2018}
}