Pytorch Solution of Event Extraction Task using BERT on ACE 2005 corpus
-
Prepare ACE 2005 dataset.
-
Use nlpcl-lab/ace2005-preprocessing to preprocess the ACE 2005 dataset into the same format as data/sample.json, then place the output in the data directory as follows:
```
├── data
│   └── test.json
│   └── dev.json
│   └── train.json
│   ...
```
This is the layout of the data directory.
Change to the ERE dataset
-
```
├── data
│   └── test.json
│   └── dev.json
│   └── train.json
│   ...
```
`tokens`: the sentence split into a flat list of token strings:

```json
"tokens": ["Con", "respecto", "a", "la", "pregunta", "que", "se", "deben", "estar", "haciendo", "..."]
```
`sentence`: the raw sentence that the `tokens` list corresponds to:

```json
"sentence": "Con respecto a la pregunta que se deben estar haciendo..."
```
`entity_mentions`: entity mentions, i.e. the words in the text that refer to entities. Entity spans are encoded with the BIOES tagging scheme:

- B (Begin): first token of an entity
- I (Intermediate): token inside an entity
- E (End): last token of an entity
- S (Single): single-token entity
- O (Other): token that is not part of any entity

Here PER denotes a person name, LOC a location, and ORG an organization. B-PER/I-PER mark the first/subsequent tokens of a person name, B-LOC/I-LOC the first/subsequent tokens of a location name, B-ORG/I-ORG the first/subsequent tokens of an organization name, and O marks tokens outside any named entity.

```json
[{"id": "c93832992e8ca0020c806137834bdd38-0-42-303", "start": 6, "end": 7, "entity_type": "PER", "mention_type": "PRO", "text": "se"}]
```

Compare with the ACE format:

```json
"golden-entity-mentions": [
  {
    "text": "we",
    "entity-type": "ORG:Media",
    "head": { "text": "we", "start": 2, "end": 3 },
```
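The BIOES scheme above can be sketched as a small tagging function over ERE-style `entity_mentions` (a hypothetical helper; it assumes `start`/`end` are token offsets with `end` exclusive, consistent with the example where tokens[6:7] is "se"):

```python
def bioes_tags(tokens, mentions):
    """Turn entity-mention spans into one BIOES tag per token."""
    tags = ["O"] * len(tokens)
    for m in mentions:
        start, end, etype = m["start"], m["end"], m["entity_type"]
        if end - start == 1:
            tags[start] = f"S-{etype}"          # single-token entity
        else:
            tags[start] = f"B-{etype}"          # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"          # inside tokens
            tags[end - 1] = f"E-{etype}"        # last token
    return tags

tokens = ["Con", "respecto", "a", "la", "pregunta", "que", "se",
          "deben", "estar", "haciendo", "..."]
mentions = [{"start": 6, "end": 7, "entity_type": "PER"}]
# bioes_tags(tokens, mentions)[6] is "S-PER"; every other tag stays "O".
```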
- Install the packages.
```shell
pip install torch==1.0 pytorch_pretrained_bert==0.6.1 numpy
```
```shell
python train.py
python eval.py --model_path=latest_model.pt
```
Method | Trigger P (%) | Trigger R (%) | Trigger F1 (%) | Argument P (%) | Argument R (%) | Argument F1 (%)
---|---|---|---|---|---|---
JRNN | 66.0 | 73.0 | 69.3 | 54.2 | 56.7 | 55.5
JMEE | 76.3 | 71.3 | 73.7 | 66.8 | 54.9 | 60.3
This model (BERT base) | 63.4 | 71.1 | 67.7 | 48.5 | 34.1 | 40.0
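The F1 columns are the harmonic mean of precision and recall; a quick sanity check against the reported JRNN trigger numbers (for other rows, rounding of P and R can shift the last digit slightly):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# round(f1(66.0, 73.0), 1) == 69.3  (JRNN trigger classification)
```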
This model's argument classification performance is low even though a pretrained BERT model is used; the model is currently being updated to improve it.
- Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation (EMNLP 2018), Liu et al. [paper]
- lx865712528's EMNLP2018-JMEE repository [github]
- Kyubyong's bert_ner repository [github]
```python
train(model, train_iter, optimizer, criterion)
```
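A `train()` function with that signature might look like the following framework-agnostic sketch (hypothetical; the actual train.py may compute losses over trigger and argument labels differently):

```python
def train(model, train_iter, optimizer, criterion):
    """One epoch over train_iter, assuming (inputs, labels) batches."""
    model.train()                       # switch to training mode
    for batch in train_iter:
        inputs, labels = batch          # assumed batch layout
        optimizer.zero_grad()           # reset accumulated gradients
        logits = model(inputs)          # forward pass
        loss = criterion(logits, labels)
        loss.backward()                 # backpropagate
        optimizer.step()                # update parameters
```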