DeepEventMine: End-to-end Neural Nested Event Extraction from Biomedical Texts

Primary language: Python. License: Apache-2.0.

1. DeepEventMine

A deep learning model to predict named entities, triggers, and nested events from biomedical texts.

  • The model and results are reported in our paper:

DeepEventMine: End-to-end Neural Nested Event Extraction from Biomedical Texts, Bioinformatics, 2020.

1.1. Features

  • Based on pre-trained BERT
  • Predict nested entities and nested events
  • Provide our trained models on the seven biomedical tasks
  • Reproduce the results reported in our Bioinformatics paper
  • Predict for new data given raw text input or PubMed ID
  • Visualize the predicted entities and events with brat

1.2. Tasks

  • DeepEventMine has been trained and evaluated on the following tasks (six BioNLP shared tasks and MLEE).
  1. cg: Cancer Genetics (CG), 2013
  2. ge11: GENIA Event Extraction (GENIA), 2011
  3. ge13: GENIA Event Extraction (GENIA), 2013
  4. id: Infectious Diseases (ID), 2011
  5. epi: Epigenetics and Post-translational Modifications (EPI), 2011
  6. pc: Pathway Curation (PC), 2013
  7. mlee: Multi-Level Event Extraction (MLEE)
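
The task identifiers above are the values passed as [task] to the scripts in the following sections. If you call those scripts from your own Python code, a small lookup table can catch typos early. This is a minimal sketch: the identifiers and names are copied from the list above, while `TASKS` and `check_task` are hypothetical names that are not part of this repository.

```python
# Task identifiers and names copied from the list above.
# TASKS and check_task are hypothetical helpers, not part of DeepEventMine.
TASKS = {
    "cg": "Cancer Genetics (CG), 2013",
    "ge11": "GENIA Event Extraction (GENIA), 2011",
    "ge13": "GENIA Event Extraction (GENIA), 2013",
    "id": "Infectious Diseases (ID), 2011",
    "epi": "Epigenetics and Post-translational Modifications (EPI), 2011",
    "pc": "Pathway Curation (PC), 2013",
    "mlee": "Multi-Level Event Extraction (MLEE)",
}


def check_task(task):
    """Return the normalized task identifier, or raise ValueError if unknown."""
    key = task.strip().lower()
    if key not in TASKS:
        raise ValueError(
            "unknown task {!r}; expected one of {}".format(task, sorted(TASKS))
        )
    return key
```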

1.3. Our trained models and scores

2. Preparation

2.1. Requirements

  • Python 3.6.5
  • PyTorch (torch==1.1.0 torchvision==0.3.0, cuda92)
virtualenv -p python3 pytorch-env
source pytorch-env/bin/activate
export CUDA_VISIBLE_DEVICES=0
CUDA_PATH=/usr/local/cuda pip install torch==1.1.0 torchvision==0.3.0
  • Install Python packages
sh install.sh
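
After installation, it can help to confirm that the active environment matches the versions listed above. The following is a rough sketch rather than an official check: `report_environment` and `base_version` are hypothetical helpers, and the handling of a local build suffix such as "+cu92" is an assumption about how the installed PyTorch reports its version.

```python
import importlib.util
import sys

REQUIRED_PYTHON = (3, 6)   # from "Python 3.6.5" above
REQUIRED_TORCH = "1.1.0"   # from "torch==1.1.0" above


def base_version(version):
    """Drop a local build suffix (for example '+cu92') from a version string."""
    return str(version).split("+", 1)[0]


def report_environment():
    """Return human-readable lines describing the interpreter and PyTorch."""
    lines = []
    found = sys.version_info[:2]
    status = "ok" if found == REQUIRED_PYTHON else "differs from 3.6"
    lines.append("python {}.{}: {}".format(found[0], found[1], status))
    if importlib.util.find_spec("torch") is None:
        lines.append("torch: not installed")
        return lines
    import torch  # imported lazily so the check also runs without PyTorch
    if base_version(torch.__version__) == REQUIRED_TORCH:
        status = "ok"
    else:
        status = "differs from " + REQUIRED_TORCH
    lines.append("torch {}: {}".format(torch.__version__, status))
    lines.append("cuda available: {}".format(torch.cuda.is_available()))
    return lines


if __name__ == "__main__":
    print("\n".join(report_environment()))
```

Run it inside the activated pytorch-env; a "differs" line is only a hint, since other versions are untested here rather than known to fail.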

2.2. BERT

  • Download the SciBERT model (PyTorch version, AllenNLP)
sh download.sh bert

2.3. DeepEventMine

  • Download pre-trained DeepEventMine model on a given task
  • [task] = cg (or pc, ge11, epi, etc)
sh download.sh deepeventmine [task]

2.4. Brat

sh download.sh brat
cd brat/brat-v1.3_Crunchy_Frog/
./install.sh -u
python2 standalone.py

3. Predict (BioNLP tasks)

3.1. Prepare data

  1. Download corpora
  • Download the original data sets from the BioNLP shared tasks.
  • [task] = cg, pc, ge11, etc
sh download.sh bionlp [task]
  2. Preprocess data
  • Tokenize texts and prepare data for prediction
sh preprocess.sh bionlp
  3. Generate configs
  • If using GPU: [gpu] = 0, otherwise: [gpu] = -1
  • [task] = cg, pc, etc
sh run.sh config [task] [gpu]

3.2. Predict

  1. For development and test sets (given gold entities)
  • CG task: [task] = cg
  • PC task: [task] = pc
  • Similarly for: ge11, ge13, epi, id, mlee
sh run.sh predict [task] gold dev
sh run.sh predict [task] gold test
  • Check the output in the path
experiments/[task]/predict-gold-dev/
experiments/[task]/predict-gold-test/
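
To run prediction for several tasks in one go, the two commands above can be generated in a loop. This is a sketch under stated assumptions: the command layout and output paths are taken from this section, while `predict_commands`, `output_dir`, and `run_all` are hypothetical helpers. `run_all` only prints the commands unless `dry_run=False` is passed, and it must be started from the repository root so that run.sh is found.

```python
import subprocess


def predict_commands(tasks, splits=("dev", "test")):
    """Build 'sh run.sh predict [task] gold [split]' for every task and split."""
    return [
        ["sh", "run.sh", "predict", task, "gold", split]
        for task in tasks
        for split in splits
    ]


def output_dir(task, split):
    """Path where this section says the predictions are written."""
    return "experiments/{}/predict-gold-{}/".format(task, split)


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(predict_commands(["cg", "pc"]))
```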

3.3. Evaluate

  1. Retrieve the original offsets and create zip format
sh run.sh offset [task] gold dev
sh run.sh offset [task] gold test
  2. Submit the zipped file to the shared task evaluation sites:
  3. Evaluate events
  • Evaluate event prediction for PC and CG tasks on the development sets using the shared task scripts.
  • Evaluation options: s (soft boundary), p (partial recursive)
sh run.sh eval [task] gold dev sp

4. End-to-end

4.1. Input: a single PMID or PMCID

  • Abstract
sh pubmed.sh e2e pmid 1370299 cg 0
  • Full text
sh pubmed.sh e2e pmcid PMC4353630 cg 0
  • Example output (brat format)
T24	Organism 1248 1254	bovine
T25	Gene_or_gene_product 1255 1259	u-PA
T55	Positive_regulation 1107 1116	increased
T57	Localization 1170 1179	migration
T58	Negative_regulation 1260 1267	blocked
...

T23	Gene_or_gene_product 1184 1188	u-PA
T56	Positive_regulation 1157 1166	increases
E9	Positive_regulation:T56 Theme:T23

T26	Gene_or_gene_product 1320 1325	c-src
T62	Gene_expression 1326 1336	expression
E10	Gene_expression:T62 Theme:T26

T61	Positive_regulation 1310 1319	increased
E24	Positive_regulation:T61 Theme:E10
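
The output above shows how nesting is encoded: E24 takes the event E10 as its Theme, and E10 in turn takes the entity T26. A minimal sketch of reading such lines in Python follows. It assumes tab-separated brat standoff lines with contiguous spans only (as in the example); `parse_ann` and `render` are hypothetical helpers, not part of this repository.

```python
def parse_ann(lines):
    """Parse brat standoff lines into text-bound annotations and events.

    Assumes tab-separated fields and contiguous spans; other line types
    (and blank lines) are skipped.
    """
    spans, events = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("T"):
            ann_id, info, text = line.split("\t", 2)
            label, start, end = info.split(" ")[:3]
            spans[ann_id] = (label, int(start), int(end), text)
        elif line.startswith("E"):
            ann_id, info = line.split("\t", 1)
            trigger, *args = info.split()
            ev_type, trigger_id = trigger.split(":")
            events[ann_id] = {
                "type": ev_type,
                "trigger": trigger_id,
                "args": [tuple(arg.split(":")) for arg in args],
            }
    return spans, events


def render(ann_id, spans, events):
    """Render an entity or a (possibly nested) event as a one-line string."""
    if ann_id in spans:
        label, _, _, text = spans[ann_id]
        return "{}({})".format(label, text)
    event = events[ann_id]
    trigger_text = spans[event["trigger"]][3]
    args = ", ".join(
        "{}={}".format(role, render(target, spans, events))
        for role, target in event["args"]
    )
    return "{}[{}]({})".format(event["type"], trigger_text, args)


SAMPLE = [
    "T26\tGene_or_gene_product 1320 1325\tc-src",
    "T62\tGene_expression 1326 1336\texpression",
    "E10\tGene_expression:T62 Theme:T26",
    "T61\tPositive_regulation 1310 1319\tincreased",
    "E24\tPositive_regulation:T61 Theme:E10",
]

if __name__ == "__main__":
    spans, events = parse_ann(SAMPLE)
    print(render("E24", spans, events))
```

For the sample lines, `render("E24", ...)` unfolds the nested event into Positive_regulation[increased](Theme=Gene_expression[expression](Theme=Gene_or_gene_product(c-src))).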



4.2. Input: a list of PMIDs

  • Given an arbitrary name for your raw text data, for example "my-pubmed"
  • Prepare a list of PMIDs and PMCIDs in the path
data/my-pubmed/pmid.txt
sh pubmed.sh e2e pmids my-pubmed cg 0
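
The ID list can also be written from Python before calling pubmed.sh. Note that the file layout is an assumption: this sketch writes one ID per line, which is not specified above, and `write_id_list` is a hypothetical helper. The example IDs are the ones used elsewhere in this README.

```python
from pathlib import Path


def write_id_list(ids, name="my-pubmed", root="data"):
    """Write IDs to [root]/[name]/pmid.txt, one per line (assumed layout)."""
    path = Path(root) / name / "pmid.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(ids) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    print(write_id_list(["1370299", "PMC4353630"]))
```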

4.3. Input: raw text files

  • Given an arbitrary name for your raw text data, for example "my-pubmed"
  • Prepare your raw text files in the path
data/my-pubmed/text/PMID-*.txt
data/my-pubmed/text/PMC-*.txt
sh pubmed.sh e2e rawtext my-pubmed cg 0
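
If your texts come from another program, they can be saved under the expected path with a small helper. This is only a sketch: the two file name patterns are taken from the paths above, `save_raw_text` is a hypothetical function, and the file name in the example is illustrative.

```python
from fnmatch import fnmatchcase
from pathlib import Path

PATTERNS = ("PMID-*.txt", "PMC-*.txt")  # file name patterns given above


def save_raw_text(filename, text, name="my-pubmed", root="data"):
    """Save text as [root]/[name]/text/[filename] after checking the name."""
    if not any(fnmatchcase(filename, pattern) for pattern in PATTERNS):
        raise ValueError(
            "{!r} does not match any of {}".format(filename, PATTERNS)
        )
    target = Path(root) / name / "text" / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target


if __name__ == "__main__":
    # Illustrative file name and placeholder content.
    print(save_raw_text("PMID-1370299.txt", "Raw abstract text goes here."))
```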

5. Predict for new data (step-by-step)

  • Input: your own raw text or PubMed ID
  • Output: predicted entities and events in brat format

5.1. Raw text

  • Given an arbitrary name for your raw text data, for example "my-pubmed"
  • Prepare your own raw text in the following path
data/my-pubmed/text/PMID-*.txt
data/my-pubmed/text/PMC-*.txt

5.2. PubMed ID

  • Alternatively, you can get the raw text automatically from a PubMed ID or PMC ID

Get raw text

  1. PubMed ID list
  • To get the full text for a PMC ID, the text must be available in ePub (in our current version).
  • Prepare your list of PubMed IDs and PMC IDs in the path
data/my-pubmed/pmid.txt
  • Get text from the PubMed IDs
sh pubmed.sh pmids my-pubmed
  2. PubMed ID
  • You can also get text by directly providing a PubMed ID or PMC ID
sh pubmed.sh pmid 1370299
sh pubmed.sh pmcid PMC4353630
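
The two commands differ only in the subcommand (pmid or pmcid), so a script can pick it from the shape of the ID. This is a minimal sketch, assuming that PubMed IDs are purely numeric and PMC IDs are "PMC" followed by digits, as in the examples above; `fetch_command` is a hypothetical helper.

```python
def fetch_command(identifier):
    """Return the pubmed.sh command for a single PubMed ID or PMC ID."""
    ident = identifier.strip()
    if ident.upper().startswith("PMC") and ident[3:].isdigit():
        return ["sh", "pubmed.sh", "pmcid", ident.upper()]
    if ident.isdigit():
        return ["sh", "pubmed.sh", "pmid", ident]
    raise ValueError("not a PubMed ID or PMC ID: {!r}".format(identifier))


if __name__ == "__main__":
    for example in ("1370299", "PMC4353630"):
        print(" ".join(fetch_command(example)))
```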

Preprocess

sh pubmed.sh preprocess my-pubmed

5.3. Predict

  1. Generate config
  • Generate config for prediction
  • The data name to predict: my-pubmed
  • The trained model used for prediction: cg (or pc, ge11, etc.)
  • If using GPU: [gpu] = 0, otherwise: [gpu] = -1
sh pubmed.sh config my-pubmed cg 0
  2. Predict
sh pubmed.sh predict my-pubmed
  3. Retrieve the original offsets
sh pubmed.sh offset my-pubmed
  • Check the output in
experiments/my-pubmed/results/ev-last/my-pubmed-brat
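
Once prediction has finished, a quick count of entity and event types gives a first impression of the output. The sketch below assumes that the output folder contains brat standoff files with the .ann extension and tab-separated fields, which is not stated above; `summarize` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def summarize(brat_dir):
    """Count text-bound annotations (T lines) and events (E lines) per type.

    Assumes *.ann files in brat standoff format inside brat_dir.
    """
    counts = Counter()
    for ann_file in sorted(Path(brat_dir).glob("*.ann")):
        with ann_file.open(encoding="utf-8") as handle:
            for line in handle:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 2 or fields[0][:1] not in ("T", "E"):
                    continue
                # "Organism 1248 1254" and "Gene_expression:T62 Theme:T26"
                # both reduce to the leading type name.
                label = fields[1].split(" ", 1)[0].split(":", 1)[0]
                kind = "event" if fields[0].startswith("E") else "span"
                counts[(kind, label)] += 1
    return counts


if __name__ == "__main__":
    out_dir = "experiments/my-pubmed/results/ev-last/my-pubmed-brat"
    for (kind, label), n in sorted(summarize(out_dir).items()):
        print(kind, label, n)
```

Because triggers are stored as T lines too, trigger types such as Gene_expression appear under "span" as well as under "event".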

6. Visualization

6.1. Prepare data

  • Copy the predicted data into the brat folder to visualize
  • For the raw text prediction:
sh pubmed.sh brat my-pubmed cg
  • Or for the shared task
sh run.sh brat [task] gold dev
sh run.sh brat [task] gold test

6.2. Visualize

  • The data to visualize is located in
brat/brat-v1.3_Crunchy_Frog/data/my-pubmed-brat
brat/brat-v1.3_Crunchy_Frog/data/[task]-brat

7. Acknowledgements

This work is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO). This work is also supported by PRISM (Public/Private R&D Investment Strategic Expansion PrograM).

8. Citation

@article{10.1093/bioinformatics/btaa540,
    author = {Trieu, Hai-Long and Tran, Thy Thy and Duong, Khoa N A and Nguyen, Anh and Miwa, Makoto and Ananiadou, Sophia},
    title = "{DeepEventMine: End-to-end Neural Nested Event Extraction from Biomedical Texts}",
    journal = {Bioinformatics},
    year = {2020},
    month = {06},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btaa540},
    url = {https://doi.org/10.1093/bioinformatics/btaa540},
    note = {btaa540},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/doi/10.1093/bioinformatics/btaa540/33399046/btaa540.pdf},
}