This project provides the AutoBioNER framework for distantly supervised biomedical named entity recognition (BioNER).
The whole framework consists of two parts: Dictionary Expansion and Neural Model Training.
- Tokenized Raw Texts: e.g., DictExpan/data/bc5/input_text.txt
  - One token per line.
  - An empty line marks the end of a sentence.
- Two Dictionaries
  - Core Dictionary w/ Type Info: e.g., DictExpan/data/bc5/dict_core.txt
    - Two columns (i.e., Type, Tokenized Surface) per line.
    - Tab separated.
    - How to obtain: from domain-specific dictionaries.
  - Full Dictionary w/o Type Info: e.g., DictExpan/data/bc5/dict_full.txt
    - One tokenized high-quality phrase per line.
    - How to obtain: from domain-specific dictionaries and a high-quality phrase mining tool (e.g., AutoPhrase) run on a domain-specific corpus.
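The three input formats described above can be checked with a short Python sketch before running the pipeline. This is only an illustration: the function names (`read_sentences`, `read_core_dict`, `read_full_dict`) are hypothetical and not part of AutoBioNER, and the sample type labels (`Chemical`, `Disease`) are assumed for the example rather than taken from the shipped dictionaries.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def read_sentences(lines: Iterable[str]) -> List[List[str]]:
    """Group one-token-per-line input into sentences; an empty line ends a sentence."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            sentences.append(current)
            current = []
    if current:  # the last sentence may lack a trailing empty line
        sentences.append(current)
    return sentences


def read_core_dict(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Parse 'Type<TAB>Tokenized Surface' lines into {type: {surface, ...}}."""
    by_type: Dict[str, Set[str]] = defaultdict(set)
    for line_no, line in enumerate(lines, 1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        entity_type, sep, surface = line.partition("\t")
        if not sep:
            raise ValueError(f"line {line_no}: expected two tab-separated columns")
        by_type[entity_type.strip()].add(surface.strip())
    return dict(by_type)


def read_full_dict(lines: Iterable[str]) -> Set[str]:
    """Collect one tokenized phrase per line, skipping empty lines."""
    return {line.strip() for line in lines if line.strip()}


if __name__ == "__main__":
    # In-memory samples; replace with open("DictExpan/data/bc5/...") to check real files.
    text = ["Aspirin\n", "reduces\n", "fever\n", ".\n", "\n", "It\n", "helps\n", ".\n"]
    core = ["Chemical\taspirin\n", "Disease\theart attack\n"]  # assumed type labels
    full = ["aspirin\n", "heart attack\n", "blood pressure\n"]
    print(len(read_sentences(text)), "sentences")
    print(sorted(read_core_dict(core)), "types")
    print(len(read_full_dict(full)), "phrases")
```

A line in the core dictionary without a tab raises `ValueError` with its line number, which makes formatting mistakes easy to locate.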
cd DictExpan/
# Download the Stanford CoreNLP Toolkit to src/tools/CoreNLP/
cd src/tools/CoreNLP/stanford-corenlp
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
# Set the corpus name and input files (data, RAW_TEXT, DICT_CORE, DICT_FULL) in run.sh
# Set the corpus name (data) in src/corpusProcessing/corpusProcess.sh
# Set the corpus name (data) in src/dataProcessing/dataProcess.sh
# Set the corpus name (data) in src/SetExpan/set_expan_main.py
./run.sh
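Since the commands above start a CoreNLP server before running run.sh, it can help to confirm that the server is accepting connections first. The following is a minimal sketch, assuming the server listens on localhost at port 9000 (the StanfordCoreNLPServer default when no `-port` option is given); the helper name `server_is_up` is hypothetical, and whether run.sh actually connects to that address is an assumption.

```python
import socket


def server_is_up(host: str = "localhost", port: int = 9000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


if __name__ == "__main__":
    if server_is_up():
        print("CoreNLP server is reachable; run.sh can be started.")
    else:
        print("No server on localhost:9000; start the CoreNLP server first.")
```

This only checks that the port is open, not that the process behind it is CoreNLP or that its models have finished loading.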
Two expanded dictionaries:
- Expanded core dictionary: e.g., DictExpan/data/bc5/dict_core_expand.txt
- Expanded full dictionary: e.g., DictExpan/data/bc5/dict_full_expand.txt
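One quick sanity check on the output is to list the entries that expansion added to each dictionary. The sketch below assumes the expanded files keep the same one-entry-per-line layout as the inputs, which is not stated here; `new_entries` is a hypothetical helper, not part of the framework, and the sample entries are made up.

```python
from typing import Iterable, List


def new_entries(original: Iterable[str], expanded: Iterable[str]) -> List[str]:
    """Return lines present in `expanded` but not in `original`, in first-seen order."""
    seen = {line.strip() for line in original if line.strip()}
    added: List[str] = []
    for line in expanded:
        entry = line.strip()
        if entry and entry not in seen:
            seen.add(entry)  # also drops duplicates within the expanded file
            added.append(entry)
    return added


if __name__ == "__main__":
    # In-memory samples; for real files, pass open file handles, e.g.
    # open("DictExpan/data/bc5/dict_core.txt") and
    # open("DictExpan/data/bc5/dict_core_expand.txt").
    before = ["Chemical\taspirin\n"]
    after = ["Chemical\taspirin\n", "Chemical\tibuprofen\n"]
    for entry in new_entries(before, after):
        print(entry)
```

An empty result would suggest that expansion added nothing for that corpus, which is worth investigating before training.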
After the Dictionary Expansion step, use the tokenized raw corpus (DictExpan/data/bc5/input_text.txt), the expanded core dictionary (DictExpan/data/bc5/dict_core_expand.txt), and the expanded full dictionary (DictExpan/data/bc5/dict_full_expand.txt) as input to AutoNER.
The details of the Neural Model Training can be found in the AutoNER repository.
If you find the implementation useful, please cite the following paper:
@inproceedings{wang2019distantly,
title={Distantly supervised biomedical named entity recognition with dictionary expansion},
author={Wang, Xuan and Zhang, Yu and Li, Qi and Ren, Xiang and Shang, Jingbo and Han, Jiawei},
booktitle={2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)},
pages={496--503},
year={2019},
organization={IEEE}
}