A collection of annotated, freely distributable, biomedical corpora, which can be used for training supervised machine learning methods for various tasks in biomedical text-mining and information extraction.
All corpora are provided in the corpora directory. They are divided into two subdirectories: NER, for corpora that can be used to train named entity recognition (NER) solutions, and Relation Extraction, for corpora that can be used to train relation/event extraction solutions. Corpora are provided in both a CoNLL-like format and a Standoff format.
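As a rough illustration of working with the CoNLL-like files, the sketch below reads sentences as lists of (token, tag) pairs. It assumes one token per line, with the token in the first whitespace-separated column and the tag in the last, and blank lines between sentences; the exact column layout may differ between corpora, so check a file before relying on it. The function name `read_conll` and the sample lines are made up for this example.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences as lists of (token, tag) pairs.

    Assumption: token in the first column, tag in the last column,
    blank line between sentences.
    """
    sentence: Sentence = []
    for line in lines:
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    # Emit the final sentence when the input does not end with a blank line.
    if sentence:
        yield sentence


# Hypothetical sample in the assumed layout (not taken from a real corpus).
sample = [
    "Mutations\tO\n",
    "in\tO\n",
    "BRCA1\tB-PRGE\n",
    "\n",
    "Aspirin\tB-CHED\n",
]
sentences = list(read_conll(sample))
```

To read from disk, pass an open file handle instead of the list, for example `read_conll(open(path, encoding="utf-8"))`.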
Most corpora in the CoNLL-like format were originally collected here. In many cases, the tags were mapped to 4-letter codes:
Old tag | New tag |
---|---|
Chemical, Simple_chemical | CHED |
Disease | DISO |
Organism, Species, NCBITaxon, Taxon | LIVB |
Cellular_component | COMP |
Cell, cell_type | CLTP |
cell_line | CLLN |
Gene, Protein, Gene_or_gene_product, GGP | PRGE |
Mappings were largely inspired by this API.
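For anyone who wants to apply the same mapping to another corpus, the table can be written as a lookup. This is only a sketch: the names `TAG_MAP` and `map_tag` are invented for this example, and the handling of `B-`/`I-`/`E-`/`S-` prefixes assumes the tags use a BIO-style scheme, which may not hold for every file.

```python
# Old tag -> new 4-letter code, copied from the mapping table.
TAG_MAP = {
    "Chemical": "CHED",
    "Simple_chemical": "CHED",
    "Disease": "DISO",
    "Organism": "LIVB",
    "Species": "LIVB",
    "NCBITaxon": "LIVB",
    "Taxon": "LIVB",
    "Cellular_component": "COMP",
    "Cell": "CLTP",
    "cell_type": "CLTP",
    "cell_line": "CLLN",
    "Gene": "PRGE",
    "Protein": "PRGE",
    "Gene_or_gene_product": "PRGE",
    "GGP": "PRGE",
}


def map_tag(tag: str) -> str:
    """Map e.g. 'B-Chemical' to 'B-CHED'; unknown tags are returned unchanged."""
    prefix, sep, label = tag.partition("-")
    # Assumption: a leading B/I/E/S followed by '-' is a tagging-scheme prefix.
    if sep and prefix in {"B", "I", "E", "S"}:
        return f"{prefix}-{TAG_MAP.get(label, label)}"
    return TAG_MAP.get(tag, tag)
```

Tags that are not in the table, such as `O` or `B-DNA`, pass through untouched, so the function can be run over a whole tag column.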
Corpus names (loosely) follow the naming scheme: <corpus_name>_<entity>_<tagset>.
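A name that follows this scheme can be split into its three parts as sketched below. The helper `parse_corpus_name` and the example names in the comments are hypothetical. Because the scheme is only loosely followed, the function returns `None` when a name does not have three parts. It splits from the right so that underscores inside the corpus name are kept.

```python
from typing import NamedTuple, Optional


class CorpusName(NamedTuple):
    corpus: str
    entity: str
    tagset: str


def parse_corpus_name(name: str) -> Optional[CorpusName]:
    """Split '<corpus_name>_<entity>_<tagset>' into its parts.

    Returns None when the name does not fit the scheme.
    """
    # Split on the last two underscores only, so the corpus name
    # itself may contain underscores.
    parts = name.rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        return None
    return CorpusName(*parts)


# Hypothetical names, for illustration only:
# parse_corpus_name("BC5CDR_chemical_IOB") gives corpus="BC5CDR",
# entity="chemical", tagset="IOB"; parse_corpus_name("AnatEM") gives None.
```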
To download the corpora, simply clone the repository locally:
$ git clone https://github.com/BaderLab/Biomedical-Corpora.git
Or click the green Clone or download button and select Download ZIP.
https://github.com/spyysalo provides many useful repositories for working with these corpora. Many of the most popular corpora have their own repositories (e.g. S800, NCBI-Disease) which contain code for collecting the corpus from its original source and converting it into a format suitable for training a machine learning classifier (e.g. CoNLL or Standoff).
A list of various biomedical corpora and their corresponding publications:
Corpora | Text Genre | Standard | Entities (Count) | Publication |
---|---|---|---|---|
AnatEM | Scientific Article | Gold | 12 Anatomical entities | link |
AZDC | Scientific Article | Gold | Disease | link |
BioCreative II GM | Scientific Article | Gold | Genes/proteins (24,583) | link |
BioInfer | Scientific Article | Gold | Genes/proteins | link |
BioSemantics | Patent | Gold | Chemicals, Disease | link |
BC4CHEMD | Scientific Article | Gold | Chemicals (84,310) | link |
BC5CDR | Scientific Article | Gold | Chemicals (15,935), Disease (12,852) | link |
BioNLP09 | Scientific Article | Gold | Genes/proteins (14,963) | link |
BioNLP11EPI | Scientific Article | Gold | Genes/proteins (15,811) | link |
BioNLP11ID | Scientific Article | Gold | Genes/proteins (6551), Organisms (3471), Chemicals (973), Regulon-operon (87) | link |
BioNLP13GE | Scientific Article | Gold | Genes/proteins (12,057) | link |
BioNLP13PC | Scientific Article | Gold | Genes/proteins (10,891), Chemicals (2487), Complexes (1502), Cellular component (1013) | link |
CRAFT | Scientific Article | Gold | Sequence Ontology (18,974), Genes/proteins (16,064), Taxonomy (6868), Chemicals of biological interest (6053), Cell lines (5495), GO-CC (4180) | link |
CellFinder | Scientific Article | Gold | Species, Genes/proteins, Cell type, Anatomy | link |
CHEMDNER Patent | Patent | Gold | Chemicals | link |
DECA | Scientific Article | Gold | Genes/proteins | link |
Ex-PTM | Scientific Article | Gold | Genes/proteins (4698) | link |
FSU-PRGE | Scientific Article | Gold | Genes/proteins | link |
JNLPBA | Scientific Article | Gold | Genes/proteins (35,336), DNA (10,589), Cell type (8639), Cell line (4330), RNA (1069) | link |
Linnaeus | Scientific Article | Gold | Organisms (4263) | link |
LocText | Scientific Article | Gold | Organisms, Genes/proteins | link |
IEPA | Scientific Article | Gold | Genes/proteins | link |
miRNA | Scientific Article | Gold | Disease, Organisms, Genes/proteins | link |
NCBI disease | Scientific Article | Gold | Disease (6881) | link |
S800 | Scientific Article | Gold | Organisms (3708) | link |
Variome | Scientific Article | Gold | Disease, Organisms, Genes/proteins | link |
Note: some corpora listed in this table are not available for download from this repository because they are not freely distributable.