Biomedical Corpora

A collection of annotated, freely distributable biomedical corpora for training supervised machine learning methods on tasks in biomedical text mining and information extraction.

All corpora are provided in the corpora directory, which is divided into two subdirectories: NER, for corpora that can be used to train named entity recognition (NER) solutions, and Relation Extraction, for corpora that can be used to train relation/event extraction solutions. Each corpus is provided in both a CoNLL-like format and a Standoff format.
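As a rough illustration of working with the CoNLL-like files, the sketch below reads sentences as lists of (token, tag) pairs. It assumes one token per line, whitespace-separated columns with the token first and the tag last, blank lines between sentences, and BIO-style tags; none of these details are stated above, so check them against the actual files. The function name `read_conll` and the sample text are hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse CoNLL-like lines into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # Assumption: a blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: first column is the token, last column is the tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample, not taken from any corpus in the repository.
sample = """\
Mutations\tO
in\tO
BRCA1\tB-PRGE
cause\tO
breast\tB-DISO
cancer\tI-DISO

Aspirin\tB-CHED
"""
sentences = read_conll(sample.splitlines())
```

To read a real file, pass an open file handle instead of `sample.splitlines()`, since iterating a file yields its lines.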

Most corpora in the CoNLL-like format were originally collected here. In many cases, the tags were mapped to 4-letter codes:

| Old tag | New tag |
| --- | --- |
| Chemical, Simple_chemical | CHED |
| Disease | DISO |
| Organism, Species, NCBITaxon, Taxon | LIVB |
| Cellular_component | COMP |
| Cell, cell_type | CLTP |
| cell_line | CLLN |
| Gene, Protein, Gene_or_gene_product, GGP | PRGE |

Mappings were largely inspired by this API.
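The mapping above can be written as a lookup table when normalizing tags from other sources. The following is a minimal sketch, not the code used to build this repository: the names `TAG_MAP` and `map_tag` are hypothetical, and the handling of `B-`/`I-` prefixes assumes BIO-style tags.

```python
# Old tag -> 4-letter code, transcribed from the table above.
TAG_MAP = {
    "Chemical": "CHED",
    "Simple_chemical": "CHED",
    "Disease": "DISO",
    "Organism": "LIVB",
    "Species": "LIVB",
    "NCBITaxon": "LIVB",
    "Taxon": "LIVB",
    "Cellular_component": "COMP",
    "Cell": "CLTP",
    "cell_type": "CLTP",
    "cell_line": "CLLN",
    "Gene": "PRGE",
    "Protein": "PRGE",
    "Gene_or_gene_product": "PRGE",
    "GGP": "PRGE",
}


def map_tag(tag: str) -> str:
    """Map an old tag to its 4-letter code; unknown tags pass through unchanged."""
    prefix, sep, label = tag.partition("-")
    # Assumption: tags may carry a BIO prefix such as "B-" or "I-".
    if sep and prefix in {"B", "I"}:
        return f"{prefix}-{TAG_MAP.get(label, label)}"
    return TAG_MAP.get(tag, tag)
```

For example, `map_tag("B-Disease")` gives `"B-DISO"`, while `"O"` and any tag not in the table are returned as-is.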

Corpora names (loosely) follow the naming scheme: <corpus_name>_<entity>_<tagset>.
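A name that follows this scheme can be split from the right, which tolerates underscores inside the corpus name itself. This is only a sketch: the helper `parse_corpus_name` and the example names are hypothetical, and because the scheme is followed loosely, names that do not split into three parts return `None`.

```python
from typing import NamedTuple, Optional


class CorpusName(NamedTuple):
    corpus: str
    entity: str
    tagset: str


def parse_corpus_name(name: str) -> Optional[CorpusName]:
    """Split '<corpus_name>_<entity>_<tagset>' into its three parts."""
    # Split from the right so underscores in the corpus name are preserved.
    parts = name.rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        return None  # The name does not follow the scheme.
    return CorpusName(*parts)


# Hypothetical directory names, used only for illustration.
parsed = parse_corpus_name("NCBI_disease_DISO_IOB")
unparsed = parse_corpus_name("S800")
```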

Download

To download the corpora, simply clone the repository locally:

$ git clone https://github.com/BaderLab/Biomedical-Corpora.git

Or click the green Clone or download button and select Download ZIP.

Resources

https://github.com/spyysalo provides many useful repositories for working with these corpora. Many of the most popular corpora have their own repositories (e.g. S800, NCBI-Disease) which contain code for collecting the corpus from its original source and converting it into a format suitable for training a machine learning classifier (e.g. CoNLL or Standoff).

Table of Corpora

A list of various biomedical corpora and their corresponding publications:

| Corpus | Text Genre | Standard | Entities (Count) | Publication |
| --- | --- | --- | --- | --- |
| AnatEM | Scientific Article | Gold | 12 Anatomical entities | link |
| AZDC | Scientific Article | Gold | Disease | link |
| BioCreative II GM | Scientific Article | Gold | Genes/proteins (24,583) | link |
| BioInfer | Scientific Article | Gold | Genes/proteins | link |
| BioSemantics | Patent | Gold | Chemicals, Disease | link |
| BC4CHEMD | Scientific Article | Gold | Chemicals (84,310) | link |
| BC5CDR | Scientific Article | Gold | Chemicals (15,935), Disease (12,852) | link |
| BioNLP09 | Scientific Article | Gold | Genes/proteins (14,963) | link |
| BioNLP11EPI | Scientific Article | Gold | Genes/proteins (15,811) | link |
| BioNLP11ID | Scientific Article | Gold | Genes/proteins (6,551), Organisms (3,471), Chemicals (973), Regulon-operon (87) | link |
| BioNLP13GE | Scientific Article | Gold | Genes/proteins (12,057) | link |
| BioNLP13PC | Scientific Article | Gold | Genes/proteins (10,891), Chemicals (2,487), Complexes (1,502), Cellular component (1,013) | link |
| CRAFT | Scientific Article | Gold | Sequence Ontology (18,974), Genes/proteins (16,064), Taxonomy (6,868), Chemicals of biological interest (6,053), Cell lines (5,495), GO-CC (4,180) | link |
| CellFinder | Scientific Article | Gold | Species, Genes/proteins, Cell type, Anatomy | link |
| CHEMDNER Patent | Patent | Gold | Chemicals | link |
| DECA | Scientific Article | Gold | Genes/proteins | link |
| Ex-PTM | Scientific Article | Gold | Genes/proteins (4,698) | link |
| FSU-PRGE | Scientific Article | Gold | Genes/proteins | link |
| JNLPBA | Scientific Article | Gold | Genes/proteins (35,336), DNA (10,589), Cell type (8,639), Cell line (4,330), RNA (1,069) | link |
| Linnaeus | Scientific Article | Gold | Organisms (4,263) | link |
| LocText | Scientific Article | Gold | Organisms, Genes/proteins | link |
| IEPA | Scientific Article | Gold | Genes/proteins | link |
| miRNA | Scientific Article | Gold | Disease, Organisms, Genes/proteins | link |
| NCBI disease | Scientific Article | Gold | Disease (6,881) | link |
| S800 | Scientific Article | Gold | Organisms (3,708) | link |
| Variome | Scientific Article | Gold | Disease, Organisms, Genes/proteins | link |

Note: some corpora listed in this table are not available for download from this repository because they are not freely distributable.