Ab3P (Abbreviation Plus Pseudo-Precision) |
2008 |
BioC |
1250 PubMed abstracts |
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2576267/ |
http://bioc.sourceforge.net/ |
|
AIMed |
2005 |
BioC |
~ 1000 MEDLINE abstracts (200 abstracts) |
http://www.sciencedirect.com/science/article/pii/S0933365704001319 |
http://corpora.informatik.hu-berlin.de/ |
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.3218&rep=rep1&type=pdf |
AnatEM (Anatomical entity mention recognition) |
2013 |
CONLL, standoff |
1212 docs (500 docs from AnEM + 262 from MLEE + 450 others) |
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3957068/ |
http://nactem.ac.uk/anatomytagger/#AnatEM |
|
AnEM |
2012 |
BioC |
500 docs (PubMed and PMC); abstracts and full text drawn randomly |
http://www.nactem.ac.uk/anatomy/docs/ohta2012opendomain.pdf |
http://corpora.informatik.hu-berlin.de/ |
|
AZDC (Arizona Disease Corpus) |
2009 |
IeXML, .txt |
2856 PubMed abstracts (2775 sentences); another source reports 794 PubMed abstracts |
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2352871/ |
http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/IeXML/goldcorpus/azdc-1.xml |
http://diego.asu.edu/downloads/AZDC_6-26-2009.txt |
BEL (BioCreative V5 BEL Track) |
2016 |
BioC |
|
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4995071/ |
https://wiki.openbel.org/display/BIOC/Datasets |
|
BioADI |
2009 |
BioC |
1201 PubMed abstracts |
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788358/ |
http://bioc.sourceforge.net/ |
|
BioCause |
2013 |
standoff |
19 full-text documents |
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-2 |
http://www.nactem.ac.uk/biocause/download.php |
|
BioCreative-PPI |
|
XML |
|
|
https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html |
|
BioGRID |
2017 |
BioC |
120 full text articles |
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5225395/ |
http://bioc.sourceforge.net/BioC-BioGRID.html |
|
BioInfer |
2007 |
BioC |
1100 sentences from biomedical literature |
http://www.biomedcentral.com/1471-2105/8/50 |
http://corpora.informatik.hu-berlin.de/ |
http://mars.cs.utu.fi/BioInfer |
BioMedLAT |
2016 |
standoff |
643 BioASQ questions/factoids |
https://www.semanticscholar.org/paper/BioMedLAT-Corpus-Annotation-of-the-Lexical-Answer-Neves-Kraus/b0f09f94015771c31bd2483efdd8f0f86996384e |
https://github.com/mariananeves/BioMedLAT |
|
BioText |
2004 |
txt |
100 titles and 40 abstracts |
http://biotext.berkeley.edu/papers/acl04-relations.pdf |
https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html |
|
CDR (BioCreative V) |
|
BioC |
|
|
http://bioc.sourceforge.net/ |
|
CellFinder 1.0 |
2012 |
BioC |
10 full-text PMC documents from (Löser et al. 2009) on "Human Embryonic Stem Cell Lines and Their Use in International Research" |
http://www.nactem.ac.uk/biotxtm2012/presentations/Neves-pres.pdf |
http://corpora.informatik.hu-berlin.de/ |
http://cellfinder.de/about/annotation/ |
CG Cancer-Genetics (BioNLP-ST 2013) |
2013 |
BioC, standoff |
|
http://aclweb.org/anthology/W/W13/W13-2008.pdf |
http://2013.bionlp-st.org/tasks/cancer-genetics |
|
CHEMDNER (BioCreative IV Track 2) |
2013 |
BioC / standoff |
|
http://www.biocreative.org/media/store/files/2013/bc4_v2_1.pdf |
http://www.biocreative.org/tasks/biocreative-iv/chemdner/ |
|
Chemical Patent Corpus |
2014 |
standoff |
200 patents |
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0107477 |
http://biosemantics.org/index.php/resources/chemical-patent-corpus |
|
CoMAGC |
2013 |
XML |
821 sentences on prostate, breast and ovarian cancer |
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-323 |
http://biopathway.org/CoMAGC/ |
|
CRAFT |
2012 |
|
97 full OA biomedical articles |
|
http://bionlp-corpora.sourceforge.net/CRAFT/ |
|
Craven (Wisconsin corpus) |
1999 |
other |
1,529,731 sentences (automated) |
https://www.biostat.wisc.edu/~craven/ie/ReadMe |
https://www.biostat.wisc.edu/~craven/ie/ |
|
CTD (BioCreative IV Track 3) |
|
BioC |
|
|
http://www.biocreative.org/tasks/biocreative-iv/track-3-CTD/ |
|
DDICorpus |
2011, 2013 |
BioC |
792 texts from DrugBank and 233 Medline abstracts |
https://www.ncbi.nlm.nih.gov/pubmed/23906817 |
http://bioc.sourceforge.net/ http://corpora.informatik.hu-berlin.de/ |
http://labda.inf.uc3m.es/ddicorpus |
DIP-PPI (Database of Interacting Proteins) |
|
other |
Only proteins from yeast. |
|
https://www2.informatik.hu-berlin.de/~hakenber/corpora/ |
|
EBI:diseases |
2008 |
other |
856 sentences from 624 abstracts |
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-S3-S3 |
https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html |
ftp://ftp.ebi.ac.uk/pub/software/textmining/corpora/diseases |
eFIP |
2012, 2015 |
xlsx |
|
https://www.ncbi.nlm.nih.gov/pubmed/23221174 https://www.ncbi.nlm.nih.gov/pubmed/25833953 |
http://research.bioinformatics.udel.edu/iprolink/corpora.php |
|
EMU (Extractor of Mutations) |
2011 |
other |
|
https://www.ncbi.nlm.nih.gov/pubmed/21138947 |
http://bioinf.umbc.edu/EMU/ftp/ |
|
EU-ADR |
2012 |
other |
300 PubMed abstracts (drug-disorder, drug-target, gene-disorder, SNP-disorder) |
http://www.sciencedirect.com/science/article/pii/S1532046412000573 |
http://biosemantics.org/index.php/resources/euadr-corpus |
|
Exhaustive PTM (BioNLP 2011) |
|
|
|
http://dl.acm.org/citation.cfm?id=2002902.2002920 |
https://github.com/dterg/exhaustive-ptm |
|
FlySlip |
2007 |
CONLL |
82 abstracts, 5 full papers |
https://www.ncbi.nlm.nih.gov/pubmed/17990496 |
http://compbio.ucdenver.edu/ccp/corpora/obtaining.shtml |
http://www.wiki.cl.cam.ac.uk/rowiki/NaturalLanguage/FlySlip/Flyslip-resources |
FSU-PRGE |
2010 |
IeXML |
3236 MEDLINE abstracts (35,519 sentences) |
http://aclweb.org/anthology/W/W10/W10-1838.pdf |
http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/corpora.html |
|
GAD |
2015 |
csv |
|
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0472-9 |
http://ibi.imim.es/research-lines/biomedical-text-mining/corpora/ |
|
GeneReg |
2010 |
BioC |
314 abstracts |
http://www.lrec-conf.org/proceedings/lrec2010/pdf/407_Paper.pdf |
http://corpora.informatik.hu-berlin.de/ |
http://www.julielab.de/Resources/GeneReg.html |
GeneTag (BioCreative II Gene Mention) |
2005 |
BioC |
20,000 MEDLINE sentences |
https://www.ncbi.nlm.nih.gov/pubmed/15960837 |
https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html http://bioc.sourceforge.net/ |
|
GENIA (BioNLP Shared Task 2009) |
|
|
|
|
http://www.nactem.ac.uk/tsujii/GENIA/SharedTask/detail.shtml#downloads |
|
GENIA (BioNLP Shared Task 2011) |
|
BioC, standoff |
|
|
https://sites.google.com/site/bionlpst/home/epigenetics-and-post-translational-modifications http://2011.bionlp-st.org |
http://corpora.informatik.hu-berlin.de/ |
GENIA (term annotation) |
2003 |
BioC, XML |
|
|
http://corpora.informatik.hu-berlin.de/ |
http://www.nactem.ac.uk/aNT/genia.html |
GETM |
2010 |
BioC, standoff |
|
http://dl.acm.org/citation.cfm?id=1869970 |
http://corpora.informatik.hu-berlin.de/ |
http://getm-project.sourceforge.net/ |
GREC (Gene Regulation Event Corpus) |
2009 |
BioC, standoff, XML |
240 MEDLINE abstracts (167 on E. coli and 73 on human) |
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-349 |
http://corpora.informatik.hu-berlin.de/ |
http://www.nactem.ac.uk/GREC/ |
HIMERA |
2016 |
standoff |
|
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144717 |
http://www.nactem.ac.uk/himera/ |
|
HPRD50 (Human Protein Reference Database) |
2004 |
BioC |
50 abstracts |
https://www.ncbi.nlm.nih.gov/pubmed/14681466 |
http://corpora.informatik.hu-berlin.de/ |
http://www2.bio.ifi.lmu.de/publications/RelEx/ |
IDP4+ |
2017 |
anndoc |
860 abstracts/full-texts |
https://academic.oup.com/bioinformatics/article/33/12/1852/2991428 |
https://www.tagtog.net/-corpora/IDP4+ |
|
IEPA |
2002 |
BioC |
slightly over 300 MEDLINE abstracts |
https://www.ncbi.nlm.nih.gov/pubmed/11928487 |
http://corpora.informatik.hu-berlin.de/ |
http://orbit.nlm.nih.gov/resource/iepa-corpus |
iHOP |
2004 |
other |
~ 160 sentences |
https://www.ncbi.nlm.nih.gov/pubmed/15226743 |
http://www.ihop-net.org/UniPub/iHOP/info/gene_index/manual/1.html |
|
iProLINK / RLIMS |
2004 |
other, XML, BioC |
|
https://www.ncbi.nlm.nih.gov/pubmed/15556482 |
http://research.bioinformatics.udel.edu/iprolink/corpora.php |
|
iSimp |
2014 |
BioC |
130 MEDLINE abstracts (1199 sentences) |
https://www.ncbi.nlm.nih.gov/pubmed/24850848 |
http://research.bioinformatics.udel.edu/isimp/corpus.html |
|
Linnaeus |
2010 |
standoff |
|
http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-85 |
https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html |
http://linnaeus.sourceforge.net/ |
LLL (Learning Language in Logic) |
2005 |
BioC |
|
https://www.cs.york.ac.uk/aig/lll/lll05/lll05-nedellec.pdf |
http://corpora.informatik.hu-berlin.de/ |
http://genome.jouy.inra.fr/texte/LLLchallenge/ |
MEDSTRACT |
|
BioC |
199 PubMed citations |
https://www.ncbi.nlm.nih.gov/pubmed/11604766 |
http://bioc.sourceforge.net/ |
|
MedTag |
2005 |
other |
|
https://www.researchgate.net/publication/234785358_MedTag_a_collection_of_biomedical_annotations |
ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedTag/medtag.tar.gz https://sourceforge.net/projects/medtag/ |
|
Metabolite and Enzyme |
2011 |
BioC, XML |
296 abstracts |
http://link.springer.com/article/10.1007%2Fs11306-010-0251-6 |
http://www.nactem.ac.uk/metabolite-corpus/ |
http://argo.nactem.ac.uk/bioc/ |
miRTex |
2015 |
BioC, standoff |
350 abstracts (200 development, 150 test) |
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004391 |
http://research.bioinformatics.udel.edu/iprolink/corpora.php |
|
MLEE |
2012 |
CONLL, standoff |
262 PubMed abstracts on molecular mechanisms of cancer (specifically relating to angiogenesis) |
https://academic.oup.com/bioinformatics/article/28/18/i575/249872/Event-extraction-across-multiple-levels-of |
http://nactem.ac.uk/MLEE/ |
|
mTOR pathway event corpus (BioNLP 2011) |
2011 |
standoff |
|
http://dl.acm.org/citation.cfm?id=2002919 |
https://github.com/dterg/mtor-pathway/tree/master/original-data |
|
MutationFinder |
2007 |
other |
305 abstracts (development set), 508 abstracts (test set) |
https://www.ncbi.nlm.nih.gov/pubmed/17495998 |
http://mutationfinder.sourceforge.net/ |
https://github.com/rockt/SETH |
Nagel |
|
XML, standoff |
|
|
http://sourceforge.net/projects/bionlp-corpora/files/ProteinResidue/ |
|
NCBI Disease |
2012 |
other |
6881 sentences in 793 PubMed abstracts |
https://www.ncbi.nlm.nih.gov/pubmed/24393765 |
http://www.ncbi.nlm.nih.gov/CBBresearch/Fellows/Dogan/disease.html |
|
OMM (Open Mutation Miner) |
2012 |
other |
40 full texts |
https://www.ncbi.nlm.nih.gov/pubmed/22759648 |
http://www.semanticsoftware.info/open-mutation-miner |
|
OSIRIS |
2008 |
BioC, XML, standoff |
105 articles |
https://www.ncbi.nlm.nih.gov/pubmed/18251998 |
http://corpora.informatik.hu-berlin.de/ |
https://sites.google.com/site/laurafurlongweb/databases-and-tools/corpora |
PC (Pathway Curation) (BioNLP-ST 2013) |
2013 |
BioC |
|
|
http://argo.nactem.ac.uk/bioc/ |
http://2013.bionlp-st.org/tasks/pathway-curation |
PennBioIE-oncology |
2004 |
IeXML |
1414 PubMed abstracts on cancer |
http://www.aclweb.org/anthology/W04-3111 |
http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/corpora.html |
|
pGenN (Plant-GN) |
2015 |
BioC |
104 MEDLINE abstracts |
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0135305 |
http://research.bioinformatics.udel.edu/iprolink/corpora.php |
|
PICAD |
2011 |
XML |
1037 sentences from PubMed |
http://dl.acm.org/citation.cfm?doid=2147805.2147853 |
http://ani.stat.fsu.edu/~jinfeng/resources/PICAD.txt |
http://corpora.informatik.hu-berlin.de/ |
PolySearch (includes v1 and v2) |
|
other |
|
https://www.ncbi.nlm.nih.gov/pubmed/25925572 |
http://polysearch.cs.ualberta.ca/downloads |
|
ProteinResidue |
|
other |
|
|
http://bionlp-corpora.sourceforge.net/ |
|
SCAI_Klinger |
2008 |
CONLL |
|
https://academic.oup.com/bioinformatics/article/24/13/i268/235854/Detection-of-IUPAC-and-IUPAC-like-chemical-names |
https://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/downloads/corpora-for-chemical-entity-recognition.html |
|
SCAI_Kolarik |
2008 |
CONLL |
|
http://www.lrec-conf.org/proceedings/lrec2008/workshops/W4_Proceedings.pdf#page=55 |
https://www.scai.fraunhofer.de/en/business-research-areas/bioinformatics/downloads/corpora-for-chemical-entity-recognition.html |
|
SETH |
2016 |
standoff |
630 publications from The American Journal of Human Genetics and Human Mutation |
https://www.ncbi.nlm.nih.gov/pubmed/?term=27256315 |
https://github.com/rockt/SETH/tree/master/resources/SETH-corpus |
|
SH (Schwartz and Hearst) |
2003 |
BioC |
1000 PubMed abstracts |
https://www.ncbi.nlm.nih.gov/pubmed/12603049 |
http://bioc.sourceforge.net/ |
|
SNPCorpus |
2011 |
BioC |
296 MEDLINE abstracts |
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3194196/ |
http://corpora.informatik.hu-berlin.de/ |
http://www.scai.fraunhofer.de/snp-normalization-corpus.html |
Species |
2013 |
standoff |
800 PubMed abstracts |
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0065390 |
http://species.jensenlab.org/ |
http://species.jensenlab.org/ |
T4SS (Type 4 Secretion System) |
2011 |
CONLL |
|
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014780 |
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0014780 |
|
T4SS Event Extraction (BioNLP 2010) |
2010 |
other |
|
http://dl.acm.org/citation.cfm?id=1869961.1869980 |
https://github.com/dterg/t4ss-event |
|
tmVar |
2013 |
BioC |
500 PubMed abstracts |
https://www.ncbi.nlm.nih.gov/pubmed/23564842 |
https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#tmVar |
https://github.com/rockt/SETH |
VariomeCorpus (hvp) |
2013 |
BioC |
|
https://www.ncbi.nlm.nih.gov/pubmed/23584833 |
http://corpora.informatik.hu-berlin.de/ |
http://www.opennicta.com/home/health/variome |
Yapex |
2002 |
other |
99 training, 101 test MEDLINE abstracts |
https://www.ncbi.nlm.nih.gov/pubmed/12460631 |
http://www.rostlab.org/~nlprot/yapex.txt |
https://www2.informatik.hu-berlin.de/~hakenber/links/benchmarks.html |