A list of resources for scholarly data analysis, including datasets, papers, and code on bibliometrics, citation analysis, and other scholarly commons resources. Available online at
- Arnet Miner
- Microsoft Academic Graph
- OpenAlex - Replacement for MAG
- Open Academic Graph - MAG + AMiner
- OpenAIRE Research Graph - More info here
- Semantic Scholar Corpus
- CiteSeer
- PubMed
- CORA datasets for citation string parsing
- Humanities and multilingual citation string parsing: Flux-CiM and ICONIP (see the Neural ParsCit paper for details)
- Citation string parsing data for social sciences for English and German citations - comparison with Grobid and Cermine
- CrossRef DOI URLs
- DOIboost (Crossref + MAG + ORCID + Unpaywall)
- DBLP Citation dataset
- DBLP XML data
- DBLP Discovery Dataset (D3)
- NBER Patent Citations
- Scopus Citation Database
- Papers, patents, and grants from Indiana University
- Small Network Data - Mark Newman's Lab
- The Koblenz Network Collection
- Google Scholar citation relations
- Google Scholar Citations data set direct-download
- Open citations project
- Wikicite Project
- Economic Papers
- ArXiv data dump
- ArXiv data on Kaggle
- EuropePMC
- Complete ACL anthology as bibtex file
- ACL Anthology Reference Corpus
- Astrophysics data system (ADS) - All physics papers
- CORE 37M full text open access papers
- Inspire database for high energy physics articles
- Scholarly Data of workshops and conferences in RDF triples
- The Collection of Computer Science Bibliographies
- OpenCitations corpus
- COCI Doi-Doi citation data
- DOAJ API (Directory of Open Access Journals)
- ROAD (Directory of Open Access Scholarly Resources)
- Sherpa/Romeo (Publisher copyright policies & self-archiving)
- OpenAPC (fees paid for open access journal articles)
- OSF API (Open Science Framework)
- Digital tools for researchers
- Fatcat - versioned, publicly-editable catalog of research publications
- Microsoft Academic Knowledge Graph - RDF dump
- arXiv CS citation in context
- arXiv fulltext + citations dataset
- Self-citation analysis data based on PubMed Central subset (2002-2005)
- Unpaywalled Corpus - PDF to 23M DOIs Data Schema
- A dataset of publication records for Nobel laureates - paper
- OpenAIRE Scholexplorer - 126+ Million literature-dataset and dataset-dataset links between 12+ Million objects - About the data
- Manually annotated citation data from the ACL Anthology into uses, motivation, future, extends, compare or contrast, and background
- iCite - NIH Open Citation Collection
- MEDLINE/PubMed Baseline Repository (MBR) - All Medline abstracts and paper metadata in XML
- American Physical Society Data Sets for Research
- Co-citation networks of all Nature papers
- Semantic Scholar Graph of References in Context (GORC) dataset
- Multiple journal publication datasets
- Structured citations in the English Wikipedia
- ICSR Lab (free for researchers) for scopus and plumx use
- COVID-19 Open Research Dataset (CORD-19)
- PaperRobot - includes PubMed Paper Reading Dataset
- SciMag - Microsoft Academic Linked to SciMago Journals - WebPage
- SciGraph Springer Nature
- Citations to scholarly data in various language wikipedias Code
- 800K publications matched from CrossRef, CORE, and Mendeley with data on publication and open access dates
- Coronavirus Open Citations Dataset
- Crossref dumps DOI meta-data
- S2ORC: The Semantic Scholar Open Research Corpus - 12.7M full text papers
- Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia
- Microsoft Academic Data for conducting covid-19 research
- Initiative for Open Abstracts
- Dataset Search: metadata for datasets - Datasets with DOIs and compact identifiers
- Open Syllabus Project
- Journal Causal effect in Citations
- Sci-Hub Download Logs - Latest
- Sci-Hub databases
- SAGE Rejected article tracker dataset from ArXiv - Github
- The Open Research Knowledge Graph (ORKG)
- Test of Time Awards
- ACL-Cite-Net
- The DBLP Discovery Dataset (D3): A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research Zenodo
- Papers and patents are becoming less disruptive over time - Paper
- OpenAIRE Research Graph Dump
- OpCitance: Citation contexts identified from the PubMed Central open access articles
- A large dataset of scientific text reuse in Open-Access publications
- A dataset of publication records for Nobel laureates
- PeerRead - paper drafts, reviews, and accept/reject decision
- CiteTracked: A Longitudinal Dataset of Peer Reviews and Citations - Contact Author
- Elsevier's Peer Review Workbench
- ACL-18 Numerical Peer Review Dataset
- Argument Mining for Understanding Peer Reviews
- APE: Argument Pair Extraction - Annotated ICLR 2013-2020 review-rebuttal argument pair
- Argument Mining Driven Analysis of Peer-Reviews Dataset
- Publons review length dataset with 498K reviews - anonymized
- Peer review analyze: A novel benchmark resource for computational analysis of peer reviews
- Open Editors: data about scholarly journals' editors and editorial board members - Github
- NLPEER: A Unified Resource for the Computational Study of Peer Review
- eLife Open Peer Review Corpus
- PLoS Open Peer Review Corpus
- MDPI Open Peer Review Corpus
- GrantExplorer: a free, open-source tool for examining the phrases funded by U.S. federal agencies
- Award Data Archive
- NIH research funding
- Authors linked to PIs in NIH Grants
- Mathematics Genealogy Project
- Academic Tree - Cross discipline academic genealogies
- MPACT project - Library Sciences
- PhDTree
- Chemistry Genealogy - curated at UIUC
- Notre Dame Genealogy Project
- UIUC Chemistry, Chemical Engineering, and Biochemistry
- Software Engineering Academic Genealogy
- Other lists of genealogy projects
- Wikipedia - Computer Science Genealogy
- Wikipedia - Theoretical Physicists Genealogy
- Wikipedia - Chemists Genealogy
- SCIENTIFIC GENEALOGY MASTER LIST - Scientists Associated with Concepts in Chemistry & Physics
- Economic Genealogy Text Format
- S2AMP: Semantic Scholar Analysis of Mentorship Dataset
- MENTORSHIP - A dataset of mentorship in science with semantic and demographic estimations - Code
- Temporal profiles of PubMed authors
- ORCID data dump
- National Library of Medicine Profiles
- UIUC Professors database - Publications, Affiliations
- Author Profiles of scholarly authors in Wikipedia
- Career Transitions of CS students
- Author name gender and ethnicity dataset based on PubMed
- MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide
- Conceptual novelty scores for PubMed articles
- Database of 100,000 top scientists with standardized information on citations, h-index, co-authorship adjusted hm-index, citations to papers in different authorship positions, and a composite indicator
- Canadian PhD career survey - Science report
- Data from the CVs of over 150 assistant professors in psychology in top-ranked research universities and small liberal art colleges in the US - Used in this blog
- Wikidata Author Disambiguation Dataset
- The 4 Universities Data Set - Web pages of CS departments classified for author role (faculty, student, etc.)
- Journal editors dataset
- Career-long citation metrics for 100,000 top scientists
- Network-Data-Career-Transitions - two anonymized network datasets of post-PhD career transitions and trajectories in computing research
- Open dataset of scholars on Twitter - 500K OpenAlex Author ID to Twitter User Id
- Gender Inequities in the Online Dissemination of Scholars’ Work
- INSPIRE dataset
- Lee Giles dataset
- Cleaner version of Lee Giles dataset
- DBLP Korean Authors
- Arnet Miner
- Arnet Miner - Manual Name Disambiguation data 210 authors
- DBLP Name disambiguation dataset - Error corrected version
- rexa-coref-data
- Deduped author names on IEEE Vis papers 1990-2018
- Author-ity dataset for PubMed 2009
- ACL Anthology dataset
- Base data for estimating precision and recall of Author-ity among NIH-funded scientists
- ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale
- S2AND - Semantic Scholar Author Name Disambiguation Tool and Dataset
- BibTex Dataset for 1M authors
- Ethnicity sensitive author disambiguation from INSPIRE HEP
- Pre-processed PubMed data for a study of coauthorship
- WhoIsWho: Web-Scale Academic Name Disambiguation: the WhoIsWho Benchmark, Leaderboard, and Toolkit
- LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation - Github
- Open Access Theses and Dissertations
- The Networked Digital Library of Theses and Dissertations (NDLTD)
- PhD Dissertations in the Area of Software Engineering
- ProQuest Dissertations & Theses Global
- History Dissertation Analysis
- Peer-making: the interconnections between PhD Thesis Committee membership and co-publishing - Zenodo
- DISAPERE: A Dataset for DIscourse Structure in Academic PEer REview
- ETDs: Virginia Tech Electronic Theses and Dissertations
- DSpace@MIT: a digital repository for MIT's research, including peer-reviewed articles, technical reports, working papers, theses, and more
- The ScanBank Dataset: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations
- ETDMiner: extract metadata from scanned ETD Google Drive
- Citation Parsing
- Citation Parsing in humanities
- Sentences tagged for Drug Disease pairs
- Document Summarization and citation span identification
- ACL Anthology human summaries for 1000 papers
- Keyphrase Extraction
- Related Work Summarization
- Biomedical NLP annotated datasets
- Chemical compound and drug name recognition task
- Semantic Scholar Dataset
- ScienceIE
- ACL RD TEC 2.0 also at @CLARIN
- SEPID Corpus - Segmented ACL ARC 1.0
- PubMed Central Open Access - BioC
- PubMed Fulltext - protein-protein and genetic interactions
- BioNLP - Argo
- Biomedical NLP - Stav
- GENIA - BioNLP 2011
- Genia Treebank used for SciSpacy training - SciSpacy link
- Full GENIA corpus
- Anatomical Entity Mention (AnEM) corpus
- CellFinder - Entity detection
- Multi-Level Event Extraction (MLEE)
- Biomedical sentence simplification
- PubMed - Colorado Richly Annotated Full-Text
- Biomedical NER datasets related publication
- BioVerbNet
- Lunar and Planetary Science abstracts for NER and Relations
- ACM data affiliations
- ACM - DBLP database entry matching
- Colorado Richly Annotated Full-Text - PubMed abstract annotated with entities mapped to 10 biomedical ontology terms.
- CLEF datasets for multilingual Biomedical NLP+IE
- MedMentions - UMLS entities in PubMed
- Coleridge Initiative - Rich Context competition
- SciERC - scientific entities, their relations, and coreference clusters for 500 AI conf abstracts
- PubMed200k_RCT - Label abstract sentences into Objective, Background, Method, Results, Conclusions
- NER, Parsing, Classification datasets from SciBert
- ACA Wiki - Paper summaries of more than 1600 papers
- SemEval-2018 task 7 Semantic Relation Extraction and Classification in Scientific Papers
- A Compendium of Free, Public Biomedical Text Mining Tools Available on the Web
- Medical Information Extraction from PubMed abstracts
- Corpus of 40 scientific papers manually annotated with multiple scientific discourse facets
- PharmaCoNER: Pharmacological Substances, Compounds and proteins and Named Entity Recognition track - Train - Dev - Test - Background Test set
- Bacteria Biotope (BB) Task - NER, NEL, Relation, KB Extraction
- Entity/relation recognition and GOF/LOF mutated gene text identification task based on the Active Gene Annotation Corpus
- The Regulatory Network of Plant Seed Development (SeeDev) Task - NER, Relation
- TalkSumm - Summary of papers via alignment to talks
- SeminalSurveyDBLP - Classification of seminal or survey papers
- PubMed supplement-drug interactions and supplement-supplement interactions
- GENETAG - More recent versions Publication and Download 2005
- MedTag: A Collection of Biomedical Annotations - Download
- Open Biomedical corpora
- Biomedical Abstract Meaning Representation corpus based on PubMed Fulltext - Also see other NLM curated biomedical resources
- SciDTB: Discourse Dependency TreeBank for Scientific Abstracts
- SciDTB corpus annotated for argumentation mining - Paper
- Dr. Inventor Multi-layer Scientific Corpus for multiple scientific discourse facets
- ART corpus - 225 papers manually annotated with the CISP labels (e.g. "Goal", "Method", "Result") - Browse files - Project details
- Multi-CoreSC CRA corpus (MCCRA) - 50 papers annotated with multiple CoreSC labels per sentence. - Project details
- PubMedQA - Question answering on PubMed
- Corposaurus - Collection of biomedical corpus for NER
- BioNER corpus
- NeuroQuery - 14,000 full-text publications and 400,000 peak activations - NeuroQuery website
- Medical Information Extraction dataset
- A Large Parallel Corpus of Full-Text Scientific Articles
- Annotated Corpus of Scientific Conference's Homepages for Information Extraction
- Chi QA - Health Question Answering dataset from NLM
- Corpus of Open Access articles from multiple fields in Science, Technology, and Medicine - Includes wikification data
- Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
- Open Research Knowledge Graph project - Website
- Academic PhraseBank
- SciKG - Statement extraction datasets
- A Fully Coreference-annotated Corpus of Scholarly Papers from the ACL Anthology
- A manual corpus of annotated main findings of clinical case reports
- TREC Precision Medicine / Clinical Decision Support Track
- Lots of biomedical entity linking and entity identification datasets
- Materials Science Named Entity Recognition: train/development/test sets
- Entities in 3.27 million materials science abstracts
- Normalized entities in material science papers
- Named Entity Recognition for Bacterial Type IV Secretion Systems - Paper
- Annotating and detecting phenotypic information for chronic obstructive pulmonary disease
- MiRoR11 - P2 - Annotated corpus for primary and reported outcomes extraction
- Data from: PGxCorpus, a Manually Annotated Corpus for Pharmacogenomics
- Multiple PUBMED annotated corpora from iProLink project
- Mars Target Encyclopedia - LPSC abstracts labeled data set
- Annotation of phenotypes using ontologies
- The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text - SPECIES Direct Download - ORGANISMS Direct Download
- Entity mention in articles used for benchmark
- RAMBO 800+: A Corpus for the Development of Gene/Protein Recognition from Rare and Ambiguous Abbreviations
- Medical Relation Extraction - CrowdTruth
- KP20k - Keyphrase extraction on 20k abstracts
- Named Entity Recognition: (17.3 MB), 8 datasets on biomedical named entity recognition
- Relation Extraction: (2.5 MB), 2 datasets on biomedical relation extraction
- Question Answering: (5.23 MB), 3 datasets on biomedical question answering task
- SciREX: A Challenge Dataset for Document-Level Information Extraction
- Papers with Code - Links between papers and repositories and extraction of SOTA results
- Citation Context Classification based on purpose
- Citation Context Classification based on influence
- PubMed knowledge graph (PKG) Figshare
- Citation and Header Datasets
- Grobid-NER data
- Multiple NER and Entity Linking data for science
- Citation Context Classification
- S2ORC: The Semantic Scholar Open Research Corpus - 12.7M full text papers
- EuropePMC annotations for entities and relationships
- NLPContributionGraph - Structuring Scholarly NLP Contributions in the Open Research Knowledge Graph
- GROBID Sequence Labeling data
- The General Index - Metadata, Ngrams, and Keyphrases in 107,233,728 journal articles
- Pubtrends Review Dataset
- PubTator Central (PTC) - NLP annotated PMC datasets
- PubMedCentral Author Manuscript Collection
- Paper analyzer pubmed
- NER on Material Science Papers
- SoMeSci - Software Mentions in Science
- NLMChem a new resource for chemical entity recognition in PubMed full-text literature
- Scientific summarization datasets
- PubMed Classification
- Annotated scientific findings with sentence-level and aspect-level certainty
- SoftwareKG_Social and SoftwareKG_PubMed - Software mentions in articles
- Bioinformatics Named Entity Recogniser for Databases and Software
- The CodeMeta Project: preservation, discovery, reuse, and attribution of software
- Social Science Software Citation Dataset
- SoMeSci - A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles
- Softcite dataset: A gold-standard dataset of software mentions in research publications for supervised learning based named entity recognition
- SoftwareKG-PMC: a Knowledge Graph of Software mentions extracted from articles of the PMC Open Access Dataset
- DEAL: Detecting Entities in the Astrophysics Literature
- SCIERC: Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction - Code
- University of Washington BIO NLP datasets
- multimodal_summ: Multimodal summarization of research papers
- ACL Anthology Corpus - Full Text
- Entity Linking of Crossref Funding Orgs in Acknowledgements - paper
- Microsoft Academic Knowledge Graph (MAKG) - Zenodo ComplEx entity embeddings (120 GB) for all 243 million authors, 239 million publications, 49,000 journals, and 16,000 conferences
- Wikidata:WikiProject Clinical Trials
- A Dataset of Alt Texts from HCI Publications
- PubMed-OA-Extraction-dataset
- SciRepEval: A Multi-Format Benchmark for Scientific Document Representations
- The MAPLE Benchmark for Scientific Literature Tagging
- ACL Anthology Network
- I³ Open Innovation Dataset Index - Multiple datasets related to patent networks, inventor careers, etc.
- Science4cast Competition - capture the evolution of scientific concepts and predict which research topics will emerge in the coming years
- SciGraph Springer Nature
- Medical Subject Headings maintained by the National Library of Medicine of the United States
- Computer Science Ontology maintained by Scholarly Knowledge: Modeling, Mining and Sense Making
- Physics Subject Headings (PhySH) maintained by American Physical Society (APS) GitHub
- Open Biological and Biomedical Ontology (OBO) maintained by the OBO Foundry
- ACM Computing Classification System maintained by the Association for Computing Machinery
- Physics and Astronomy Classification Scheme (PACS) maintained by American Institute of Physics (AIP); discontinued in 2010 and replaced by Physics Subject Headings
- Mathematics Subject Classification (MSC) maintained by Mathematical Reviews and zbMATH
- Journal of Economic Literature (JEL) maintained by the American Economic Association
- STW Thesaurus for Economics maintained by ZBW - Leibniz Information Centre for Economics
- Australian and New Zealand Standard Research Classification (ANZSRC) maintained by the Australian Bureau of Statistics. It consists of 3 sub-classification schemes:
  - Fields of Research (FoR) classification
  - Research Fields, Courses and Disciplines (RFCD) classification
  - Socio-Economic Objective (SEO) classification
- Library of Congress Classification (LCC) maintained by Library of Congress
- Fields of Study (FoS) maintained by Microsoft Academic
- CrossRef Open Funder's Registry
- Scientific Keyphrase Extraction Datasets - KP20k, NUS, MAG_KP
- Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
- XL-BEL: a benchmark for cross-lingual biomedical entity linking, spanning 10 typologically diverse languages
- IteraTeR: Understanding Iterative Revision from Human-Written Text based on ArXiv abstract edit versions
- CiteSum: Citation Text-guided Scientific Extreme Summarization and Low-resource Domain Adaptation
- AckExtract: Extraction of acknowledgements and their named entities from scholarly papers
- The MSVEC Dataset: Multi-Domain Scientific Claim Verification Evaluation Corpus (MSVEC)
- GIANT: The 1-Billion Annotated Synthetic Bibliographic-Reference-String Dataset for Deep Citation Parsing - dataverse
- Altmetrics API
- API - documentation, example
- Core Conference Rankings
- China Computer Federation Conference Rankings
- Google Scholar
- Semantic Scholar
- Microsoft Academic Graph
- OpenAIRE Explore
- AceMap
- GitXiv
- ACL Anthology
- NIPS papers
- Abel tools for PubMed data
- infolis: linking research data and publications
- Metrics toolkit
- Rcrossref (R library)
- Rscopus (R library)
- Scholar (R library)
- Bibliometrix (R library)
- CITAN (R library)
- BibeR (BibeR: A Web-based tool for bibliometric analysis in scientific literature)
- (Python library)
- SoPaper (Python library)
- CiteSeer tools
- Novelty quantification in PubMed articles
- TidyPMC - R based PMC XML parser
- PublicationHarvester - Download PubMed publications of an author
- Publish or Perish - retrieves and analyzes academic citations from MS Academic and Scholar
- Affiliation string parser
- CiteSeerX
- Data Set Knowledge Graph (DSKG) - an RDF data set about data sets
- Citation Gecko - Find related papers
- pySciSci - Python tool for working with MAG, PubMed, etc.
- ACM Digital Library
- ContentMine - getpapers
- rcoreoa - CORE API R client
- metaknowledge - A Python library for doing bibliometric and network analysis in science and health policy research
- PubMedPortable - PubMed to Postgres
- medic - Parsing MEDLINE and storing into a DB
- Biomedical - BioSentVec Embeddings
- Biomedical embeddings - CambridgeLTL
- NIH scientific paper pre-processing
- SciSpacy - Spacy models for Biomedical NLP from AllenAI
- Multitask Biomedical NER
- SciBERT - Bert LM for Biomedical and CS papers
- Grobid
- EXCITE (Extraction of Citations from PDF Documents)
- Science-Parse
- unarXiv (Citation in context from arXiv)
- Biblio-Glutton
- CrossRef Reference Matching code and evaluation data
- Citation style classifier and evaluation data
- refextract - extracting references used in scholarly communication
- Frontiers in Research Metrics and Analytics
- Scientometrics
- Journal of Informetrics
- Quantitative Science Studies (Open Access)
- Science, technology and human values
- Social Studies of Science
- Science and Public Policy
- Joint Conference on Digital Libraries (JCDL)
- International Conference on Theory and Practice of Digital Libraries (TPDL)
- European Semantic Web Conference (ESWC), Research of Research Track
- STI Conference series (Science and Technology indicators, e.g., 2018)
- SIGMET - Metrics workshop
- International Workshop on Mining Scientific Publications
- Semantics, Analytics, Visualisation: Enhancing Scholarly Dissemination (SAVE-SD)
- Workshop on Reframing Research (RefResh)
- Enabling Open Semantic Science (SemSci)
- Workshop on Scholarly Document Processing
- International Society for Informetrics and Scientometrics (ISSI)
- European Network of Indicator Designers (ENID)
- 4S (Society for Social Studies of Science)
- SIG/MET - Special Interest Group for the measurement of information production and use
The following people have contributed to the items on this list.
- Shubhanshu Mishra - Maintainer of the list.
- Angelo Antonio Salatino
- Philipp Zumstein
- Ali (Aliakbar Akbaritabar)
- Andrea Mannocci