# Introduction
The BioCADDIE Pilot 3.2 is a scalable data mining platform for cross-linking data sets and publications. This pilot is part of the BioCADDIE project and provides tools for extracting data set mentions from full-text publications in the PubMedCentral Open Access Subset. Its initial focus is the extraction of data mentions for Protein Data Bank data sets, but the framework can be extended to other data resources. It also offers tools to analyze citation networks in PubMedCentral using a number of network metrics to rank data mentions by importance.
This project operates on the following data sets:
- Protein Data Bank (PDB): >110,000 3D structures of biomolecules (current list)
- PubMedCentral Open Access Subset: >1 million free full-text articles (current list)
## What are Data Mentions?
Data mentions are references to data sets in publications. They fall into two categories:
1. Structured data mentions can be recognized by regular expression matching.
2. Unstructured data mentions require natural language processing and machine learning to disambiguate valid from invalid data mentions.
Structured data mentions for PDB Identifiers
Reference | Example |
---|---|
PDB ID | PDB ID: 1STP |
PDB DOI | http://dx.doi.org/10.2210/pdb1stp/pdb |
RCSB PDB URL | http://www.rcsb.org/../structureId=1stp |
NXML External Link | `<ext-link .. ext-link-type="pdb" xlink:href="1STP">` |
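These structured patterns can be matched directly with regular expressions. The snippet below is a minimal sketch of that approach; the patterns and the `find_structured_mentions` helper are illustrative assumptions, not the project's actual expressions.

```python
import re

# Illustrative patterns (assumptions, not the project's actual expressions).
# A PDB ID is four characters: a digit 1-9 followed by three alphanumerics.
PDB_ID = r"[1-9][A-Za-z0-9]{3}"

STRUCTURED_PATTERNS = {
    "pdb_id":   re.compile(r"PDB\s+ID[:\s]+(" + PDB_ID + r")", re.IGNORECASE),
    "pdb_doi":  re.compile(r"dx\.doi\.org/10\.2210/pdb(" + PDB_ID + r")/pdb", re.IGNORECASE),
    "rcsb_url": re.compile(r"rcsb\.org/\S*structureId=(" + PDB_ID + r")", re.IGNORECASE),
}

def find_structured_mentions(text):
    """Return (pattern_name, pdb_id) pairs found in a text fragment."""
    mentions = []
    for name, pattern in STRUCTURED_PATTERNS.items():
        for match in pattern.finditer(text):
            mentions.append((name, match.group(1).upper()))
    return mentions

print(find_structured_mentions(
    "Coordinates were deposited (PDB ID: 1STP, http://dx.doi.org/10.2210/pdb1stp/pdb)."))
# [('pdb_id', '1STP'), ('pdb_doi', '1STP')]
```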
Unstructured data mentions for PDB Identifiers
Type | Example |
---|---|
Valid PDB ID (4AHQ) | The structure of the active site of the K165C enzyme (4AHQ) ... |
Invalid PDB ID (2C19) | The polymorphisms of cytochrome P450 2C19 (CYP2C19) gene ... |
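A bare four-character pattern cannot separate a deposited structure (4AHQ) from a gene allele name (2C19), so the surrounding text has to be taken into account. The sketch below shows one way to collect context-window features for a classifier; the window size and helper function are made up for illustration and are not the project's feature set.

```python
import re

# Any four-character token of this shape is only a *candidate* mention.
PDB_ID_PATTERN = re.compile(r"[1-9][A-Za-z0-9]{3}")

def context_features(text, window=5):
    """Collect the words around each PDB-ID-like token as classifier features
    (the window size of 5 is an illustrative choice)."""
    tokens = text.split()
    examples = []
    for i, token in enumerate(tokens):
        candidate = token.strip("().,;")
        if PDB_ID_PATTERN.fullmatch(candidate):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            examples.append({"candidate": candidate.upper(), "context": context})
    return examples

# A valid mention and a false positive yield very different contexts:
print(context_features("The structure of the active site of the K165C enzyme (4AHQ) ..."))
print(context_features("The polymorphisms of cytochrome P450 2C19 (CYP2C19) gene ..."))
```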
## Data Mention Extraction for PDB IDs
The extraction of data mentions involves the following steps (a sketch of the disambiguation step follows the list):
- Download PDB and PMC metadata
- Download PMC OA full-text articles
- Create positive and negative training/test sets for data mention disambiguation
- Fit machine learning model for data mention disambiguation
- Predict PDB data mentions for all PMC OA articles
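As a rough sketch of the fit and predict steps, the example below builds a Spark MLlib pipeline over labeled context snippets. The column names, hashing-TF features, and choice of logistic regression are assumptions for illustration, not the project's actual model.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pdb-mention-disambiguation").getOrCreate()

# Illustrative training data: sentence context around a candidate PDB ID and a
# label (1.0 = valid data mention, 0.0 = false positive such as a gene name).
training = spark.createDataFrame([
    ("the structure of the active site of the K165C enzyme 4AHQ", 1.0),
    ("the polymorphisms of cytochrome P450 2C19 CYP2C19 gene", 0.0),
], ["context", "label"])

# Tokenize the context, hash the tokens into a feature vector, and fit a
# logistic regression classifier (the model choice is an assumption).
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="context", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features", numFeatures=1 << 16),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(training)

# Apply the fitted model to candidate mentions extracted from articles.
candidates = spark.createDataFrame(
    [("refined coordinates were deposited as 1STP in the protein data bank",)],
    ["context"])
model.transform(candidates).select("context", "prediction").show(truncate=False)
```

In practice, the training data comes from the positive and negative sets built in the earlier steps, and the fitted model is applied to candidate mentions from all PMC OA articles.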
See Data Mention Extraction details for a description of each step.
## Project Status
This project is in active development. Expect major refactoring of the current code.
## Want to Use or Contribute to this Project?
Contact us at pwrose@ucsd.edu.
## Technology Stack
This project relies on the open-source technologies Apache Spark and Apache Parquet to make literature data mining fast and parallelizable.
### Apache Spark
Apache Spark is a fast and general framework for large-scale in-memory data processing. It runs locally or in an in-house or commercial cloud environment. We use Spark DataFrames to store, filter, sort, and join data sets.
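As an illustration of these DataFrame operations, the sketch below joins a hypothetical table of extracted mentions with article metadata and aggregates mention counts; the column names and rows are made up for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# Hypothetical DataFrames: extracted PDB mentions and PMC article metadata.
mentions = spark.createDataFrame(
    [("PMC1234567", "1STP"), ("PMC1234567", "2PTC"), ("PMC7654321", "4AHQ")],
    ["pmcId", "pdbId"])
metadata = spark.createDataFrame(
    [("PMC1234567", 2014), ("PMC7654321", 2016)],
    ["pmcId", "publicationYear"])

# Join mentions with metadata, filter by year, and count mentions per PDB entry.
(mentions
    .join(metadata, "pmcId")
    .filter(col("publicationYear") >= 2015)
    .groupBy("pdbId")
    .count()
    .orderBy(col("count").desc())
    .show())
```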
### Apache Parquet
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. We store Spark DataFrames as Parquet files for high-performance data handling.
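A minimal sketch of the Parquet round trip described above; the output path and the example DataFrame are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-sketch").getOrCreate()

mentions = spark.createDataFrame(
    [("PMC1234567", "1STP"), ("PMC7654321", "4AHQ")],
    ["pmcId", "pdbId"])

# Write the DataFrame as Parquet (the path is a placeholder).
mentions.write.mode("overwrite").parquet("data/mentions.parquet")

# Reading it back preserves the schema; column pruning and predicate pushdown
# keep queries over large mention tables fast.
spark.read.parquet("data/mentions.parquet").filter("pdbId = '1STP'").show()
```

Storing intermediate DataFrames this way lets each stage of the pipeline be rerun independently without re-downloading or re-parsing the source articles.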