datasource

A library of code for parsing (mostly biomedical) data source files and converting their contents to RDF

Primary language: Java. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

Prerequisites

  • Java, at least version 8, is required.
  • Apache Maven is required to build the project.
  • If you intend to build this project inside of an IDE, such as Eclipse, please see the instructions for using the Lombok library with your IDE here.
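
The prerequisites above can be checked from a shell before building. The snippet below is a minimal sketch, not part of this project: `require_cmd` is a hypothetical helper name, and it only checks that each tool is on the PATH (it does not verify that the Java version is 8 or later).

```shell
# Hypothetical helper (not shipped with this project): succeeds if the named
# command is on the PATH, otherwise prints a message to stderr and fails.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing prerequisite: $1" >&2
    return 1
}

# git is needed for the clone step, mvn for the build, java to run the code.
for tool in git mvn java; do
    require_cmd "$tool" || echo "install $tool before building"
done
```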

Installation

To use the scripts included in this project, e.g. to generate an RDF representation for a given datasource from the command line, you must download and install the project:

$ git clone https://github.com/UCDenver-ccp/datasource datasource.git
$ cd datasource.git
$ mvn clean install

Scripts must be run from the project's base directory.

If you are interested in programmatic access to the file parsers and related code, the libraries are available as Maven artifacts:

Maven signature if only using the file parser API

<dependency>
	<groupId>edu.ucdenver.ccp</groupId>
	<artifactId>datasource-fileparsers</artifactId>
	<version>0.6.1</version>
</dependency>

<repository>
	<id>bionlp-sourceforge</id>
	<url>http://svn.code.sf.net/p/bionlp/code/repo/</url>
</repository>

Maven signature if interested in generating RDF of parsed file content

<dependency>
	<groupId>edu.ucdenver.ccp</groupId>
	<artifactId>datasource-rdfizer</artifactId>
	<version>0.6.1</version>
</dependency>

<repository>
	<id>bionlp-sourceforge</id>
	<url>http://svn.code.sf.net/p/bionlp/code/repo/</url>
</repository>

Development

This project follows the Git-Flow approach to branching as originally described here. To facilitate the Git-Flow branching approach, this project makes use of the jgitflow-maven-plugin as described here.

Code in the master branch reflects the latest release (v0.6.1) of this library. Code in the development branch contains the most up-to-date version of this project.

Available file parsers

This library contains file parsers for files from many different biomedical databases. The table below lists the datasources, files, and relevant file parser class. Many of the file parsers are capable of automatically downloading the file that they parse. Those files that cannot be downloaded automatically typically require registration, login, or a user-specific license. The "Download" column is used to indicate which files cannot be downloaded automatically. This list is not guaranteed to be exhaustive.

| Data source | File | Parser class | RDF Generation Key | Download |
| --- | --- | --- | --- | --- |
| DIP | dip{DATE}.txt.gz | DipYYYYMMDDFileParser | | MANUAL |
| DrugBank | drugbank.xml | DrugbankXmlFileRecordReader | DRUGBANK | AUTO |
| Gene Ontology | annotation files | GeneAssociationFileParser | | AUTO |
| GOA | gp_association.goa_uniprot.gz | GpAssociationGoaUniprotFileParser | GOA | AUTO |
| HGNC | hgnc_complete_set.txt.gz | HgncDownloadFileParser | HGNC | AUTO |
| InterPro | interpro2go | InterPro2GoFileParser | INTERPRO_INTERPRO2GO | AUTO |
| InterPro | names.dat | InterProNamesDatFileParser | INTERPRO_NAMESDAT | AUTO |
| InterPro | protein2ipr.dat.gz | InterProProtein2IprDatFileParser | INTERPRO_PROTEIN2IPR | AUTO |
| IRefWeb | All.mitab.{DATE}.txt.zip | IRefWebPsiMitab2_6FileParser | IREFWEB | AUTO |
| MGI | MGI_EntrezGene.rpt | MGIEntrezGeneFileParser | MGI_ENTREZGENE | AUTO |
| MGI | MGI_Geno_Disease.rpt | MGIGenoDiseaseFileRecordReader | | AUTO |
| MGI | MGI_PhenoGenoMP.rpt | MGIPhenoGenoMPFileParser | MGI_MGIPHENOGENO | AUTO |
| MGI | MRK_List2.rpt | MRKListFileParser | MGI_MRKLIST | AUTO |
| MGI | MRK_Reference.rpt | MRKReferenceFileParser | MGI_MRKREFERENCE | AUTO |
| MGI | MRK_Sequence.rpt | MRKSequenceFileParser | MGI_MRKSEQUENCE | AUTO |
| MGI | MRK_SwissProt.rpt | MRKSwissProtFileParser | MGI_MRKSWISSPROT | AUTO |
| miRBase | miRNA.dat.gz | MirBaseMiRnaDatFileParser | MIRBASE | AUTO |
| NCBI Gene | gene2accession.gz | EntrezGene2AccessionFileParser | | AUTO |
| NCBI Gene | gene2pubmed.gz | EntrezGene2PubmedFileParser | | AUTO |
| NCBI Gene | gene2refseq.gz | EntrezGene2RefseqFileParser | NCBIGENE_GENE2REFSEQ | AUTO |
| NCBI Gene | gene_info.gz | EntrezGeneInfoFileParser | NCBIGENE_GENEINFO | AUTO |
| NCBI Gene | mim2gene_medgen | EntrezGeneMim2GeneFileParser | NCBIGENE_MIM2GENE | AUTO |
| NCBI Gene | gene_refseq_uniprotkb_collab.gz | EntrezGeneRefSeqUniprotKbCollabFileParser | NCBIGENE_REFSEQUNIPROTCOLLAB | AUTO |
| NCBI Homologene | homologene.data | HomoloGeneDataFileParser | HOMOLOGENE | AUTO |
| NCBI RefSeq | RefSeq-release{##}.catalog.gz | RefSeqReleaseCatalogFileParser | REFSEQ_RELEASECATALOG | AUTO |
| PharmGKB | diseases.tsv | PharmGkbDiseaseFileParser | PHARMGKB_DISEASE | AUTO |
| PharmGKB | drugs.tsv | PharmGkbDrugFileParser | PHARMGKB_DRUG | AUTO |
| PharmGKB | genes.tsv | PharmGkbGeneFileParser | PHARMGKB_GENE | AUTO |
| PharmGKB | relations.tsv | PharmGkbRelationFileParser | PHARMGKB_RELATION | MANUAL |
| PhosphoSite | Acetylation_site_dataset.gz | AcetylationPhosphositeFileParser | | MANUAL |
| PhosphoSite | Disease-associated_sites.gz | DiseasePhosphositeFileParser | | MANUAL |
| PhosphoSite | Kinase_Substrate_Dataset.gz | KinasePhosphositeFileParser | | MANUAL |
| PhosphoSite | Methylation_site_dataset.gz | MethylationPhosphositeFileParser | | MANUAL |
| PhosphoSite | O-GalNAc_site_dataset.gz | OGalNAcPhosphositeFileParser | | MANUAL |
| PhosphoSite | O-GlcNAc_site_dataset.gz | OGlcNAcPhosphositeFileParser | | MANUAL |
| PhosphoSite | Phosphorylation_site_dataset.gz | PhosphorylationPhosphositeFileParser | | MANUAL |
| PhosphoSite | Regulatory_sites.gz | RegulatoryPhosphositeFileParser | | MANUAL |
| PhosphoSite | Sumoylation_site_dataset.gz | SumoylationPhosphositeFileParser | | MANUAL |
| PhosphoSite | Ubiquitination_site_dataset.gz | UbiquitinationPhosphositeFileParser | | MANUAL |
| PreMod | human_module_tab.txt.gz | HumanPReModModuleTabFileParser | PREMOD_HUMAN | AUTO |
| PreMod | mouse_module_tab.txt.gz | MousePReModModuleTabFileParser | PREMOD_MOUSE | AUTO |
| Protein Ontology | promapping.txt | ProMappingFileParser | PR_MAPPINGFILE | AUTO |
| Reactome | UniProt2Reactome.txt | ReactomeUniprot2PathwayStidTxtFileParser | REACTOME_UNIPROT2PATHWAYSTID | AUTO |
| RGD | GENES_RAT.txt | RgdRatGeneFileRecordReader | RGD_GENES | AUTO |
| UniProt | uniprot_sprot.xml.gz | SwissProtXmlFileRecordReader | UNIPROT_SWISSPROT | AUTO |
| UniProt | uniprot_trembl.xml.gz | TremblXmlFileRecordReader | UNIPROT_TREMBL_SPARSE | AUTO |
| UniProt | idmapping_selected.tab.gz | UniProtIDMappingFileRecordReader | UNIPROT_IDMAPPING | AUTO |

Generating RDF representations of parsed database files

This library also contains code that can convert file parser output into a structured database record/field representation using RDF.

The structure of the RDF is described in:

KaBOB: Ontology-Based Semantic Integration of Biomedical Databases
Kevin M Livingston, Michael Bada, William A Baumgartner, Lawrence E Hunter
BMC Bioinformatics (accepted)

The generated RDF serves as a foundation for the KaBOB Knowledge Base of Biology. Detailed instructions on how to generate RDF to feed into KaBOB can be found below and here.

The following script can be used to generate an RDF representation for a given data source file:

datasource-rdfizer/scripts/download-datasources-and-generate-triples.sh

Parameters:
  [-d]: The directory into which to place the downloaded datasource files.
  [-r]: The directory into which to place the RDF triples parsed from the 
        datasource files.
  [-i]: The names of the datasources (comma-delimited) to download and process; 
        if not specified, all available datasources will be downloaded and 
        processed. These names are listed in the "RDF Generation Key" column in 
        the table above.
  [-t]: A comma-separated list of NCBI taxonomy IDs. Only records for these IDs 
        will be included in the RDF triple output where applicable. If neither 
        -t nor -m is specified, all records will be included.
  [-m]: Include only human and the 7 model organisms (fly, rat, mouse, yeast, 
        worm, arabidopsis, and zebrafish) in the generated RDF. If neither -t 
        nor -m is specified, all records will be included.
  [-c]: Clean the data source files. If set, this flag will cause the data 
        source files to be re-downloaded prior to processing.
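
To illustrate how the optional parameters combine, here is a minimal sketch of a wrapper that assembles the command line and prints it for review instead of running it. `build_rdf_cmd` is a hypothetical helper, not part of this project; the directories are placeholders, and `MIRBASE` and `HGNC` are keys taken from the "RDF Generation Key" column.

```shell
# Path to the project script, relative to the project's base directory.
SCRIPT=./datasource-rdfizer/scripts/download-datasources-and-generate-triples.sh

# Hypothetical helper: prints the command line for the given settings.
# Arguments: data dir, RDF dir, datasource names (may be empty),
# taxonomy IDs (may be empty).
build_rdf_cmd() {
    data_dir=$1; rdf_dir=$2; sources=$3; taxa=$4
    set -- "$SCRIPT" -d "$data_dir" -r "$rdf_dir"
    # Omitting -i processes every available datasource.
    if [ -n "$sources" ]; then set -- "$@" -i "$sources"; fi
    # Omitting both -t and -m includes all records.
    if [ -n "$taxa" ]; then set -- "$@" -t "$taxa"; fi
    printf '%s\n' "$*"
}

# Two datasources, restricted to human (9606) and mouse (10090).
build_rdf_cmd /tmp/data /tmp/rdf MIRBASE,HGNC 9606,10090
```

Printing the command first makes it easy to confirm which flags will be passed before starting a long download.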

Data source files that are publicly available will be automatically downloaded and saved under the directory specified by the -d parameter. Data source files that require manual download must be manually placed under the directory specified by the -d parameter prior to RDF generation. Data source names that can be used as input to the -i parameter in the download-datasources-and-generate-triples.sh script are listed in the above table in the "RDF Generation Key" column. They can also be seen by running the following script:

datasource-rdfizer/scripts/list-datasource-names.sh
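
A datasource name can be checked against that listing before starting a long run. The sketch below assumes the listing script prints one name per line, which is not confirmed here; `is_known_datasource` is a hypothetical helper, and the sample input stands in for the script's real output.

```shell
# Hypothetical helper: reads datasource names from stdin (assumed to be one
# per line) and succeeds only if the given name matches a whole line exactly.
is_known_datasource() {
    grep -Fqx -- "$1"
}

# Intended use, from the project's base directory:
#   ./datasource-rdfizer/scripts/list-datasource-names.sh | is_known_datasource MIRBASE
# Demonstrated here with sample names taken from the table above:
printf 'MIRBASE\nHGNC\nUNIPROT_IDMAPPING\n' | is_known_datasource HGNC \
    && echo "HGNC is available"
```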

Example RDF Generation

miRBase RDF Generation

For example, to generate RDF for the miRBase database file:

$ export DATA_DIR=[BASE_DIRECTORY_WHERE_DATA_FILES_TO_PARSE_LIVE]
$ export RDF_DIR=[BASE_DIRECTORY_WHERE_RDF_WILL_BE_WRITTEN]
$ mkdir -p $DATA_DIR
$ mkdir -p $RDF_DIR
$ export DATE=[TODAYS_DATE_TO_TIMESTAMP_THE_DATA e.g. 2015-04-16]
$ mvn clean install
$ ./datasource-rdfizer/scripts/download-datasources-and-generate-triples.sh \
    -d $DATA_DIR \
    -r $RDF_DIR \
    -i MIRBASE

Note: you may need to adjust the Java heap size in pom-rdf-gen.xml depending on the memory limitations of your hardware.

Species-specific subsets

It can sometimes be beneficial to limit RDF output to a specific species or group of species. Doing so can improve RDF generation time as well as limit the number of triples produced when parsing a file. Some of the file parsers are species-aware and the script allows one to specify the NCBI taxonomy ID of the species to which triple generation should be constrained. For example, to constrain output to UniProt ID mapping records that pertain only to human (NCBI taxonomy ID: 9606), run:

./datasource-rdfizer/scripts/download-datasources-and-generate-triples.sh \
    -d $DATA_DIR \
    -r $RDF_DIR \
    -i UNIPROT_IDMAPPING \
    -t 9606
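
Because -t takes a comma-separated list, several species can be requested at once, for example human and mouse as `9606,10090`. The following is a minimal sketch of a format check to run before invoking the script. `valid_taxon_list` is a hypothetical helper, and it assumes that IDs must be plain digits separated by single commas with no spaces; the script's own parsing may be more lenient.

```shell
# Hypothetical helper: succeeds if the argument looks like a comma-separated
# list of numeric NCBI taxonomy IDs (digits and single commas only).
valid_taxon_list() {
    case $1 in
        ''|*[!0-9,]*|,*|*,|*,,*) return 1 ;;  # empty, stray chars, bad commas
        *) return 0 ;;
    esac
}

TAXA=9606,10090   # human and mouse
if valid_taxon_list "$TAXA"; then
    echo "ok to pass: -t $TAXA"
fi
```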

For human plus seven model organisms (fly, rat, mouse, yeast, worm, arabidopsis, and zebrafish), use the -m parameter:

./datasource-rdfizer/scripts/download-datasources-and-generate-triples.sh \
    -d $DATA_DIR \
    -r $RDF_DIR \
    -i UNIPROT_IDMAPPING \
    -m

Note: when a taxon-aware file parser is used, some extra data is downloaded to ensure that the mappings from biological concepts to taxon identifiers are present. This download can be time-consuming because one of the files is very large, but it is a one-time cost.