A library of code for parsing (mostly biomedical) data source files

Prerequisites

Java, at least version 8, is required.
Apache Maven is required to build the project.
If you intend to build this project inside an IDE, such as Eclipse, please see the instructions for using the Lombok library with your IDE here.
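
The prerequisites above can be checked from a terminal before building. The following is a minimal sketch, not part of this project: the function names `java_major_version` and `check_tool` are hypothetical helpers introduced only for illustration, and the parsing assumes the usual `version "x.y.z"` line printed by `java -version`.

```shell
#!/bin/sh
# Extract the major Java version from a version string such as
# "1.8.0_292" (Java 8) or "11.0.2" (Java 11).
java_major_version() {
    version=$1
    case "$version" in
        1.*) version=${version#1.} ;;   # legacy "1.x" scheme: drop the leading "1."
    esac
    # Keep only the leading run of digits.
    echo "${version%%[!0-9]*}"
}

# Report whether a required tool is on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: NOT FOUND"
    fi
}

check_tool java
check_tool mvn

if command -v java >/dev/null 2>&1; then
    # `java -version` writes to stderr, so redirect it before parsing.
    raw=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    major=$(java_major_version "$raw")
    if [ -n "$major" ] && [ "$major" -ge 8 ]; then
        echo "Java $major satisfies the Java 8+ requirement"
    else
        echo "Java 8 or later is required (found: ${raw:-unknown})"
    fi
fi
```

The version parsing is kept in its own function so it can be tried on sample strings without a Java installation.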

Installation

To use the scripts included in this project, e.g. to generate an RDF representation for a given datasource from the command line, you must download and install the project:

$ git clone https://github.com/UCDenver-ccp/datasource datasource.git
$ cd datasource.git
$ mvn clean install

Scripts must be run from the project's base directory.

If you are interested in programmatic access to the file parsers and related code, the libraries are available as Maven artifacts:

Maven signature if only using the file parser API

<dependency>
	<groupId>edu.ucdenver.ccp</groupId>
	<artifactId>datasource-fileparsers</artifactId>
	<version>0.6.1</version>
</dependency>

<repository>
	<id>bionlp-sourceforge</id>
	<url>http://svn.code.sf.net/p/bionlp/code/repo/</url>
</repository>

Maven signature if interested in generating RDF of parsed file content

<dependency>
	<groupId>edu.ucdenver.ccp</groupId>
	<artifactId>datasource-rdfizer</artifactId>
	<version>0.6.1</version>
</dependency>

<repository>
	<id>bionlp-sourceforge</id>
	<url>http://svn.code.sf.net/p/bionlp/code/repo/</url>
</repository>

Development

This project follows the Git-Flow approach to branching as originally described here. To facilitate the Git-Flow branching approach, this project makes use of the jgitflow-maven-plugin as described here.

Code in the master branch reflects the latest release (v0.6.1) of this library. Code in the development branch contains the most up-to-date version of this project.

Available file parsers

This library contains file parsers for files from many different biomedical databases. The table below lists the data sources, files, and relevant file parser classes. Many of the file parsers are capable of automatically downloading the file that they parse. Files that cannot be downloaded automatically typically require registration, login, or a user-specific license. The "Download" column marks such files as MANUAL and all others as AUTO. This list is not guaranteed to be exhaustive.

Data source	File	Parser class	RDF Generation Key	Download
DIP	dip{DATE}.txt.gz	DipYYYYMMDDFileParser		MANUAL
DrugBank	drugbank.xml	DrugbankXmlFileRecordReader	DRUGBANK	AUTO
Gene Ontology	annotation files	GeneAssociationFileParser		AUTO
GOA	gp_association.goa_uniprot.gz	GpAssociationGoaUniprotFileParser	GOA	AUTO
HGNC	hgnc_complete_set.txt.gz	HgncDownloadFileParser	HGNC	AUTO
InterPro	interpro2go	InterPro2GoFileParser	INTERPRO_INTERPRO2GO	AUTO
InterPro	names.dat	InterProNamesDatFileParser	INTERPRO_NAMESDAT	AUTO
InterPro	protein2ipr.dat.gz	InterProProtein2IprDatFileParser	INTERPRO_PROTEIN2IPR	AUTO
IRefWeb	All.mitab.{DATE}.txt.zip	IRefWebPsiMitab2_6FileParser	IREFWEB	AUTO
MGI	MGI_EntrezGene.rpt	MGIEntrezGeneFileParser	MGI_ENTREZGENE	AUTO
MGI	MGI_Geno_Disease.rpt	MGIGenoDiseaseFileRecordReader		AUTO
MGI	MGI_PhenoGenoMP.rpt	MGIPhenoGenoMPFileParser	MGI_MGIPHENOGENO	AUTO
MGI	MRK_List2.rpt	MRKListFileParser	MGI_MRKLIST	AUTO
MGI	MRK_Reference.rpt	MRKReferenceFileParser	MGI_MRKREFERENCE	AUTO
MGI	MRK_Sequence.rpt	MRKSequenceFileParser	MGI_MRKSEQUENCE	AUTO
MGI	MRK_SwissProt.rpt	MRKSwissProtFileParser	MGI_MRKSWISSPROT	AUTO
miRBase	miRNA.dat.gz	MirBaseMiRnaDatFileParser	MIRBASE	AUTO
NCBI Gene	gene2accession.gz	EntrezGene2AccessionFileParser		AUTO
NCBI Gene	gene2pubmed.gz	EntrezGene2PubmedFileParser		AUTO
NCBI Gene	gene2refseq.gz	EntrezGene2RefseqFileParser	NCBIGENE_GENE2REFSEQ	AUTO
NCBI Gene	gene_info.gz	EntrezGeneInfoFileParser	NCBIGENE_GENEINFO	AUTO
NCBI Gene	mim2gene_medgen	EntrezGeneMim2GeneFileParser	NCBIGENE_MIM2GENE	AUTO
NCBI Gene	gene_refseq_uniprotkb_collab.gz	EntrezGeneRefSeqUniprotKbCollabFileParser	NCBIGENE_REFSEQUNIPROTCOLLAB	AUTO
NCBI Homologene	homologene.data	HomoloGeneDataFileParser	HOMOLOGENE	AUTO
NCBI RefSeq	RefSeq-release{##}.catalog.gz	RefSeqReleaseCatalogFileParser	REFSEQ_RELEASECATALOG	AUTO
PharmGKB	diseases.tsv	PharmGkbDiseaseFileParser	PHARMGKB_DISEASE	AUTO
PharmGKB	drugs.tsv	PharmGkbDrugFileParser	PHARMGKB_DRUG	AUTO
PharmGKB	genes.tsv	PharmGkbGeneFileParser	PHARMGKB_GENE	AUTO
PharmGKB	relations.tsv	PharmGkbRelationFileParser	PHARMGKB_RELATION	MANUAL
PhosphoSite	Acetylation_site_dataset.gz	AcetylationPhosphositeFileParser		MANUAL
PhosphoSite	Disease-associated_sites.gz	DiseasePhosphositeFileParser		MANUAL
PhosphoSite	Kinase_Substrate_Dataset.gz	KinasePhosphositeFileParser		MANUAL
PhosphoSite	Methylation_site_dataset.gz	MethylationPhosphositeFileParser		MANUAL
PhosphoSite	O-GalNAc_site_dataset.gz	OGalNAcPhosphositeFileParser		MANUAL
PhosphoSite	O-GlcNAc_site_dataset.gz	OGlcNAcPhosphositeFileParser		MANUAL
PhosphoSite	Phosphorylation_site_dataset.gz	PhosphorylationPhosphositeFileParser		MANUAL
PhosphoSite	Regulatory_sites.gz	RegulatoryPhosphositeFileParser		MANUAL
PhosphoSite	Sumoylation_site_dataset.gz	SumoylationPhosphositeFileParser		MANUAL
PhosphoSite	Ubiquitination_site_dataset.gz	UbiquitinationPhosphositeFileParser		MANUAL
PreMod	human_module_tab.txt.gz	HumanPReModModuleTabFileParser	PREMOD_HUMAN	AUTO
PreMod	mouse_module_tab.txt.gz	MousePReModModuleTabFileParser	PREMOD_MOUSE	AUTO
Protein Ontology	promapping.txt	ProMappingFileParser	PR_MAPPINGFILE	AUTO
Reactome	UniProt2Reactome.txt	ReactomeUniprot2PathwayStidTxtFileParser	REACTOME_UNIPROT2PATHWAYSTID	AUTO
RGD	GENES_RAT.txt	RgdRatGeneFileRecordReader	RGD_GENES	AUTO
UniProt	uniprot_sprot.xml.gz	SwissProtXmlFileRecordReader	UNIPROT_SWISSPROT	AUTO
UniProt	uniprot_trembl.xml.gz	TremblXmlFileRecordReader	UNIPROT_TREMBL_SPARSE	AUTO
UniProt	idmapping_selected.tab.gz	UniProtIDMappingFileRecordReader	UNIPROT_IDMAPPING	AUTO

Generating RDF representations of parsed database files

This library also contains code that can convert file parser output into a structured database record/field representation using RDF.

The structure of the RDF is described in:

KaBOB: Ontology-Based Semantic Integration of Biomedical Databases
Kevin M Livingston, Michael Bada, William A Baumgartner, Lawrence E Hunter
BMC Bioinformatics (accepted)

The generated RDF serves as a foundation for the KaBOB Knowledge Base of Biology. Detailed instructions on how to generate RDF to feed into KaBOB can be found below and here.

The following script can be used to generate an RDF representation for a given data source file:

datasource-rdfizer/scripts/download-datasources-and-generate-triples.sh

Parameters:
  [-d]: The directory into which to place the downloaded datasource files.
  [-r]: The directory into which to place the RDF triples parsed from the 
        datasource files.
  [-i]: The names of the datasources (comma-delimited) to download and process; 
        if not specified, all available datasources will be downloaded and 
        processed. These names are listed in the "RDF Generation Key" column in 
        the table above.
  [-t]: A comma-separated list of NCBI taxonomy IDs. Only records for these IDs 
        will be included in the RDF triple output where applicable. If neither 
        -t nor -m is specified, all records will be included.
  [-m]: Include only human and the 7 model organisms (fly, rat, mouse, yeast, 
        worm, arabidopsis, and zebrafish) in the generated RDF. If neither -t 
        nor -m is specified, all records will be included.
  [-c]: Clean the data source files. If set, this flag will cause the data 
        source files to be re-downloaded prior to processing.

Data source files that are publicly available will be automatically downloaded and saved under the directory specified by the -d parameter. Data source files that require manual download must be manually placed under the directory specified by the -d parameter prior to RDF generation. Data source names that can be used as input to the -i parameter in the download-datasources-and-generate-triples.sh script are listed in the above table in the "RDF Generation Key" column. They can also be seen by running the following script:

datasource-rdfizer/scripts/list-datasource-names.sh
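
For data sources that require manual download, a quick check that the files are in place can save a failed run. This is a minimal sketch under stated assumptions: `has_data_file` and `check_manual_files` are hypothetical helpers, and the search looks anywhere under the data directory because the exact subdirectory each parser expects is not stated here.

```shell
#!/bin/sh
# Succeed if a file with the given name exists anywhere under the directory.
has_data_file() {
    search_dir=$1
    file_name=$2
    [ -d "$search_dir" ] &&
        [ -n "$(find "$search_dir" -type f -name "$file_name" 2>/dev/null | head -n 1)" ]
}

# Report each expected file as present or missing; return non-zero if any is missing.
check_manual_files() {
    data_dir=$1
    shift
    missing=0
    for expected in "$@"; do
        if has_data_file "$data_dir" "$expected"; then
            echo "present: $expected"
        else
            echo "missing: $expected"
            missing=1
        fi
    done
    return "$missing"
}

# Example: relations.tsv (PharmGKB) is listed as a MANUAL download in the table.
check_manual_files "${DATA_DIR:-.}" relations.tsv ||
    echo "Download the missing files before generating RDF."
```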

Example RDF Generation

miRBase RDF Generation

For example, to generate RDF for the miRBase database file:

$ export DATA_DIR=[BASE_DIRECTORY_WHERE_DATA_FILES_TO_PARSE_LIVE]
$ export RDF_DIR=[BASE_DIRECTORY_WHERE_RDF_WILL_BE_WRITTEN]
$ mkdir -p $DATA_DIR
$ mkdir -p $RDF_DIR
$ export DATE=[TODAYS_DATE_TO_TIMESTAMP_THE_DATA e.g. 2015-04-16]
$ mvn clean install
$ ./datasource-rdfizer/scripts/download-datasources-and-generate-triples.sh \
    -d $DATA_DIR \
    -r $RDF_DIR \
    -i MIRBASE

Note: you may need to adjust the Java heap size in pom-rdf-gen.xml depending on the memory limitations of your hardware.

Species-specific subsets

It can sometimes be beneficial to limit RDF output to a specific species or group of species. Doing so can improve RDF generation time as well as limit the number of triples produced when parsing a file. Some of the file parsers are species-aware and the script allows one to specify the NCBI taxonomy ID of the species to which triple generation should be constrained. For example, to constrain output to UniProt ID mapping records that pertain only to human (NCBI taxonomy ID: 9606), run:

./datasource-rdfizer/scripts/download-datasources-and-generate-triples.sh \
    -d $DATA_DIR \
    -r $RDF_DIR \
    -i UNIPROT_IDMAPPING \
    -t 9606
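
Since -t expects a comma-separated list of NCBI taxonomy IDs, a malformed value is easy to catch before starting a long run. The check below is a sketch only: `is_valid_taxon_list` is a hypothetical helper, and it verifies the format (digits separated by single commas), not whether each ID exists in the NCBI taxonomy. The ID 10090 (mouse) is used purely as a second sample value.

```shell
#!/bin/sh
# Succeed only if the argument is a non-empty list of digit-only IDs
# separated by single commas, e.g. "9606" or "9606,10090".
is_valid_taxon_list() {
    case "$1" in
        ""|*[!0-9,]*|,*|*,|*,,*) return 1 ;;   # empty, stray characters, or misplaced commas
        *) return 0 ;;
    esac
}

for candidate in "9606" "9606,10090" "human" "9606,"; do
    if is_valid_taxon_list "$candidate"; then
        echo "ok:      -t $candidate"
    else
        echo "invalid: -t $candidate"
    fi
done
```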

For human plus seven model organisms (fly, rat, mouse, yeast, worm, arabidopsis, and zebrafish), use the -m parameter:

./datasource-rdfizer/scripts/download-datasources-and-generate-triples.sh \
    -d $DATA_DIR \
    -r $RDF_DIR \
    -i UNIPROT_IDMAPPING \
    -m

Note: when a taxon-aware file parser is used, some extra data is downloaded to ensure that the mappings from biological concepts to taxon identifiers are present. This download can be time-consuming because one of the files is very large, but it is a one-time cost.
