A library of code for parsing (mostly biomedical) data source files
- Java, at least version 8, is required.
- Apache Maven is required to build the project.
- If you intend to build this project inside of an IDE, such as Eclipse, please see the instructions for using the Lombok library with your IDE here.
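The prerequisites above can be checked from a shell before building. The following is a minimal sketch and not part of this project: the helper names `java_major_version` and `have_command` are hypothetical, and the parsing assumes the version string follows either the legacy `1.8.0_292` scheme or the newer `11.0.2` scheme.

```shell
# Hypothetical helper (not part of this project): print the major version
# for a Java version string such as "1.8.0_292" or "11.0.2".
java_major_version() {
  _v=$1
  # Legacy scheme: Java 8 and earlier report themselves as 1.x
  case "$_v" in
    1.*) _v=${_v#1.} ;;
  esac
  # Keep everything before the first '.', '_' or '-'
  printf '%s\n' "${_v%%[._-]*}"
}

# Hypothetical helper: succeed if the named command is available on the PATH.
have_command() {
  command -v "$1" >/dev/null 2>&1
}
```

For example, `java_major_version 1.8.0_292` prints `8`, which meets the Java 8 minimum, and `have_command mvn` succeeds only when Maven is on the PATH.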
To use the scripts included in this project, e.g. to generate an RDF representation for a given datasource from the command line, you must download and install the project:
$ git clone https://github.com/UCDenver-ccp/datasource datasource.git
$ cd datasource.git
$ mvn clean install
Scripts must be run from the project's base directory.
If you are interested in programmatic access to the file parsers and related code, the libraries are available as Maven artifacts:
<dependency>
<groupId>edu.ucdenver.ccp</groupId>
<artifactId>datasource-fileparsers</artifactId>
<version>0.6.1</version>
</dependency>
<repository>
<id>bionlp-sourceforge</id>
<url>http://svn.code.sf.net/p/bionlp/code/repo/</url>
</repository>
<dependency>
<groupId>edu.ucdenver.ccp</groupId>
<artifactId>datasource-rdfizer</artifactId>
<version>0.6.1</version>
</dependency>
<repository>
<id>bionlp-sourceforge</id>
<url>http://svn.code.sf.net/p/bionlp/code/repo/</url>
</repository>
This project follows the Git-Flow approach to branching as originally described here. To facilitate the Git-Flow branching approach, this project makes use of the jgitflow-maven-plugin as described here.
Code in the master branch reflects the latest release (v0.6.1) of this library. Code in the development branch contains the most up-to-date version of this project.
This library contains file parsers for files from many different biomedical databases. The table below lists each datasource, its files, and the relevant file parser class. Many of the file parsers are capable of automatically downloading the file that they parse. Files that cannot be downloaded automatically typically require registration, login, or a user-specific license. The "Download" column indicates which files cannot be downloaded automatically. This list is not guaranteed to be exhaustive.
This library also contains code that can convert file parser output into a structured database record/field representation using RDF.
The structure of the RDF is described in:
KaBOB: Ontology-Based Semantic Integration of Biomedical Databases
Kevin M Livingston, Michael Bada, William A Baumgartner, Lawrence E Hunter
BMC Bioinformatics (accepted)
The generated RDF serves as a foundation for the KaBOB Knowledge Base of Biology. Detailed instructions on how to generate RDF to feed into KaBOB can be found below and here.
The following script can be used to generate an RDF representation for a given data source file:
datasource-rdfizer/scripts/download-datasources-and-generate-triples.sh
Parameters:
[-d]: The directory into which to place the downloaded datasource files.
[-r]: The directory into which to place the RDF triples parsed from the
datasource files.
[-i]: The names of the datasources (comma-delimited) to download and process;
if not specified, all available datasources will be downloaded and
processed. These names are listed in the "RDF Generation Key" column in
the table above.
[-t]: A comma-separated list of NCBI taxonomy IDs. Only records for these IDs
will be included in the RDF triple output where applicable. If neither
-t nor -m is specified, all records will be included.
[-m]: Include only human and the 7 model organisms (fly, rat, mouse, yeast,
worm, arabidopsis, and zebrafish) in the generated RDF. If neither -t
nor -m is specified, all records will be included.
[-c]: Clean the data source files. If set, this flag will cause the data
source files to be re-downloaded prior to processing.
Data source files that are publicly available will be automatically downloaded and saved under the directory specified by the -d parameter. Data source files that require manual download must be manually placed under the directory specified by the -d parameter prior to RDF generation.
Data source names that can be used as input to the -i parameter in the download-datasources-and-generate-triples.sh script are listed in the above table in the "RDF Generation Key" column. They can also be seen by running the following script:
datasource-rdfizer/scripts/list-datasource-names.sh
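A name can be checked against that list before starting a long download. This is a minimal sketch under a loud assumption: it expects one datasource name per line on standard input, which may not match the actual output format of list-datasource-names.sh. The helper name `is_known_datasource` is hypothetical.

```shell
# Hypothetical helper (not part of this project): succeed if the name in $1
# appears as a complete line on standard input.
# ASSUMPTION: the input holds one datasource name per line.
is_known_datasource() {
  # -q: no output, -x: match the whole line, -F: treat the name as a fixed string
  grep -qxF -- "$1"
}
```

If the script's output does follow that shape, it could be piped in as `./datasource-rdfizer/scripts/list-datasource-names.sh | is_known_datasource MIRBASE`; otherwise the output would need filtering first.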
For example, to generate RDF for the MirBase database file:
$ export DATA_DIR=[BASE_DIRECTORY_WHERE_DATA_FILES_TO_PARSE_LIVE]
$ export RDF_DIR=[BASE_DIRECTORY_WHERE_RDF_WILL_BE_WRITTEN]
$ mkdir -p $DATA_DIR
$ mkdir -p $RDF_DIR
$ export DATE=[TODAYS_DATE_TO_TIMESTAMP_THE_DATA e.g. 2015-04-16]
$ mvn clean install
$ ./datasource-rdfizer/scripts/download-datasources-and-generate-triples.sh \
-d $DATA_DIR \
-r $RDF_DIR \
-i MIRBASE
Note: you may need to adjust the Java heap size in pom-rdf-gen.xml depending on the memory available on your hardware.
It can sometimes be beneficial to limit RDF output to a specific species or group of species. Doing so can improve RDF generation time as well as limit the number of triples produced when parsing a file. Some of the file parsers are species-aware and the script allows one to specify the NCBI taxonomy ID of the species to which triple generation should be constrained. For example, to constrain output to UniProt ID mapping records that pertain only to human (NCBI taxonomy ID: 9606), run:
$ ./datasource-rdfizer/scripts/download-datasources-and-generate-triples.sh \
-d $DATA_DIR \
-r $RDF_DIR \
-i UNIPROT_IDMAPPING \
-t 9606
For human plus seven model organisms (fly, rat, mouse, yeast, worm, arabidopsis, and zebrafish), use the -m parameter:
$ ./datasource-rdfizer/scripts/download-datasources-and-generate-triples.sh \
-d $DATA_DIR \
-r $RDF_DIR \
-i UNIPROT_IDMAPPING \
-m
Note: when a taxon-aware file parser is used, some extra data is downloaded to ensure that the mappings from biological concepts to taxon identifiers are present. This download can be time-consuming because one of the files is very large, but it is a one-time cost.
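Because the -t parameter accepts a comma-separated list, output can also be constrained to several species in one run. The following is a minimal sketch in POSIX shell; `join_by_comma` is a hypothetical helper, and 10090 (the NCBI taxonomy ID for mouse) is used purely for illustration.

```shell
# Hypothetical helper (not part of this project): join all arguments with commas.
# The subshell keeps the IFS change from leaking into the caller.
join_by_comma() {
  ( IFS=','; printf '%s\n' "$*" )
}

# 9606 = human, 10090 = mouse
TAXON_IDS=$(join_by_comma 9606 10090)

# Show the flag as it would be passed to the script
printf '%s\n' "-t $TAXON_IDS"
```

The resulting value can be passed as `-t "$TAXON_IDS"` in place of `-t 9606` in the human-only example.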