incremental data integration for scoring genotype-disease associations

Phenoscoring

Phenoscoring is a suite of programs for incremental data integration, adapted for use with phenotype data and for tracking model-disease associations.

Incremental data integration is a concept for how to summarize datasets that can change over time. Suppose, for example, that information at an initial moment in time can be collapsed into an association score. When new evidence becomes available that is consistent with the previous data, it is natural to presume that a recomputed association score should reflect that. However, this is not necessarily the case when the score is computed with commonly used integration approaches such as averaging, consensus, or normalization - those approaches only work well at a single moment in time. Incremental data integration is a pivot toward tracking associations in a way that is consistent at all time points, or all stages of a dataset lifecycle.

The focus on consistency at multiple stages in a dataset lifecycle has strong consequences for how association scores can be defined. In short, the focus shifts toward cumulative summaries as opposed to normalized quantities. Phenoscoring is an exploration of these consequences in a specific context where the data consists of phenotypes associated with mouse models and the goal is to track the similarity of these models to human disease.
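The contrast between normalized and cumulative summaries can be illustrated with a toy calculation. The sketch below is not the scoring method used by Phenoscoring; the function names, the evidence values, and the noisy-OR style combination rule are all assumptions chosen only to show the general behavior.

```python
def average_score(evidence):
    """Normalized summary: arithmetic mean of evidence strengths in [0, 1]."""
    return sum(evidence) / len(evidence)


def cumulative_score(evidence):
    """Cumulative summary (illustrative noisy-OR rule): 1 - prod(1 - e).

    Each supporting item can only shrink the remaining doubt,
    so the score never decreases when supporting evidence is added.
    """
    remaining = 1.0
    for e in evidence:
        remaining *= (1.0 - e)
    return 1.0 - remaining


# Hypothetical evidence strengths, not taken from any real dataset.
initial = [0.8]
updated = initial + [0.4]  # new evidence: weaker, but still supportive

print("average:   ", average_score(initial), "->", average_score(updated))
print("cumulative:", cumulative_score(initial), "->", cumulative_score(updated))
```

With these values the average falls from 0.8 to about 0.6 even though the new evidence agrees with the old, while the cumulative score rises from 0.8 to about 0.88.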

Installation

Programs in the phenoscoring suite require Python 3.6 to run. You can check your version of Python using one of the following commands:

python --version
python3 --version

Some third-party packages are also required. These can be installed using the pip package manager:

pip install jsonpickle numpy numba
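As an optional sanity check, the short script below reports whether the interpreter and the three packages named above are available. It is a sketch, not part of the phenoscoring suite: the helper names `python_ok` and `missing_packages` are hypothetical, and it assumes that "Python 3.6" means version 3.6 or later.

```python
import importlib.util
import sys

# Assumption: the requirement is read as "3.6 or later".
REQUIRED_PYTHON = (3, 6)
REQUIRED_PACKAGES = ["jsonpickle", "numpy", "numba"]


def python_ok(version_info=sys.version_info, minimum=REQUIRED_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def missing_packages(names):
    """Return the names that cannot be located as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version OK:", python_ok())
print("Missing packages:", missing_packages(REQUIRED_PACKAGES))
```

An empty list of missing packages means that all three can be imported by the interpreter that ran the script.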

The software can be installed by cloning from the source repo.

git clone https://www.github.com/tkonopka/Phenoscoring

No further installation procedures are required. However, running the software successfully requires assembling a number of data files, described below and in the linked documentation pages.

Usage

The repo root contains a number of executables. A typical workflow relies on just two of these programs.

The repo also contains other executables. These are auxiliary scripts that can be relevant for debugging, post-processing, or supplementary calculations.

  • obotools.py is a set of miscellaneous tools for handling ontology obo files
  • download_GXD.py fetches expression data in mouse tissues
  • phenojoin.py prepares data files with models supported by IMPC and MGI

Applications

Algorithms and applications are described in a manuscript.

Konopka T, Smedley D. Incremental data integration for tracking genotype-disease associations. PLOS Comp Bio.