HOP: the High-throughput Oligogenic Prioritizer
This repository provides the code to run the HOP predictor and reproduce the results presented in the paper "Prioritization of Oligogenic Variant Combinations in Exomes"
1) Requirements
Data requirements
The knowledge graph BOCK needs to be downloaded from zenodo (https://doi.org/10.5281/zenodo.7185680) -> BOCK-graphml
extracted and stored in the folder data/BOCK_data/
under the name bock.graphml
or the config.ini file should
be updated.
In order to run the tool, patient exomes need to be annotated with features used by the VarCoPP2.0 predictor.
Annotated files for the
1KGP exomes that were used as templates to generate the synthetic exomes, as well as annotated files for the
OLIDA combinations to be inserted in the 1KGP exomes in order to be able to reproduce the data
presented in the paper can be downloaded from zenodo : https://doi.org/10.5281/zenodo.8121283
The annotated_data archive should be downloaded and the content extracted in the data
directory
(tar -xvf annotated_data.tar.gz
). If you save the data elsewhere, the config.ini file should be updated.
This archive contains annotated data for 100 exomes (~45GB when extracted).
The annotated UK10K patients can not be shared because of the UK10K data access policy. The annotated UK10K exome files can be shared upon request if the requester can prove to have authorized access to the data.
Python dependencies
Python>=3.7 The following scripts have only been tested on unix environments (MacOS and linux) and we do not offer support for windows users.
We recommend installing the library dependencies inside a conda environment by following these steps:
conda create --name <env> python=3.7
conda activate <env>
conda install -c conda-forge graph-tool
pip install -r requirements.txt
2) Running the code
The different types of use of the scripts can be accessed from the hop.py
script.
We provide for 3 main usages:
- Running the HOP predictor to predict the cross-validation exomes
- Running the HOP predictor to prioritize variants in the independent set exomes
- Running the HOP predictor on any one exome generated from 1KGP and OLIDA combinations with user-defined seeds.
These three different usages can be perfomed via:
python hop.py <crossval | indep | predict>
Different arguments can then be provided depending on the action.
Common arguments for all actions:
-p
--patient_name
(required): ID of the template patient (it has to be the ID for which the annotations are provided)-rf
--results_folder
(required): The folder in which you want to save the results. It will be created in thedata
folder.-r
--restart
(optional): The value of the restart parameter for the RWR algorithm (default = 0.3), value has to be between 0 and 1.-n
--nb_top
(optional): Number of combinations to output as top ranked combinations. Default is 100.-wr
--write_raw
(optional): Use the flag if you want the raw results to be written, i.e. all score values for all combinatons in the exome should be written. Defaults to False (see Important Note below)
Important note: With the "write_raw" option, the predictor outputs a file with all generated combinations and their pathogenicity, disease-relevance and final scores. For some exomes, this can generate extremely large files even if they are gzipped. If you are running this code on your own laptop or have little space available, it might be preferable to not use this option.
Reproducing paper results
Cross-validation results
The action crossval
will, for one exome template, divide the training set in 10 folds, train 10 VarCoPP2.0 models,
and sequentially insert an OLIDA combination in the exome template, predict the exome with the relevant
VarCoPP2.0 model and HOP scores.
Specific arguments:
-st
--seed_type
(optional): Which type of seeds should be use by HOP. Options are "All", "HPO", "Panel" and "HPO+panel". Defaults to "All"-o
--all_operators
(optional): Boolean option for whether you want the different types of operators (MAX, MIN, AVGE, MULT) to be computed for the computation of the FinalScore. Defaults to False.-fs
--final_score
(optional): In the case when "all" is used for the seed type and/or if the "all_operators" option is selected, this parameter allows to define the seed and operator used for the ranking of the combinations in the top N output. Options can be specified as <seed_type>_. Defaults to HPO+Panel_AVGE.
Example:
python hop.py crossval -p <patient_id> -rf results_crossval -n 500 -st All
This will create 301 files in the data/results_crossval
folder which will contain the top
500 prioritized combinations of each of the synthetic exomes generated from the exome template and the
OLIDA combinations belonging to the independent test set.
In addition, the training set will have been splitted in 10 folds saved in the "crossval_folds" folders and 10 VarCoPP2.0 models, corresponding to each of the folds will have been saved in the "varcopp_data" folder. This allows to run the same cross-validation on the same exome templates.
Independent set results
The action indep
will, for one template exome, insert sequentially the OLIDA
combinations from the independent test set in the exome, rank combinations using HOP
with the final VarCoPP2.0 model.
-st
--seed_type
: Which type of seeds should be use by HOP. Options are are "All", "HPO", "Panel" and "HPO+Panel".-fs
--final_score
(optional): In the case when "all" is used for the seed type and/or if the "all_operators" option is selected, this parameter allows to define the seed and operator used for the ranking of the combinations in the top N output. Options can be specify as <seed_type>_. Defaults to HPO+Panel_AVGE.
Example:
python hop.py indep -p <patient_id> -rf results_independent -n 500 -st All
This will output 119 files in the data/results_independent
folder which will contain the top
500 prioritized combinations of each of the synthetic exomes generated from the exome template and the
OLIDA combinations belonging to the independent test set.
Prioritizing a single exome
The predict
action will run HOP on a single exome created by inserting one OLIDA
combination in one template exome, and using the final VarCoPP2.0 predictor to compute the
Pathogenicity Score.
In addition to the main arguments you need to specify:
-c
--combination
(required): ID of the OLIDA combination to insert in the template exome-s
--seeds
(required): List of seeds to use for the RWR, separated by a comma. Should be a list of HPO id terms or Ensembl gene IDs. If the seeds are not found in the KG, the output ranking will be based on the Pathogenicity Score alone.
Example:
python hop.py predict -p <patient_id> -rf results_single -n 500 -c OLI002 -s HP:0002092
This will output one file in the results/results_single
folder with the top 500 ranked combinations
when the OLI002 is inserted in the exome of the patient with template <patient_id>.
4) Contributors
Main contributor to the code: Barbara Gravel
People who also contributed to the development of the code:
- Alexandre Renaux
5) License
This code is licensed under the MIT License (https://choosealicense.com/licenses/mit/).