This repository supports the HeartENN models published with the Richter et al. (2020) publication, Genomic analyses implicate noncoding de novo variants in congenital heart disease.
If you are interested in running HeartENN to make additional predictions, we recommend HeartENN models be used with Selene version 0.0.0 (already specified as a submodule for this repository). To use this repository along with the specified version of Selene, you can run
git clone --recursive git@github.com:FunctionLab/HeartENN-models.git
(If you have problems loading the Selene submodule in this way, you can instead manually clone and load this branch of Selene, which contains one modification to the original release that allows users to load these models with newer versions of PyTorch.)
If you only need to make comparisons against the predictions published in the manuscript, we recommend downloading the full set directly from this Zenodo record. Please note that variant effect prediction was originally run with the variants specified in hg19 coordinates, but the Supplementary Table provided in the publication specifies hg38 coordinates. To match the two, you'll need to liftOver
the coordinates in the Zenodo TSVs. You
will need to combine the human and mouse model TSVs (see 'How to use the output') and get the max score per variant to get the "HeartENN score" referred to in the publication.
Please post in the Github issues or e-mail Kathy Chen (kc31@princeton.edu) directly with any questions. Note that this repository is still a work-in-progress, with minimal instructions for getting started. We are happy to work with you by email until everything is stable and all documentation is pushed.
- Make sure you have
numpy
,cython
, anddocopt
installed in your conda environment. (I've included aheartenn-env.yml
example conda spec file that you can try using, but in practice I've found environment YAML files aren't easy to use out-of-the-box.) - Build Selene:
python setup.py build_ext --inplace
- Locally install Selene:
python setup.py install
- We've included the VCF files analyzed in the HeartENN publication in the directory
./run_variant_effect_prediction/data
. These are specified in hg19 coordinates, so you should download and use an hg19 FASTA as input intovep_cli.py
for all of these VCFs. For example:
cd run_variant_effect_prediction
python ./vep_cli.py ./data/dnvs.vcf <path-to-file>/hg19.fa ./output
Add the --cuda
flag if you can run the script on a CUDA-enabled GPU. It will take substantially longer on CPU (should only take a few mins to run on a GPU node).
After vep_cli.py
successfully runs, you will see two different output directories within output
: human_model
and mouse_model
. This is because HeartENN includes both chromatin features measured in human and mouse, for which the training is done separately. The generated .ref
and .alt
files in these directories are the separate ref
and alt
sequence predictions, which are used to compute the abs_diffs.tsv
file of variant effect predictions (|alt - ref|
).
To use the results the way we did so in the HeartENN publication, combine the predictions for human and mouse into a single TSV. Example code snippet:
import pandas as pd
HMODEL_FILE = <path to human_model/*abs_diff.tsv>
MMODEL_FILE = <path to mouse_model/*abs_diff.tsv>
hdf = pd.read_csv(HMODEL_FILE, sep='\t')
mdf = pd.read_csv(MMODEL_FILE, sep='\t')
hdf = hdf.set_index(['chrom', 'pos', 'name', 'ref', 'alt'])
mdf = mdf.set_index(['chrom', 'pos', 'name', 'ref', 'alt'])
combined_df = pd.concat([hdf, mdf], axis=1)
Note that the HeartENN chromatin profile names do not specify mouse vs human, so you could update the hdf
and mdf
column names to distinguish between the two if desired.
The 'HeartENN score' is the max abs_diff
value across all chromatin features for each variant.
- You will need to download hg19 and mm9 FASTA files and specify these paths (depending on which HeartENN model you run the evaluation on) as input to the
eval_cli.py
script. - Example command:
cd run_model_evaluation
python ./eval_cli.py ./heartenn1_mouse_eval.bed.yml
<path-to-file>/mm9.fa
./mouse_eval_outputs
--cuda
- Please note that the Python scripts and configuration YAML files included in these repositories contain a number of relative path specifications, in Linux-style paths. These must be updated if you move this directory outside of its original source download.
- For model evaluation: the configuration files specify as input the
*_testset.selene.bed
files. These are generated using Selene's IntervalsSampler, where each row in the BED file is the input sequence coordinates (in hg19 or mm9, respectively) and the chromatin feature classes measured in that sequence (semicolon-separated, each number is the 0-indexed row number in the model's chromatin features file).