IMPPROVE Container

Overview

This container applies IMPPROVE-trained random forest models to VCF files, then saves and aggregates the predictions.

The input VCFs must use the Hg38 reference genome. This is validated when running the filter command, which automates annotation with bcftools and filtering with slivar, using the recommended rare-disease filter:

bcftools csq --samples - --ncsq 40 --gff-annot $annot_data_file --local-csq --fasta-ref $genome_data $vcf_input_file --output-type u |
slivar expr --vcf - -g $gnomad_data_file --info 'INFO.impactful && INFO.gnomad_popmax_af < 0.01 && variant.QUAL >= 20 && variant.ALT[0] != "*"' -o $outfile

In the command above, $vcf_input_file is a placeholder for the VCF file being processed. $annot_data_file is the Ensembl Hg38 annotation file (Homo_sapiens.GRCh38.108.gff3.gz), $genome_data is either an Ensembl-style (Hg38p14e.fa) or UCSC-style (Hg38p14.fa) Hg38 FASTA file, and $gnomad_data_file is the gnomAD annotation file that slivar uses (via -g) to look up population allele frequencies. All three reference files can be found in the distributed data directory.
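For a manual run of this pipeline outside the container, the placeholders can be bound as shell variables first. A sketch, assuming the reference files have been copied to a local data/ directory, and with gnomad.zip used purely as a stand-in name for the gnomAD file from the data directory:

# bind the placeholders used in the pipeline above
annot_data_file=data/Homo_sapiens.GRCh38.108.gff3.gz
genome_data=data/Hg38p14.fa          # or the Ensembl-style Hg38p14e.fa
gnomad_data_file=data/gnomad.zip     # stand-in name; substitute the real file
vcf_input_file=input.vcf
outfile=filtered.vcf

With these set, the two-step command above can be pasted and run as-is.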

Running bcftools and slivar

bcftools and slivar can both be downloaded from the internet, but are also baked into this container.
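Once the image has been loaded (see the next section), the bundled copies can be invoked directly by overriding the image entrypoint, for example:

# check the bundled bcftools version
podman run --rm --entrypoint bcftools localhost/apply-impprove --version
# print slivar's usage
podman run --rm --entrypoint slivar localhost/apply-impprove

This assumes both binaries are on the image's PATH; if not, the full in-image paths are needed.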

Running the container image

The container image along with the required data and models can be downloaded as an 84 GiB bundle from https://doi.org/10.26182/37wm-bz23.

This container can be run with any docker-archive-compatible software (e.g. docker, podman, apptainer); here I will use podman.

The image can be decompressed and loaded with the load command, then run:

gunzip apply-impprove.tar.gz
podman load --input apply-impprove.tar
podman run localhost/apply-impprove --help

It is also possible to create an Apptainer/Singularity container from this container image tarball; see https://apptainer.org/docs/user/latest/docker_and_oci.html#containers-in-docker-archive-files for more information.

apptainer build apply-impprove.sif docker-archive:apply-impprove.tar
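The resulting image can be run directly with apptainer run, attaching the same volumes with --bind, for example:

apptainer run \
    --bind ~/Downloads/minimal_impprove_data:/data \
    apply-impprove.sif --help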

Help

To see the help for the main entrypoint, run the container with the --help flag. As mentioned in the help information, this README can be retrieved by running the container with the argument --readme.
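Assuming --readme prints to standard output, the README can be captured for offline reading:

podman run --rm localhost/apply-impprove --readme > IMPPROVE-README.txt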

Setup

In order to apply IMPPROVE random forest models, this container needs two sets of data:

  • IMPPROVE models to apply, and
  • A small collection of datasets needed to prepare the input

The “small collection” of datasets consists of ~1 GB of tabular files found at tkiHOHPC2111:/data/scratch/timothy/minimal_impprove_data/, plus a cadd.hdf5 file (see tkiHOHPC2111:/data/scratch/timothy/cadd_small.hdf5). The cadd.hdf5 file should be copied into (not symlinked into) the minimal_impprove_data folder; one way to assemble this folder is sketched below.
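A possible assembly of the folder (a sketch, assuming SSH access to tkiHOHPC2111, and that cadd.hdf5 is the filename the container expects):

# copy the tabular datasets locally
rsync -av tkiHOHPC2111:/data/scratch/timothy/minimal_impprove_data/ ~/Downloads/minimal_impprove_data/
# copy (not symlink) the CADD file into the same folder under the expected name
scp tkiHOHPC2111:/data/scratch/timothy/cadd_small.hdf5 ~/Downloads/minimal_impprove_data/cadd.hdf5

The assembled folder can then be mounted as the /data volume: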

podman run \
    -v ~/Downloads/impprove_models/some_particular_set:/models \
    -v ~/Downloads/minimal_impprove_data:/data \
    -v ~/Documents/impprove_run/predictions:/predictions \
    -v ~/Documents/impprove_run/vcfs:/vcfs \
    localhost/apply-impprove

The /models volume should be a directory of per-HPO models, and look something like this:

/models
├── 2.jlso
├── 3.jlso
├── 8.jlso
⋮
└── 3000050.jlso

A selection of pre-trained models is available under tkiHOHPC2111:/data/scratch/timothy/impprove_pretrained_models.
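For example, to fetch one model set into the directory mounted earlier (again assuming SSH access to tkiHOHPC2111; some_particular_set is a placeholder for a real set name):

rsync -av \
    tkiHOHPC2111:/data/scratch/timothy/impprove_pretrained_models/some_particular_set/ \
    ~/Downloads/impprove_models/some_particular_set/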

Output

For each input VCF, an output folder is created within the /predictions volume. Each folder will have a structure something like this:

/predictions/input_file_name.vcf
├── all-predictions.csv
├── all-preds.png
├── confusion-matrices.csv
├── hpo-descriptions.csv
├── model-scores.csv
├── top-50-preds.png
├── variable-importances.csv
└── wmean-good-prediction.csv

all-predictions.csv contains information on each variant of the VCF: each row starts with the Ensembl gene ID and the gene name, then gives the variant itself (chromosome, position, ref, and alt), and each subsequent “HP:XXXXXXX” column gives the prediction from the model built for that HPO term. A counterpart to this is confusion-matrices.csv, which provides a confusion matrix for each prediction, using its score as a threshold applied to the out-of-bag (OOB) predictions.

Each model used has an associated self-evaluation score from the bootstrapped test/train process; these scores are given in model-scores.csv. The variable importances for each model are found in variable-importances.csv. Each model is labelled according to its HPO id; descriptions of the HPO ids used can be found in hpo-descriptions.csv.

The score-weighted average prediction across models with a score of at least 0.4 is given in wmean-good-prediction.csv.
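To eyeball the strongest candidates, the aggregated file can be sorted from the shell. A sketch, assuming a header row and that the weighted-mean score sits in the final column (adjust if the layout differs):

# keep the header, then list the ten rows with the highest final-column score
head -n 1 wmean-good-prediction.csv
tail -n +2 wmean-good-prediction.csv |
    awk -F, '{print $NF "\t" $0}' | sort -rn | cut -f2- | head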

top-50-preds.png is a heatmap of the predictions for the top 50 variants overall, while all-preds.png is the same heatmap over all variants.

Sample

A small sample VCF is bundled with this container for testing, and can be fetched by running the container with the argument --sample. It contains three known pathogenic variants and 5000 benign variants.
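Putting the pieces together, a minimal end-to-end test might look like this (a sketch, assuming --sample writes the sample VCF to standard output):

mkdir -p ~/Documents/impprove_run/vcfs ~/Documents/impprove_run/predictions
podman run --rm localhost/apply-impprove --sample > ~/Documents/impprove_run/vcfs/sample.vcf
podman run \
    -v ~/Downloads/impprove_models/some_particular_set:/models \
    -v ~/Downloads/minimal_impprove_data:/data \
    -v ~/Documents/impprove_run/predictions:/predictions \
    -v ~/Documents/impprove_run/vcfs:/vcfs \
    localhost/apply-impprove

Results should then appear under /predictions (here ~/Documents/impprove_run/predictions) in a sample.vcf folder, as described in the Output section above.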