This container applies IMPPROVE-trained random forest models to VCF files, then saves and aggregates the predictions.
The input VCFs must use the Hg38 reference genome. This is validated when
running the `filter` command, which automates the application of `slivar` with
its recommended rare-disease filter:

```shell
bcftools csq --samples - --ncsq 40 --gff-annot $annot_data_file --local-csq --fasta-ref $genome_data $vcf_input_file --output-type u |
  slivar expr --vcf - -g $gnomad_data_file --info 'INFO.impactful && INFO.gnomad_popmax_af < 0.01 && variant.QUAL >= 20 && variant.ALT[0] != "*"' -o $outfile
```
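The `filter` command performs the Hg38 check for you, but a VCF can also be
sanity-checked by hand by inspecting the reference declared in its header. The
sketch below fabricates a minimal header purely for illustration — substitute
your real VCF, whose header may record the reference differently:

```shell
# Create a stand-in VCF header (illustrative only; use your real file).
printf '##fileformat=VCFv4.2\n##reference=GRCh38\n' > example.vcf

# Look for the reference declaration in the header.
grep -m1 '^##reference' example.vcf
```

Treat this as a heuristic only: some VCFs declare the reference via `##contig`
lines instead, so the container's own validation remains authoritative.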
In the command above, `$vcf_input_file` is a placeholder for the VCF file being
processed, `$annot_data_file` is the Ensembl Hg38 annotation file
(`Homo_sapiens.GRCh38.108.gff3.gz`), `$gnomad_data_file` is the gnomAD
annotation file used by `slivar` (via `-g`), and `$genome_data` is either an
Ensembl-style (`Hg38p14e.fa`) or UCSC-style (`Hg38p14.fa`) Hg38 FASTA file. All
three reference files can be found in the distributed data directory.
`bcftools` and `slivar` can both be downloaded from the internet, but are also
baked into this container.
The container image along with the required data and models can be downloaded as an 84 GiB bundle from https://doi.org/10.26182/37wm-bz23.
This container can be run with any docker-archive compatible software (e.g.
docker, podman, apptainer). Here I will use `podman`. The image can be loaded
with the `load` command, then run.

```shell
gunzip apply-impprove.tar.gz
podman load --input apply-impprove.tar
podman run localhost/apply-impprove --help
```
It is also possible to create an Apptainer/Singularity container from this
container image tarball; see
https://apptainer.org/docs/user/latest/docker_and_oci.html#containers-in-docker-archive-files
for more information.

```shell
apptainer build apply-impprove.sif docker-archive:apply-impprove.tar
```
To see the help of the main entrypoint, simply run the container with the
`--help` flag. As mentioned in the help information, this README can be
retrieved by running the container with the argument `--readme`.
In order to apply IMPPROVE random forest models, this container needs two sets of data:
- IMPPROVE models to apply, and
- A small collection of datasets needed to prepare the input.
The “small collection” of datasets consists of ~1 GB of tabular files found at
`tkiHOHPC2111:/data/scratch/timothy/minimal_impprove_data/`, and a `cadd.hdf5`
file (see `tkiHOHPC2111:/data/scratch/timothy/cadd_small.hdf5`). The
`cadd.hdf5` should be put in (not symlinked into) the `minimal_impprove_data`
folder. This can then be mounted as the `/data` volume.
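Concretely, placing the CADD file might look like the sketch below. The paths
are illustrative, and the `touch` line merely creates an empty stand-in for the
real file so the commands run anywhere — substitute your actual download:

```shell
# Stand-in for the real CADD file (illustrative; use your downloaded copy).
touch cadd_small.hdf5

# A plain copy into the data directory, under the expected name — not a symlink.
mkdir -p minimal_impprove_data
cp cadd_small.hdf5 minimal_impprove_data/cadd.hdf5
```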
```shell
podman run \
  -v ~/Downloads/impprove_models/some_particular_set:/models \
  -v ~/Downloads/minimal_impprove_data:/data \
  -v ~/Documents/impprove_run/predictions:/predictions \
  -v ~/Documents/impprove_run/vcfs:/vcfs \
  localhost/apply-impprove
```
The `/models` volume should be a directory of per-HPO models, and look
something like this:

```
/models
├── 2.jlso
├── 3.jlso
├── 8.jlso
⋮
└── 3000050.jlso
```
A selection of pre-trained models is available under
`tkiHOHPC2111:/data/scratch/timothy/impprove_pretrained_models`.
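The numeric model file names correspond to HPO term ids. As a sketch of the
correspondence — assuming the usual `HP:nnnnnnn` zero-padded form of HPO ids,
which this README does not spell out — the mapping can be computed as:

```python
def model_file_to_hpo(filename: str) -> str:
    """Convert a model file name like '2.jlso' to an HPO term id.

    Assumes HPO ids are the integer file stem zero-padded to seven
    digits (e.g. '2.jlso' -> 'HP:0000002'); the function name is mine.
    """
    stem = filename.removesuffix(".jlso")
    return f"HP:{int(stem):07d}"

print(model_file_to_hpo("2.jlso"))        # HP:0000002
print(model_file_to_hpo("3000050.jlso"))  # HP:3000050
```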
For each input VCF, an output folder is created within the `/predictions`
volume. Each folder will have a structure something like this:

```
/predictions/input_file_name.vcf
├── all-predictions.csv
├── all-preds.png
├── confusion-matrices.csv
├── hpo-descriptions.csv
├── model-scores.csv
├── top-50-preds.png
├── variable-importances.csv
└── wmean-good-prediction.csv
```
`all-predictions.csv` will contain information on each variant of the VCF,
starting with the Ensembl Gene ID and the gene name; then, after giving the
variant itself (in terms of the chromosome, location, ref, and alt), each
subsequent “HP:XXXXXXX” column gives the predictions from the model built for
that HPO term. A counterpart to this is `confusion-matrices.csv`, which
provides confusion matrices for each prediction, using its score as a
threshold applied to the OOB predictions.
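To illustrate how `all-predictions.csv` might be consumed downstream, the
sketch below ranks variants by their highest per-HPO prediction. The column
names in the miniature CSV are hypothetical — check the actual header of your
output — and only the “HP:” column-prefix convention is taken from this README:

```python
import csv
import io

# Hypothetical miniature of an all-predictions.csv (column names assumed).
example = """gene_id,gene_name,chromosome,location,ref,alt,HP:0000002,HP:0000003
ENSG000001,GENE1,chr1,12345,A,G,0.91,0.40
ENSG000002,GENE2,chr2,67890,C,T,0.15,0.72
"""

rows = list(csv.DictReader(io.StringIO(example)))
for row in rows:
    # Best prediction across all per-HPO model columns for this variant.
    row["best"] = max(float(v) for k, v in row.items() if k.startswith("HP:"))

# Highest-scoring variants first.
rows.sort(key=lambda r: r["best"], reverse=True)
for row in rows:
    print(row["gene_name"], row["best"])
```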
Each model used has an associated self-evaluation score from the bootstrapped
test/train process; these scores are given in `model-scores.csv`. The variable
importances for each model are found in `variable-importances.csv`. Each model
is labelled according to its HPO id; descriptions for the HPO ids used can be
found in `hpo-descriptions.csv`.
The score-weighted average prediction across models with a score of at least
0.4 is given in `wmean-good-prediction.csv`.
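The weighted mean itself is straightforward: models scoring below 0.4 are
dropped, and each remaining model contributes in proportion to its score. A
minimal sketch of that calculation — function and variable names are mine, not
the container's:

```python
def wmean_good_prediction(predictions, scores, threshold=0.4):
    """Score-weighted average of per-model predictions.

    Only models with a self-evaluation score of at least `threshold`
    contribute, each weighted by its score.
    """
    kept = [(p, s) for p, s in zip(predictions, scores) if s >= threshold]
    if not kept:
        return None  # no model cleared the threshold
    total = sum(s for _, s in kept)
    return sum(p * s for p, s in kept) / total

# Three models; the one scoring 0.2 is excluded by the 0.4 threshold.
print(wmean_good_prediction([0.9, 0.1, 0.5], [0.8, 0.2, 0.4]))  # ≈ 0.767
```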
`top-50-preds.png` is a heatmap of all predictions for the top 50 variants
overall, while `all-preds.png` is the heatmap for everything.
A small sample VCF is bundled with this container for testing, and can be
fetched by running the container with the argument `--sample`. It contains
three known pathogenic variants and 5000 benign variants.