
A framework to score and analyze variant effects genome-wide using ChromBPNet models

Primary language: Python. License: MIT.

variant-scorer

The variant scoring repository provides a set of scripts for scoring genetic variants using a ChromBPNet model.


1. variant_scoring.py

This script takes a list of variants in various input formats and generates scores for the variants using a ChromBPNet model. The output is a TSV file containing the scores for each variant.

Usage:

python variant_scoring.py -l [VARIANTS_FILE] -g [GENOME_FASTA] -m [MODEL_PATH] -o [OUT_PREFIX] -s [CHROM_SIZES] [OTHER_ARGS]

Input arguments:


-l or --list: (required) a TSV file containing a list of variants to score

-g or --genome: (required) a genome fasta file

-pg or --peak_genome: a genome fasta file for peaks

-m or --model: (required) the ChromBPNet model to use for variant scoring. For most use cases, this should be the bias-corrected model (chrombpnet_nobias.h5)

-o or --out_prefix: (required) the path prefix under which the script stores its SNP effect score predictions. The directory should already exist

-s or --chrom_sizes: (required) the path to a TSV file with chromosome sizes

-ps or --peak_chrom_sizes: the path to a TSV file with chromosome sizes for the peak genome

-dm or --debug_mode: subsample 10000 variants for debugging

-bs or --batch_size: the batch size to use for the model. Default is 512

-sc or --schema: the format for the input variants list. Choices are: 'bed', 'plink', 'chrombpnet', 'original'. Default is 'chrombpnet'

-p or --peaks: a bed file containing peak regions

-n or --num_shuf: the number of shuffled scores per SNP. Default is 10

-t or --total_shuf: the total number of shuffled scores across all SNPs. Overrides --num_shuf

-c or --chrom: only score SNPs in the selected chromosome

-r or --random_seed: the random seed for reproducibility when sampling. Default is 1234

--no_hdf5: do not save detailed predictions in hdf5 file

-fo or --forward_only: run variant scoring only on forward sequence

-st or --shap_type: the type of SHAP values to compute. Default is "counts"
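
As a rough illustration of how the required flags fit together, the sketch below assembles a command line from Python. The helper name `build_scoring_command` and all file names are hypothetical placeholders, not part of the repository; only the flags themselves come from the list above.

```python
import shlex


def build_scoring_command(variants, genome, model, out_prefix, chrom_sizes,
                          schema="chrombpnet", batch_size=512):
    """Assemble an argument list for variant_scoring.py (hypothetical helper)."""
    return [
        "python", "variant_scoring.py",
        "-l", variants,          # TSV list of variants
        "-g", genome,            # genome fasta
        "-m", model,             # ChromBPNet model
        "-o", out_prefix,        # output prefix; its directory must already exist
        "-s", chrom_sizes,       # chromosome sizes TSV
        "-sc", schema,           # one of: bed, plink, chrombpnet, original
        "-bs", str(batch_size),  # documented default is 512
    ]


# Placeholder paths, for illustration only.
cmd = build_scoring_command(
    "variants.tsv", "genome.fa", "chrombpnet_nobias.h5",
    "out/sample", "genome.chrom.sizes",
)
print(shlex.join(cmd))
# To actually launch the script: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.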

Supported Variant List Schemas:

  • chrombpnet : ['chr', 'pos', 'allele1', 'allele2', 'variant_id']
  • bed : ['chr', 'pos', 'end', 'allele1', 'allele2', 'variant_id']
  • plink : ['chr', 'variant_id', 'ignore1', 'pos', 'allele1', 'allele2']
  • original : ['chr', 'pos', 'variant_id', 'allele1', 'allele2']
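
To make the column orders concrete, here is a minimal sketch of a reader that maps each row of a variant list onto the schema's column names. It assumes a tab-separated file without a header row, which the text above does not state; `read_variants` is a hypothetical helper rather than the repository's own loader, and the example variant is made up.

```python
import csv
import io

# Column orders copied from the schema list above.
SCHEMAS = {
    "chrombpnet": ["chr", "pos", "allele1", "allele2", "variant_id"],
    "bed": ["chr", "pos", "end", "allele1", "allele2", "variant_id"],
    "plink": ["chr", "variant_id", "ignore1", "pos", "allele1", "allele2"],
    "original": ["chr", "pos", "variant_id", "allele1", "allele2"],
}


def read_variants(handle, schema="chrombpnet"):
    """Parse tab-separated rows into dicts keyed by the schema's column names."""
    columns = SCHEMAS[schema]
    variants = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) < len(columns):
            raise ValueError(f"expected {len(columns)} columns, got {len(row)}")
        record = dict(zip(columns, row))
        record["pos"] = int(record["pos"])
        variants.append(record)
    return variants


# Made-up example row in the 'chrombpnet' schema.
example = io.StringIO("chr1\t12345\tA\tG\tvar1\n")
variants = read_variants(example, "chrombpnet")
print(variants)
```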

2. variant_summary_across_folds.py

This script takes the variant scores generated by variant_scoring.py for each fold and generates a TSV file with the mean of each score type across folds.

Usage:

python variant_summary_across_folds.py -sd [VARIANT_SCORE_DIR] -sl [SCORE_LIST] -o [OUT_PREFIX] -sc [SCHEMA]

Input arguments:


-sd or --score_dir: (required) the path to the directory with the variant scores that will be used to generate the summary

-sl or --score_list: (required) the names of the variant score files that will be used to generate the summary

-o or --out_prefix: (required) the path prefix for storing the summary file with average scores across folds. The directory should already exist

-sc or --schema: the format for the input variants list. Choices are: 'bed', 'plink', 'chrombpnet', 'original'. Default is 'chrombpnet'
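
The averaging step can be pictured with the short sketch below. It is not the script's implementation: the helper `mean_across_folds`, the in-memory layout (one dict per fold), and the score names `score_a` and `score_b` are all hypothetical stand-ins for the per-fold score files.

```python
from collections import defaultdict
from statistics import mean


def mean_across_folds(fold_scores):
    """Average every score type per variant.

    fold_scores: list with one dict per fold, shaped
                 {variant_id: {score_name: value}}.
    """
    collected = defaultdict(lambda: defaultdict(list))
    for fold in fold_scores:
        for variant_id, scores in fold.items():
            for name, value in scores.items():
                collected[variant_id][name].append(value)
    return {
        variant_id: {name: mean(values) for name, values in by_name.items()}
        for variant_id, by_name in collected.items()
    }


# Two made-up folds scoring the same variant.
folds = [
    {"var1": {"score_a": 0.5, "score_b": 1.0}},
    {"var1": {"score_a": 1.5, "score_b": 3.0}},
]
summary = mean_across_folds(folds)
print(summary)
```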


3. variant_annotation.py

This script takes a list of variants and annotates each one with its closest genes and any peaks it overlaps.

NOTE: This script assumes that the peaks and genes are in the same reference genome as the variants, and it does not perform any liftover operations.

Usage:

python variant_annotation.py -l [VARIANTS_FILE] -o [OUT_PREFIX] -p [PEAKS] -g [GENES] -sc [SCHEMA]

Input arguments:


-l or --list: (required) a TSV file containing a list of variants to annotate

-o or --out_prefix: (required) the path prefix for storing the annotated file. The directory should already exist

-p or --peaks: (required) a bed file containing peak regions

-g or --genes: (required) a bed file with gene coordinates

-sc or --schema: the format for the input variants list. Choices are: 'bed', 'plink', 'chrombpnet', 'original'. Default is 'chrombpnet'
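
The two annotations can be sketched as follows. This is an illustration only, not the script's code: it assumes 1-indexed variant positions, peak and gene intervals in the standard BED convention (0-based start, exclusive end), and a single closest gene measured to the nearest gene boundary. The function names and all coordinates are hypothetical.

```python
def overlaps_peak(chrom, pos, peaks):
    """True if the 1-indexed position falls inside any (chrom, start, end) peak."""
    # A BED interval [start, end) covers 1-indexed positions start+1 .. end.
    return any(c == chrom and start < pos <= end for c, start, end in peaks)


def closest_gene(chrom, pos, genes):
    """Return (gene_name, distance) for the nearest gene on the same chromosome."""
    best_name, best_dist = None, None
    for c, start, end, name in genes:
        if c != chrom:
            continue
        if start < pos <= end:
            dist = 0                 # variant lies inside the gene
        elif pos <= start:
            dist = start + 1 - pos   # gene begins downstream of the variant
        else:
            dist = pos - end         # gene ends upstream of the variant
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist


# Made-up intervals for illustration.
peaks = [("chr1", 100, 200)]
genes = [("chr1", 1000, 2000, "GENE_A"), ("chr1", 300, 400, "GENE_B")]
in_peak = overlaps_peak("chr1", 150, peaks)
nearest = closest_gene("chr1", 150, genes)
print(in_peak, nearest)
```

A linear scan like this is fine for small inputs; genome-wide annotation would normally rely on an interval index or a dedicated tool.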


Note: the pos (position) column holds the 1-indexed SNP position, unless the schema is bed
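
A small sketch of what this distinction means in practice, assuming that the bed schema follows the standard BED convention in which pos is a 0-based start coordinate (the note itself only says that bed is the exception). The helper `to_one_based` is hypothetical.

```python
def to_one_based(pos, schema):
    """Return a 1-indexed position for any schema.

    Assumption: in the 'bed' schema, pos is a 0-based start coordinate,
    as in standard BED files; every other schema is already 1-indexed.
    """
    return pos + 1 if schema == "bed" else pos


# The same SNP written in two schemas maps to one 1-indexed position.
from_bed = to_one_based(99, "bed")
from_chrombpnet = to_one_based(100, "chrombpnet")
print(from_bed, from_chrombpnet)
```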