danbing-tk
A toolkit for variable number tandem repeats (VNTRs) analysis, which enables:
- building repeat-pan genome graphs (RPGG) for a set of "genotypable" VNTRs given haplotype-resolved assemblies and matched short-read seqeuncing (SRS) data (
danbing-tk build
), - genotyping each VNTR as a set of (k-mer, count) given SRS data (
danbing-tk align
), and - predicting VNTR length from the genotype (
danbing-tk predict
) given locus-specific sampling bias (LSB).
See online manuscript for details.
Download Releases
Each release comes with the lastest version of genotypable VNTRs, RPGG (and/or) precomputed LSB.
File | Input of | Output of | |
---|---|---|---|
Genotypable VNTRs | tr.good.bed | danbing-tk build | |
Control regions | ctrl.bed | danbing-tk build | |
RPGG | pan.(tr.kmers|kmerDBi.umap|kmerDBi.vv|graph.umap) | danbing-tk align | danbing-tk build |
Precomputed LSB | LSB.tsv | danbing-tk predict |
- Release v1.3: Updated RPGG built from 35 HGSVC genomes.
- Release v1.0: VNTR summary statistics and eGene discoveries are also included. Example analyses such as differential length/motif analysis, eQTL mapping, VNTR locus QC, sample QC are also included.
Building on Linux
git clone https://github.com/ChaissonLab/danbing-tk
cd danbing-tk && make
danbing-tk align
Decompress the RPGG RPGG.tar.gz
in your working directory.
An example usage to genotype SRS sample using the RPGG:
samtools fasta -@2 -n $SRS.bam |
/$PREFIX/danbing-tk/bin/danbing-tk -gc 80 -ae -kf 4 1 -cth 45 -o $OUT_PREF -k 21 -qs pan -fai /dev/stdin -p $THREADS | gzip >$OUT_PREF.aln.gz
danbing-tk align
takes ~12 cpu hours to genotype a 30x SRS sample. This will generate $OUT_PREF.tr.kmers
and $OUT_PREF.aln.gz
output with format specified in File Format.
Important note: If outputs of danbing-tk align
are intended to be used directly for downstream analyses e.g. association tests, please check the distribution of LSB section below before running.
danbing-tk build
Install Dependencies
For users intended to use danbing-tk align
only, this step is not required.
The danbing-tk build
pipeline and danbing-tk predict
require several external packages. It is recommended to install all requirements using conda as follows:
conda install -c conda-forge -c bioconda -c intel \
snakemake=5.11.2 samtools=1.10 bedtools=2.29.2 minimap2=2.17 \
scikit-learn=0.23.1 statsmodels=0.12.1 pysam=0.15.3
If the requirements are in conflict with existing packages, create a new environment specifically for danbing-tk with:
conda create -n $MY_ENVIRONMENT -c conda-forge -c bioconda -c intel \
snakemake=5.11.2 samtools=1.10 bedtools=2.29.2 minimap2=2.17 \
scikit-learn=0.23.1 statsmodels=0.12.1 pysam=0.15.3
To check if everything is configured properly:
- Go to
/$PREFIX/danbing-tk/test/
- Replace
$PREFIX
ingoodPanGenomeGraph.json
andinput/genome.bam.tsv
with the path to danbing-tk - Run
snakemake -p -s ../pipeline/GoodPanGenomeGraph.snakefile -j 4 --rerun-incomplete --output-wait 3
Tested on v1.0.
Running danbing-tk build
-
Required inputs:
- haplotype-resolved assemblies (FASTA)
- matched SRS data (BAM; optional)
- reference genome (major chromosomes only without minor contigs)
- tandem repeat regions (BED; available on release page or user-defined)
-
Copy
/$PREFIX/danbing-tk/pipeline/goodPanGenomeGraph.json
to your working directory and edit accordingly. -
A config file
/$INPUT_DIR/genome.bam.tsv
with two columns, one for genome name and one for bam file path, is required if SRS data is available for graph pruning, e.g.HG00514 /panfs/qcb-panasas/tsungyul/HG00514/HG00514.IL.srt.bam
Otherwise, set
pruning
ingoodPanGenomeGraph.json
toFalse
and use a single column input forgenome.bam.tsv
. -
Run the snakemake pipline with:
snakemake -p -s /$PREFIX/danbing-tk/pipeline/GoodPanGenomeGraph.snakefile -j 40\
--cluster "{params.copts} -c {resources.cores} --mem={resources.mem}G -k" \
--rerun-incomplete --restart-times 1 --output-wait 30
Submitting jobs to cluster is preferred as danbing-tk build
is compute-intensive, ~1200 cpu hours for the original dataset. Otherwise, remove --cluster
and its parameters to run jobs locally.
danbing-tk predict
Locus-specific sampling biases (LSB) at VNTR regions are critical for normalizing the sum of k-mer counts to VNTR length. We provided precomputed LSB at the VNTR regions for fast comparison, however this assumes the LSB of the dataset of interest is close enough to the dataset in the original paper. Please ensure this assumption is valid by running a joint PCA on the LSB of non-repetitive regions with the original dataset, provided in LSB.tsv
. If this assumption failed, leave-one-out analysis (next section) on the dataset of interest is necessary to make accurate predictions. The following usage is for when the assumption holds.
Run getCovByLocus.397.sh
on your SRS dataset.
Run length prediction with:
/$PREFIX/danbing-tk/script/kmc2length.py --outdir OUTDIR --ksize KSIZE --kmers KMERS --trbed
TRBED --LSB LSB --cov COV --covbed COVBED
Length estimates are written to estimated_TR_len.tsv
.
Analysis
Distribution of LSB
To ensure technical variations are consistent between samples and do not introduce bias to VNTR genotypes, or k-mer counts, in a subset of samples. It is necessary to ensure the distribution of LSB shows up in one cluster. An example analysis is shown in LSB_analysis.ipynb
on the release page and script/getCovByLocus.397.sh
.
Leave-one-out analysis
To evaluate the quality of custom RPGG on matching SRS dataset, copy /$PREFIX/danbing-tk/pipeline/leaveOneOut.snakefile
to your working directory and edit accordingly.
Run the snakemake pipleine with:
snakemake -p -s /$PREFIX/pipeline/LeaveOneOut.snakefile -j 40 --cluster \
"{params.copts} -c {resources.cores} --mem={resources.mem}G -k" \
--rerun-incomplete --restart-times 1 --output-wait 30
Submitting jobs to cluster is preferred as danbing-tk build
is compute-intensive; otherwise, remove --cluster
and its parameters to run jobs locally.
File Format
- *.graph.kmers
>locus i
kmer0 out_edges0
kmer1 out_edges1
...
>locus i+1
...
out_edges
denotes the presence of T/G/C/A as the next nucleotide encoded with 4 bits.
- *.(tr|ntr).kmers
>locus i
kmer0 kmer_count0
kmer1 kmer_count1
...
>locus i+1
...
The second field is optional.
- Alignment output (
-a
option)- Synopsis
<src> <dest> <read_name/0> <read_seq/0> <ops/0> <read_name/1> <read_seq/1> <ops/1>
src
: source locus of a read pair (for simulation only)dest
: aligned locus for the read pairops
: operations to align the read to the graph=
: a match in the repeat.
: a match in the flank[A|C|G|T]
: a mismatch; letter in the graph is shown[0|1|2|3]
: a deletion; letter in the graph is shown as 0123 for ACGT, respectively.I
: an insertion in the readH
: a nucleotide in the homopolymer run*
: unalinged nucleotide
- Synopsis