The main script wrath.sh
runs the main version of Wrath. The pipeline calculates barcode sharing between windows of a chosen size in a given chromosome and produces heatmap plots of barcode sharing. The main pipeline can also automatically detect large SVs. Default window size is 50kbp.
To run this scripts barcodes need to be encoded in the BX tag beforehand. To do that you can use the barcode_parsing utility.
A typical command looks like:
wrath.sh -g reference_genome.fa -c chromosome_name -w 50000 -s list_of_bam_files.txt -t 15
Input options are:
wrath: wrapped analysis of tagged haplotypes
DESCRIPTION:
Program produces a jaccard matrix camparing the barcode content between all pairs windows whithin a chromosome.
wrath.sh [-h] [-g FASTAFILE] [-c CHROMOSOMENAME] [-w WINDOWSIZE] [-s FILELIST] [-t THERADS] [-p] [-v] [-x STEP] [-l]
OPTIONS:
-h show this help text
-g FASTAFILE reference genome
-c CHROMOSOMENAME chromosome
-w WINDOWSIZE window size. Default 50kbp
-s FILELIST list of bam files with paths of the individuals of the population/phenotype of interest
-t THERADS threads to use. Default 1.
-p skip plotting the heatmap
-x STEP start from a given step. Note that this only works if filenames match those expected by wrath. Possible step options are: makewindows, getbarcodes, matrix, outliers or plot
-l automatic detection of SVs
-v verbose (only for the matrix generating step)
Command line programs:
Python:
pip install -U numpy seaborn matplotlib pandas scikit-learn
R:
In R, run
# The easiest way to get ggplot2, tidyr and dplyr is to install the whole tidyverse:
install.packages("tidyverse")
#and then nlraa
install.packages("nlraa")
The input necessary is a reference genome in fasta format and a list of the sample bam files that need to be analysed including their paths.
Like:
/home/samples/bams/group1/sample1.bam
/home/samples/bams/group1/sample2.bam
/home/samples/bams/group2/sample3.bam
/home/samples/bams/group2/sample4.bam
WRATH will create a directory with this structure (when running with default options):
wrath_out/
├── beds
├── matrices
├── outliers
├── plots
└── SVs
Table of automatically detected SVs for a given chromosome (in csv format). Columns include: SV id, chromosome name, start position, end position and SV length.
SV_id,chromsome,start,end,length
5,Herato1910,10000,3170000,3160000
15,Herato1910,540000,2160000,1620000
13,Herato1910,260000,1110000,850000
19,Herato1910,90000,620000,530000
16,Herato1910,1600000,2130000,530000
6,Herato1910,2530000,2940000,410000
12,Herato1910,940000,1090000,150000
2,Herato1910,600000,750000,150000
0,Herato1910,4820000,4910000,90000
Heatmap plot of barcode sharing between windows of a given chromosome. If automatic detection of SVs is used (default option), the lower triangle of the plot will indicate the points where SVs have been detected, while the upper triangle will show barcode sharing between windows.
If automatic detection of SVs is disabled, upper and lower triangle will show the same data, barcode sharing between windows.
- Window beds: To calculate barcode sharing between windows, first Wrath splits the chromosome in n windows of size m. The coordinates of those windows are stored in a bed file in the directory beds.
- Barcode beds: Barcodes are extracted from the bam files and their leftmost mapping position stored in a gzipped bed file in beds. Tabix index are also created.
- Matrices: barcode sharing between pairs of windows is calculated and stored in an identity matrix of nxn dimensions. A jaccard index is calculated for each pair of windows:
Where J is the Jaccard distance, and A and B are windows 1 and and 2 respectively.
- Outliers: we calculate and store the distance of each comparison to the diagonal. Then using this distance and the jaccard index value of the comparison, we fit a double exponential decay model, such that:
The model is fit and 95% prediction bands are calculated from it such that:
Any points above or below the prediction bands are defined as outliers and stored in the outliers directory. Several values are stored for each outlier: row number, column number, jaccard distance, y estimate of the model, estimated error, 2.5 quantile, 97.5 quantile and a definition of whether it is an 'upper' or 'lower' outlier.
The easiest way to run Wrath on multiple chromosomes is to run in parallely. If running on a cluster and using a shceduling system such as SLURM, an array can be used to run a job for each chromosome. An example is found in example array.