/rad-gatk

Tools to obtain locus-based statistics from RADseq data analyzed with GATK

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

rad-gatk

Tools to obtain locus-based statistics from RADseq data analyzed with GATK

Brief description

GATK was designed to map whole genome shotgun reads (randomly fragmented DNA) to a reference genome in order to discover and call variants. However, reads from a RAD experiment can also be used with some modifications in the GATK preliminary steps. Most notably, we must skip the "Eliminate duplicates" step since we would loose most of the coverage per locus.

GATK outputs VCF files where only data about individual SNPs are reported without reference to each specific RAD locus, these scripts attempt to recover information based on each particular RAD locus based on the genomic coordinates of the SNPs determining if adjacent SNPs belong to a single RAD locus or not.

Usage

Each script has few options, for example the minimum read depth to consider a locus as valid, or the mean locus length of your RAD loci (e.g. if you sequenced PE 2x150 you should probably have a mean locus length of ~300bp), to access help on each parameter just type python gatk-rad-loci-stats-s3.py -h :

usage: gatk-rad-loci-stats-s3.py [-h] -i FILENAME [-d MINDEPTH] [-l MEANLOCUSLEN]

Get RAD loci statistics from GATK (within sample = (i)pyrad's step 3)

optional arguments:
  -h, --help            show this help message and exit
  -i FILENAME, --input FILENAME
                        Name of input file (the VCF matrix grom GATK within a
                        sample)
  -d MINDEPTH, --min-locus-depth MINDEPTH
                        Minimum number of reads in a RAD locus, default=6
  -l MEANLOCUSLEN, --mean-locus-length MEANLOCUSLEN
                        Mean RAD locus length in basepairs, default=270

or python gatk-rad-loci-stats-s7.py -h:

usage: gatk-rad-loci-stats-s7.py [-h] -i FILENAME [-d MINDEPTH] [-s MINSAMPLE]
                          [-l MEANLOCUSLEN]

Get loci statistics from GATK (across samples = (i)pyrad's step 7)

optional arguments:
  -h, --help            show this help message and exit
  -i FILENAME, --input FILENAME
                        Name of input file (the VCF matrix grom GATK across
                        samples)
  -d MINDEPTH, --min-locus-depth MINDEPTH
                        Minimum number of reads in a RAD locus, default=6
  -s MINSAMPLE, --min-samples MINSAMPLE
                        Minimum number of samples in a locus, default=4
  -l MEANLOCUSLEN, --mean-locus-length MEANLOCUSLEN
                        Mean RAD locus length in basepairs, default=270

Examples

Example 1: If you want to obtain RAD locus information from sample_1 for loci covered at least 10x and assuming a mean locus length of 300bp:

python gatk-rad-loci-stats-s3.py -i sample_1.vcf.table -d 10 -l 300

Example 2: If you want statistics per RAD locus across samples, for loci represented by at least 30 samples, and assuming a mean locus length of 250bp:

python gatk-rad-loci-stats-s7.py -i across_samples.vcf.table -s 30 -l 250

Credits