Scripts used in "An introgressed gene causes meiotic drive in Neurospora sitophila" by Svedberg et al.
Reference: https://www.biorxiv.org/content/10.1101/2020.01.29.923946v1
This script will convert a GATK VariantsToTable file to an aligned fasta.
Usage:
variantTab2fasta.py --help
usage: variantTab2fasta.py [-h] [-g GENOME] [-n] [-c CONTIGS] [-r REFERENCE]
[-o OUTPUT] [-s SPLITTER]
filename
Script that converts a GATK VariantsToTable file to an aligned fasta.
positional arguments:
filename Variant table file name
optional arguments:
-h, --help show this help message and exit
-g GENOME, --genome GENOME
Genome fasta file.
-n, --nomissing Outputs sites missing in the VCF file as dashes (-).
-c CONTIGS, --contigs CONTIGS
List of contigs to export in quotations.
-r REFERENCE, --reference REFERENCE
Includew reference genome. Reference strain name as
parameter.
-o OUTPUT, --output OUTPUT
Output file name.
-s SPLITTER, --splitter SPLITTER
Character that splits out strain name. Default=.
This script performs Fisher's exact test on each site in order to perform a genome wide association test. You need three VCF files with genome wide SNP data of all samples, group 1 and group 2. If you have a single VCF file at the start, the workflow is as follows:
Create an allele counts file from the original VCF file. Then create text files containing the names of all individuals that belong to each group and use thes files to create allele counts files with only these individuals.
vcftools --vcf all_individuals.vcf --counts --out all_individuals
vcftools --vcf all_individuals.vcf --counts --keep group1.txt --out only_group1
vcftools --vcf all_individuals.vcf --counts --keep group2.txt --out only_group2
Then merge all three VCF files column wise using paste:
paste all_individuals.frq.counts only_group1.frq.count only_group2.frq.count > all.counts
Then extract biallelic sites:
awk '$3 == 2' all.counts > all.counts.biallelic
Then use fisherVCFcounts.py
to perform Fisher's exact test on each site.
fisherVCFcounts.py all.counts.biallelic > all.counts.biallelic.fisher
You can also sort this file based on p-value to easily find outliers:
sort -k25g all.counts.biallelic.fisher > all.counts.biallelic.fisher.sorted