This repository contains details on the data processing and analysis steps used to study the role of sexual selection in speciation through 1) the characterization of the genetic architecture of sexual signal traits in barn swallows (Hirundo rustica), 2) tests of divergent sexual selection in geographic isolation, 3) tests of selection against gene flow at trait loci in secondary contact, and 4) tests that selection has maintained associations between trait loci, known as genetic coupling, leading to reproductive isolation due to sexual selection. This workflow is a companion to the methods described in Schield et al. (in review).
Lists and reference files can be found in various analysis directories, as can Shell and Python scripts used in elements of the analysis workflow. R scripts are in the R
directory. Note that you may need to adjust the organization of path/file locations to suit your environment. This workflow assumes that you return to the main working directory ~/hirundo_speciation_genomics
after completing major analysis sections. Adjust the path before the main working directory as necessary. Walkthrough of software installations will take place in ~/hirundo_speciation_genomics/tmp/
.
Feel free to contact me at drew.schield[at]colorado.edu with any questions.
- Software and dependencies
- Genome data processing and variant calling
- Mapping statistics
- PCA
- ADMIXTURE
- Hybrid index and heterozygosity
- Demographic inference
- Genotype-phenotype associations
- Recombination rate
- Population genetic diversity and differentiation
- Population branch statistics
- Tajima's D
- Haplotype statistics
- Geographic cline analysis
- Genomic cline analysis
- Linkage disequilibrium: tests of genetic coupling
- Linkage disequilibrium: decay
- Appendix
The analysis sections below use the following software and dependencies and assume they are on the user path unless otherwise specified:
- Trimmomatic
- bwa
- htslib
- Samtools
- bgzip
- tabix
- GATK
- Picard
- bcftools
- Plink
- vcftools
- bedtools
- SNPRelate
- ADMIXTURE
- Introgress
- dadi
- easySFS
- GEMMA
- Beagle
- SMC++
- pyrho
- pixy
- VCF-kit
- SHAPEIT
- rehh
- HZAR
- bgc
- MashMap
- Repeatmasker
- R
Note, I installed a number of these programs to my conda environment, or installed via a virtual environment (details below).
This section includes details on steps to filter raw fastq data, map reads to the reference genome, call variants among samples, and filter variant calls. Raw read data are available from NCBI SRA, and should be placed in the fastq
subdirectory below. Additional information about the reference genome and annotation can be found in the Appendix.
Note: scripts used in this section are in the scripts
directory (for tidiness) and will need to be moved to the working directory prior to running this part of the workflow.
We'll set up directories for read data, mapping data, and variant calls. The log
subdirectory will contain sample lists, log files, etc.
mkdir fastq
mkdir fastq_filtered
mkdir bam
mkdir gvcf
mkdir vcf
mkdir log
We'll use Trimmomatic
to filter and quality trim raw read data, using the settings:
- Remove 5' end bases if quality is below 20
- Remove 3' end bases if quality is below 20
- Minimum read length = 32
- Remove reads if average quality is < 30
We'll process the input data slightly differently for data generated in 2018 and 2021, based on some slight differences in naming conventions between sequencing runs.
Format ./log/trimmomatic.2018.list
and ./log/trimmomatic.2021.list
.
- Run
runTrimmomatic.2018.sh
to process 2018 fastq data.
sh runTrimmomatic.2018.sh > ./log/runTrimmomatic.2018.log
- Run
runTrimmomatic.2021.sh
to process 2021. fastq data.
sh runTrimmomatic.2021.sh > runTrimmomatic.2021.log
- Run
trimmomatic
on outgroup H. smithii individual.
trimmomatic PE -phred33 -threads 16 ./fastq/RS_5_USPD16098865-AK1579-AK402_HL7CWDSXX_L1_1.fq.gz ./fastq/RS_5_USPD16098865-AK1579-AK402_HL7CWDSXX_L1_2.fq.gz ./fastq_filtered/RS_5_USPD16098865-AK1579-AK402_HL7CWDSXX_L1_1_P.trim.fq.gz ./fastq_filtered/RS_5_USPD16098865-AK1579-AK402_HL7CWDSXX_L1_1_U.trim.fq.gz ./fastq_filtered/RS_5_USPD16098865-AK1579-AK402_HL7CWDSXX_L1_2_P.trim.fq.gz ./fastq_filtered/RS_5_USPD16098865-AK1579-AK402_HL7CWDSXX_L1_2_U.trim.fq.gz LEADING:20 TRAILING:20 MINLEN:32 AVGQUAL:30
We won't map these.
cd fastq_filtered
rm ./*U.trim.fq.gz
cd ..
Analysis-read paired read data are in fastq_filtered
.
We'll run BWA mem
on each sample to map it to the reference genome.
-
Format
./log/bwa_mem.list
with all ingroup samples. -
The script
runBWAmem.sh
runs BWAmem
and also indexes usingsamtools
.
sh runBWAmem.sh > ./log/runBWAmem.log
bwa mem -t 16 -R "@RG\tID:RS_5\tLB:Hirundo\tPL:illumina\tPU:NovaSeq6000\tSM:RS_5" Hirundo_rustica_bHirRus1.final.fasta ./fastq_filtered/RS_5_USPD16098865-AK1579-AK402_HL7CWDSXX_L1_1_P.trim.fq.gz ./fastq_filtered/RS_5_USPD16098865-AK1579-AK402_HL7CWDSXX_L1_2_P.trim.fq.gz | samtools sort -@ 16 -O bam -T RS_5.temp -o ./bam/RS_5.bam -
samtools index ./bam/RS_5.bam
Analysis-ready bam files are in ./bam
.
We'll call individual variants using HaplotypeCaller
and cohort variants using GenotypeGVCFs
.
Run runGATKhaplotypecaller.sh to run HaplotypeCaller
on samples in ./log/GATKhaplotypecaller.list
.
sh runGATKhaplotypecaller.sh ./log/GATKhaplotypecaller.list > ./log/runGATKhaplotypecaller.log
The output genomic VCFs for each individual are in ./gvcf/
.
Note, this takes a small eternity to run on all samples in sequence. We can split out the job to parallel processes by breaking up the list of samples and running the script on each list separately (constrained by CPU threads).
-
Format
./log/sample.hrustica+smithii.gvcf.list
, with the list of paths to all input gvcf files. -
Run
runTabix.sh
to index all input gvcf files.
sh tabix.sh ./log/sample.hrustica+smithii.gvcf.list > ./log/tabix.log
-
Format intervals files with lists of scaffolds to run parallel analyses on in
./log/scaffold.list1.intervals
, etc. (13 files total). These are split out based on the largest scaffold, which represents ~14% of the genome. -
Run GATK
GenotypeGVCFs
to call cohort variants on the scaffold intervals.
java -jar ./gatk-3.8-1-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ./Hirundo_rustica_bHirRus1.final.fasta -L ./log/scaffold.list1.intervals -V ./sample.hrustica+smithii.gvcf.list -allSites -o ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list1.vcf.gz > ./log/GenotypeGVCFs.hirundo_rustica+smithii.allsites.raw.scaffold.list1.vcf.log
java -jar ./gatk-3.8-1-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ./Hirundo_rustica_bHirRus1.final.fasta -L ./log/scaffold.list2.intervals -V ./sample.hrustica+smithii.gvcf.list -allSites -o ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list2.vcf.gz > ./log/GenotypeGVCFs.hirundo_rustica+smithii.allsites.raw.scaffold.list2.vcf.log
java -jar ./gatk-3.8-1-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ./Hirundo_rustica_bHirRus1.final.fasta -L ./log/scaffold.list3.intervals -V ./sample.hrustica+smithii.gvcf.list -allSites -o ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list3.vcf.gz > ./log/GenotypeGVCFs.hirundo_rustica+smithii.allsites.raw.scaffold.list3.vcf.log
java -jar ./gatk-3.8-1-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ./Hirundo_rustica_bHirRus1.final.fasta -L ./log/scaffold.list4.intervals -V ./sample.hrustica+smithii.gvcf.list -allSites -o ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list4.vcf.gz > ./log/GenotypeGVCFs.hirundo_rustica+smithii.allsites.raw.scaffold.list4.vcf.log
java -jar ./gatk-3.8-1-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ./Hirundo_rustica_bHirRus1.final.fasta -L ./log/scaffold.list5.intervals -V ./sample.hrustica+smithii.gvcf.list -allSites -o ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list5.vcf.gz > ./log/GenotypeGVCFs.hirundo_rustica+smithii.allsites.raw.scaffold.list5.vcf.log
java -jar ./gatk-3.8-1-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ./Hirundo_rustica_bHirRus1.final.fasta -L ./log/scaffold.list6.intervals -V ./sample.hrustica+smithii.gvcf.list -allSites -o ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list6.vcf.gz > ./log/GenotypeGVCFs.hirundo_rustica+smithii.allsites.raw.scaffold.list6.vcf.log
java -jar ./gatk-3.8-1-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ./Hirundo_rustica_bHirRus1.final.fasta -L ./log/scaffold.list7.intervals -V ./sample.hrustica+smithii.gvcf.list -allSites -o ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list7.vcf.gz > ./log/GenotypeGVCFs.hirundo_rustica+smithii.allsites.raw.scaffold.list7.vcf.log
java -jar ./gatk-3.8-1-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ./Hirundo_rustica_bHirRus1.final.fasta -L ./log/scaffold.list8.intervals -V ./sample.hrustica+smithii.gvcf.list -allSites -o ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list8.vcf.gz > ./log/GenotypeGVCFs.hirundo_rustica+smithii.allsites.raw.scaffold.list8.vcf.log
java -jar ./gatk-3.8-1-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ./Hirundo_rustica_bHirRus1.final.fasta -L ./log/scaffold.list9.intervals -V ./sample.hrustica+smithii.gvcf.list -allSites -o ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list9.vcf.gz > ./log/GenotypeGVCFs.hirundo_rustica+smithii.allsites.raw.scaffold.list9.vcf.log
java -jar ./gatk-3.8-1-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ./Hirundo_rustica_bHirRus1.final.fasta -L ./log/scaffold.list10.intervals -V ./sample.hrustica+smithii.gvcf.list -allSites -o ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list10.vcf.gz > ./log/GenotypeGVCFs.hirundo_rustica+smithii.allsites.raw.scaffold.list10.vcf.log
java -jar ./gatk-3.8-1-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ./Hirundo_rustica_bHirRus1.final.fasta -L ./log/scaffold.list11.intervals -V ./sample.hrustica+smithii.gvcf.list -allSites -o ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list11.vcf.gz > ./log/GenotypeGVCFs.hirundo_rustica+smithii.allsites.raw.scaffold.list11.vcf.log
java -jar ./gatk-3.8-1-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ./Hirundo_rustica_bHirRus1.final.fasta -L ./log/scaffold.list12.intervals -V ./sample.hrustica+smithii.gvcf.list -allSites -o ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list12.vcf.gz > ./log/GenotypeGVCFs.hirundo_rustica+smithii.allsites.raw.scaffold.list12.vcf.log
java -jar ./gatk-3.8-1-0/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ./Hirundo_rustica_bHirRus1.final.fasta -L ./log/scaffold.list13.intervals -V ./sample.hrustica+smithii.gvcf.list -allSites -o ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list13.vcf.gz > ./log/GenotypeGVCFs.hirundo_rustica+smithii.allsites.raw.scaffold.list13.vcf.log
- Check on running progress.
for i in ./log/GenotypeGVCFs.hirundo_rustica+smithii.allsites.raw.scaffold.list*.log; do echo $i; grep 'INFO' $i | tail -n 1; done
We'll filter to retain a set of high-quality SNP calls, while also maintaining an 'all-sites' VCF through most steps, enabling calculation of statistics requiring both variant and invariant genotypes. We'll also apply several downstream filters for specific analyses (e.g., filtering by minor allele frequency or thinning by physical distance, etc.).
We have raw variant calls in 13 separate interval files from the step above. We'll also set hard filters on genotypes in parallel to reduce runtime, then merge the VCFs.
- Run GATK
VariantFiltration
to set hard filters.
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list1.vcf.gz -filter "QD < 2.0" --filter-name "QD2" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.scaffold.list1.vcf.gz > ./log/GATK_VariantFiltration.scaffold.list1.log
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list2.vcf.gz -filter "QD < 2.0" --filter-name "QD2" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.scaffold.list2.vcf.gz > ./log/GATK_VariantFiltration.scaffold.list2.log
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list3.vcf.gz -filter "QD < 2.0" --filter-name "QD2" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.scaffold.list3.vcf.gz > ./log/GATK_VariantFiltration.scaffold.list3.log
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list4.vcf.gz -filter "QD < 2.0" --filter-name "QD2" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.scaffold.list4.vcf.gz > ./log/GATK_VariantFiltration.scaffold.list4.log
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list5.vcf.gz -filter "QD < 2.0" --filter-name "QD2" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.scaffold.list5.vcf.gz > ./log/GATK_VariantFiltration.scaffold.list5.log
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list6.vcf.gz -filter "QD < 2.0" --filter-name "QD2" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.scaffold.list6.vcf.gz > ./log/GATK_VariantFiltration.scaffold.list6.log
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list7.vcf.gz -filter "QD < 2.0" --filter-name "QD2" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.scaffold.list7.vcf.gz > ./log/GATK_VariantFiltration.scaffold.list7.log
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list8.vcf.gz -filter "QD < 2.0" --filter-name "QD2" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.scaffold.list8.vcf.gz > ./log/GATK_VariantFiltration.scaffold.list8.log
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list9.vcf.gz -filter "QD < 2.0" --filter-name "QD2" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.scaffold.list9.vcf.gz > ./log/GATK_VariantFiltration.scaffold.list9.log
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list10.vcf.gz -filter "QD < 2.0" --filter-name "QD2" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.scaffold.list10.vcf.gz > ./log/GATK_VariantFiltration.scaffold.list10.log
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list11.vcf.gz -filter "QD < 2.0" --filter-name "QD2" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.scaffold.list11.vcf.gz > ./log/GATK_VariantFiltration.scaffold.list11.log
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list12.vcf.gz -filter "QD < 2.0" --filter-name "QD2" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.scaffold.list12.vcf.gz > ./log/GATK_VariantFiltration.scaffold.list12.log
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list13.vcf.gz -filter "QD < 2.0" --filter-name "QD2" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.scaffold.list13.vcf.gz > ./log/GATK_VariantFiltration.scaffold.list13.log
-
Format
./log/vcf.HardFilter.list
to direct to files to be merged. -
Run Picard
MergeVCFs
to merge the VCF files.
java -jar picard.jar MergeVcfs -I ./log/vcf.HardFilter.list -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.vcf.gz
- Remove raw interval VCFs taking up disk space.
rm ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.list*
- Mask indels and hard filtered genotypes.
bcftools filter --threads 20 -e 'TYPE="indel" || FILTER="QD2" || FILTER="FS60" || FILTER="MQ40" || FILTER="MQRankSum-12.5" || FILTER="ReadPosRankSum-8"' --set-GTs . -O z -o ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.HardFilter.vcf.gz
bcftools index --threads 48 -t ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.vcf.gz
- Query a subset of the filtered VCF to quantify proportion of missing genotypes across samples.
bcftools view --threads 16 -r NC_053451.1 -m2 -M2 -U -v snps -O z -o ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.snps.chr1.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.vcf.gz > ./log/hirundo_rustica+smithii.allsites.HardFilter.recode.snps.chr1.log
vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.snps.chr1.vcf.gz --missing-indv --out ./log/hirundo_rustica+smithii.allsites.HardFilter.recode.snps.chr1
rm ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.snps.chr1.vcf.gz
Based on this analysis, samples HRUWBM90116 and HRUWBM87931 had > 75% missing data.
-
Format
./log/sample.remove.list
. -
Remove samples in list.
bcftools view --threads 16 -S ^./log/sample.remove.list -O z -o ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.vcf.gz
tabix -p vcf ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.vcf.gz
Females are hemizygous for the Z chromosome, and so cannot have heterozygous sites on Z-linked scaffolds. We'll extract SNPs for Z-linked scaffolds, then identify SNPs in which any known females have heterozygous genotypes, then mask these sites in all individuals. We also want to remove males from the W chromosome, as they don't have one, and also to mask heterozygous sites on the W in females, since these can't exist.
We'll begin by extracting VCF tables for chromosome-assigned autosome, Z, and W scaffolds, respectively. From here forward we'll also omit un-assigned scaffolds from analysis.
- Format autosome, Z chromosome, and W chromosome-assigned scaffold BED files.
./log/scaffold.assigned-autosome.bed
./log/scaffold.assigned-chrZ.bed
./log/scaffold.assigned-chrW.bed
- Extract VCFs based on BED files.
bcftools view --threads 16 -R ./log/scaffold.assigned-autosome.bed -O z -o ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.auto.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.vcf.gz
bcftools view --threads 16 -R ./log/scaffold.assigned-chrZ.bed -O z -o ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.vcf.gz
bcftools view --threads 16 -R ./log/scaffold.assigned-chrW.bed -O z -o ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.vcf.gz
- Rename and index autosome all-sites VCF.
mv ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.auto.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.auto.vcf.gz
tabix -p vcf ./vcf/hirundo_rustica+smithii.allsites.final.auto.vcf.gz
- Mask female heterozygous sites on the Z chromosome.
- Extract biallelic SNPs on Z-linked scaffolds.
bcftools view --threads 16 -m2 -M2 -U -v snps -O z -o ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.snps.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.vcf.gz > ./log/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.snps.chrZ.vcf.gz
-
Format
./log/sample.female.list
. -
Run
ZchrFemaleHeterozygous.py
to identify and write positions of the Z chromosome with heterozygous genotype calls in any females.
gunzip ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.snps.vcf.gz
python ZchrFemaleHeterozygous.py ./log/sample.female.list ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.snps.vcf ./log/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.snps.FemaleZhetSites.txt
- Convert output to BED format and index it as a GATK feature file.
awk 'BEGIN{OFS="\t"}{print $1,$2-1,$2}' ./log/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.snps.FemaleZhetSites.txt > ./log/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.snps.FemaleZhetSites.bed
./gatk-4.0.8.1/gatk IndexFeatureFile --feature-file ./log/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.snps.FemaleZhetSites.bed
- Run GATK
VariantFiltration
to annotate female heterozygous sites and mask withbcftools
.
tabix -p vcf ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.vcf.gz
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.vcf.gz --mask ./log/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.snps.FemaleZhetSites.bed --mask-name ZHET -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.filter.vcf.gz > ./log/GATK_VariantFiltration_hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.filter.log
bcftools filter --threads 16 -e 'FILTER="ZHET"' --set-GTs . -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrZ.filter.vcf.gz
tabix -p vcf ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.vcf.gz
- Mask female heterozygous sites on the W chromosome.
- Remove males from W chromosome VCF.
bcftools view --threads 16 -S ./log/sample.female.list -O z -o ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.female.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.vcf.gz
- Extract W-linked SNPs.
bcftools view --threads 16 -m2 -M2 -U -v snps -O v -o ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.female.snps.vcf ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.female.vcf.gz
- Identify heterozygous sites on the W chromosome.
python ZchrFemaleHeterozygous.py ./log/sample.female.list ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.female.snps.vcf ./log/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.female.snps.FemaleHetSites.txt
- Convert output to BED format and index it as a GATK feature file.
awk 'BEGIN{OFS="\t"}{print $1,$2-1,$2}' ./log/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.female.snps.FemaleHetSites.txt > ./log/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.female.snps.FemaleHetSites.bed
./gatk-4.0.8.1/gatk IndexFeatureFile --feature-file ./log/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.female.snps.FemaleHetSites.bed
- Run GATK
VariantFiltration
to annotate female heterozygous sites and mask withbcftools
.
tabix -p vcf ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.female.vcf.gz
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.female.vcf.gz --mask ./log/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.female.snps.FemaleHetSites.bed --mask-name WHET -O ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.female.filter.vcf.gz > ./log/GATK_VariantFiltration_hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.filter.log &
bcftools filter --threads 16 -e 'FILTER="WHET"' --set-GTs . -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.chrW.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.HardFilter.recode.indv.chrW.female.filter.vcf.gz &
tabix -p vcf ./vcf/hirundo_rustica+smithii.allsites.final.chrW.vcf.gz
The steps to annotate repeats using RepeatMaster are detailed in the Appendix.
- Index repeat BED file.
./gatk-4.0.8.1/gatk IndexFeatureFile --feature-file ./genome_annotation/repeatmasker/Hirundo_rustica_bHirRus1.final.fasta.repeat.bed
- Annotate repeats using GATK
VariantFiltration
.
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.final.auto.vcf.gz --mask ./genome_annotation/repeatmasker/Hirundo_rustica_bHirRus1.final.fasta.repeat.bed --mask-name REP -O ./vcf/hirundo_rustica+smithii.allsites.final.auto.tmp-rep.vcf.gz > ./log/GATK_VariantFiltration_repeat-mask.hirundo_rustica+smithii.auto.log
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.vcf.gz --mask ./genome_annotation/repeatmasker/Hirundo_rustica_bHirRus1.final.fasta.repeat.bed --mask-name REP -O ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.tmp-rep.vcf.gz > ./log/GATK_VariantFiltration_repeat-mask.hirundo_rustica+smithii.chrZ.log
./gatk-4.0.8.1/gatk VariantFiltration -V ./vcf/hirundo_rustica+smithii.allsites.final.chrW.vcf.gz --mask ./genome_annotation/repeatmasker/Hirundo_rustica_bHirRus1.final.fasta.repeat.bed --mask-name REP -O ./vcf/hirundo_rustica+smithii.allsites.final.chrW.tmp-rep.vcf.gz > ./log/GATK_VariantFiltration_repeat-mask.hirundo_rustica+smithii.chrW.log
- Recode repeats as missing genotypes and index VCFs.
bcftools filter --threads 24 -e 'FILTER="REP"' --set-GTs . -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.auto.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.auto.tmp-rep.vcf.gz
bcftools filter --threads 24 -e 'FILTER="REP"' --set-GTs . -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.tmp-rep.vcf.gz
bcftools filter --threads 24 -e 'FILTER="REP"' --set-GTs . -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.chrW.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.chrW.tmp-rep.vcf.gz
tabix -p vcf ./vcf/hirundo_rustica+smithii.allsites.final.auto.vcf.gz
tabix -p vcf ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.vcf.gz
tabix -p vcf ./vcf/hirundo_rustica+smithii.allsites.final.chrW.vcf.gz
- Parse chromosome-specific all-sites VCFs.
- Set up environment
cd vcf
mkdir chrom-specific
cd ..
-
Format chromosome-scaffold table for reference:
./log/chromosome-scaffold.table.txt
-
Run
parseVCFchrW.sh
to parse W chromosome scaffolds.
sh parseVCFchrW.sh
- Run
parseVCFchrZ.sh
to parse Z chromosome scaffolds.
sh parseVCFchrZ.sh
- Run
parseVCFauto.sh
to parse autosome scaffolds.
sh parseVCFauto.sh
- Extract biallelic SNPs.
bcftools view --threads 16 -m2 -M2 -U -v snps -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.auto.snps.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.auto.vcf.gz
bcftools view --threads 16 -m2 -M2 -U -v snps -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.snps.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.vcf.gz
bcftools view --threads 16 -m2 -M2 -U -v snps -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.chrW.snps.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.chrW.vcf.gz
- Extract SNPs with < 20% missing data.
bcftools view -i 'F_MISSING<0.2' -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.auto.snps.miss02.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.auto.snps.vcf.gz
bcftools view -i 'F_MISSING<0.2' -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.snps.miss02.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.snps.vcf.gz
bcftools view -i 'F_MISSING<0.2' -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.chrW.snps.miss02.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.chrW.snps.vcf.gz
- Impose minor allele frequency and minor allele count filters.
bcftools view --threads 16 -c 2 -q 0.05:minor -m2 -M2 -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.chrW.snps.miss02.maf05.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.chrW.snps.miss02.vcf.gz
bcftools view --threads 16 -c 2 -q 0.05:minor -m2 -M2 -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.snps.miss02.maf05.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.snps.miss02.vcf.gz
bcftools view --threads 16 -c 2 -q 0.05:minor -m2 -M2 -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.auto.snps.miss02.maf05.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.auto.snps.miss02.vcf.gz
- Remove H. smithii outgroup.
bcftools view --threads 16 -s ^RS_5 -c 2 -q 0.05:minor -m2 -M2 -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.chrW.snps.miss02.maf05.ingroup.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.chrW.snps.miss02.maf05.vcf.gz
bcftools view --threads 16 -s ^RS_5 -c 2 -q 0.05:minor -m2 -M2 -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.snps.miss02.maf05.ingroup.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.chrZ.snps.miss02.maf05.vcf.gz
bcftools view --threads 16 -s ^RS_5 -c 2 -q 0.05:minor -m2 -M2 -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.auto.snps.miss02.maf05.ingroup.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.auto.snps.miss02.maf05.vcf.gz
These filtering steps produce many intermediate VCF files, especially in the initial parallel processing steps for raw and hard filtered VCFs. Delete these to save some disk space.
rm ./vcf/hirundo_rustica+smithii.allsites.raw.scaffold.*
rm ./vcf/hirundo_rustica+smithii.allsites.HardFilter.scaffold.*
mkdir analysis
cd analysis
mkdir mapping_statistics
cd ..
Run samtoolsStat.sh
from the scripts
directory. The results are written to ./analysis/mapping_statistics
.
sh ./scripts/samtoolsStat.sh
for i in ./analysis/mapping_statistics/*.stat.txt; do name=`echo $i | cut -d. -f1`; bases=`grep 'bases mapped (cigar):' $i | cut -f 3`; echo -e "$name\t$bases"; done
We'll examine population genetic structure within barn swallows using the R package SNPRelate
.
cd ./analysis
mkdir pca
cd pca
We'll run PCA on autosomal and Z-linked SNPs after performing several filtering steps:
- Remove high missing data samples
- Remove mislabeled individuals from the sequencing step
- Retail SNPs with a minor allele frequency of >= 0.1
- Merge autosomal and Z-linked VCFs.
bcftools concat -O z -o hirundo_rustica+smithii.allsites.final.snps.miss02.maf05.ingroup.auto+chrZ.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto.snps.miss02.maf05.ingroup.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.chrZ.snps.miss02.maf05.ingroup.vcf.gz
- Remove samples, filter by minor allele frequency.
There were a small number of savignii and erythrogaster samples (6) that were mislabeled during sequencing library preparation, making it unclear which subspecies they belonged to. There was also an inflection point in the proportion of missing genotypes within samples at ~40%. We'll remove the mislabeled and high missing data samples from further analysis.
Format sample.remove.list
.
Remove samples in the list and filter to retain SNPs with minor allele frequency >= 0.1.
bcftools view --threads 16 -S ^sample.remove.list -c 2 -q 0.1:minor -m2 -M2 -O z -o hirundo_rustica+smithii.allsites.final.snps.miss02.maf1.ingroup.indv.auto+chrZ.vcf.gz hirundo_rustica+smithii.allsites.final.snps.miss02.maf05.ingroup.auto+chrZ.vcf.gz
- Sample a million SNPs at random for analysis.
bcftools query -f '%CHROM\t%POS\n' hirundo_rustica+smithii.allsites.final.snps.miss02.maf1.ingroup.indv.auto+chrZ.vcf.gz | shuf -n 1000000 > sites.subset.list
There were actually only 837,275 SNPs after applying the allele frequency filter. Plenty for analysis.
We'll run the PCA in R using the script ~/hirundo_speciation_genomics/R/pca.R
.
We'll estimate individual admixture proportions between one or more genetic clusters using admixture
.
cd ./analysis/
mkdir admixture
cd admixture
mkdir input
plink --vcf ../pca/hirundo_rustica+smithii.allsites.final.snps.miss02.maf1.ingroup.indv.auto+chrZ.vcf.gz --make-bed --out hirundo.auto+chrZ --allow-extra-chr --recode12
The program can't take non-integer scaffold names.
sed -i.bak -e 's/NC_0//g' ./input/hirundo.auto+chrZ.bim
sed -i.bak -e 's/NW_0//g' ./input/hirundo.auto+chrZ.bim
sed -i.bak -e 's/\.1//g' ./input/hirundo.auto+chrZ.bim
sed -i.bak -e 's/NC_0//g' ./input/hirundo.auto+chrZ.map
sed -i.bak -e 's/NW_0//g' ./input/hirundo.auto+chrZ.map
sed -i.bak -e 's/\.1//g' ./input/hirundo.auto+chrZ.map
Run runAdmixture.sh
to run ADMIXTURE across a series of K values 1-10.
sh runAdmixture.sh ./input/hirundo.auto+chrZ.ped
grep -h CV log*.out
We'll compare individual hybrid index with interspecific heterozygosity at ancestry-informative sites to characterize our current sampling of hybrid zone populations. Elsie Shogren kindly shared her R Markdown detailing her procedure for doing this analysis in her system (thank you, Elsie!).
cd ./analysis
mkdir introgress
cd introgress
mkdir fst
mkdir input
We'll use Fst to determine which ancestry-informative SNPs to include in analysis.
- Format parental, hybrid, and parental+hybrid population maps.
popmap.gutturalis
popmap.rustica
popmap.tytleri
popmap.rustica-tytleri
popmap.rustica-gutturalis
popmap.tytleri-gutturalis
popmap.rustica-hybrids-tytleri
popmap.rustica-hybrids-gutturalis
popmap.tytleri-hybrids-gutturalis
- Merge autosomal and Z chromosome VCFs for biallelic SNPs.
bcftools concat --threads 24 -O z -o ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto.snps.miss02.maf05.ingroup.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.chrZ.snps.miss02.maf05.ingroup.vcf.gz
- Calculate Fst between parental populations.
vcftools --gzvcf ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz --weir-fst-pop popmap.rustica --weir-fst-pop popmap.tytleri --out ./fst/fst_rustica-tytleri
vcftools --gzvcf ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz --weir-fst-pop popmap.rustica --weir-fst-pop popmap.gutturalis --out ./fst/fst_rustica-gutturalis
vcftools --gzvcf ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz --weir-fst-pop popmap.tytleri --weir-fst-pop popmap.gutturalis --out ./fst/fst_tytleri-gutturalis
Here, we'll look for SNPs with Fst >= 0.6 between parental populations. Barn swallows have very few fixed differences, as shown in previous studies (e.g., Safran et al. 2016; Scordato et al. 2017; Schield et al. 2021), but using high-Fst SNPs should provide sufficient information.
- Format lists of high-Fst SNPs.
tail -n+2 ./fst/fst_rustica-tytleri.weir.fst | awk 'BEGIN{OFS="\t"}{if ($3 >=0.6) print $1,$2}' > ./fst/anc-info.snps.rustica-tytleri.txt
tail -n+2 ./fst/fst_rustica-gutturalis.weir.fst | awk 'BEGIN{OFS="\t"}{if ($3 >=0.6) print $1,$2}' > ./fst/anc-info.snps.rustica-gutturalis.txt
tail -n+2 ./fst/fst_tytleri-gutturalis.weir.fst | awk 'BEGIN{OFS="\t"}{if ($3 >=0.6) print $1,$2}' > ./fst/anc-info.snps.tytleri-gutturalis.txt
- Remove SNPs with completely missing data in hybrids.
Pilot analyses revealed that there are a small number of SNPs with completely missing genotypes in hybrid tytleri-gutturalis, which throws an error in downstream introgress
analysis.
Identify which SNPs are causing the problem.
bcftools view -i 'F_MISSING=0' ./input/anc-info.snps.tytleri-gutturalis_hybrids.vcf.gz | bcftools query -f '%CHROM\t%POS\n'
This outputs:
NC_053488.1 16368106
NC_053488.1 16368109
NC_053488.1 16368122
Note, the queried file in the command above was generated after extracting the SNP set for the tytleri-gutturalis hybrids.
Remove these SNPs from the parental input file.
grep -vwE "(16368106|16368109|16368122)" ./fst/anc-info.snps.tytleri-gutturalis.txt > ./fst/tmp.anc-info.snps.tytleri-gutturalis.txt
rm ./fst/anc-info.snps.tytleri-gutturalis.txt
mv ./fst/tmp.anc-info.snps.tytleri-gutturalis.txt ./fst/anc-info.snps.tytleri-gutturalis.txt
- Extract ancestry-informative SNPs for parentals and hybrids.
bcftools view --threads 16 -S popmap.rustica -R ./fst/anc-info.snps.rustica-tytleri.txt -O z -o ./input/anc-info.snps.rustica-tytleri_rustica.vcf.gz /media/drewschield/VernalBucket/hirundo/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz
bcftools view --threads 16 -S popmap.tytleri -R ./fst/anc-info.snps.rustica-tytleri.txt -O z -o ./input/anc-info.snps.rustica-tytleri_tytleri.vcf.gz /media/drewschield/VernalBucket/hirundo/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz
bcftools view --threads 16 -S popmap.rustica-tytleri -R ./fst/anc-info.snps.rustica-tytleri.txt -O z -o ./input/anc-info.snps.rustica-tytleri_hybrids.vcf.gz /media/drewschield/VernalBucket/hirundo/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz
bcftools view --threads 16 -S popmap.rustica -R ./fst/anc-info.snps.rustica-gutturalis.txt -O z -o ./input/anc-info.snps.rustica-gutturalis_rustica.vcf.gz /media/drewschield/VernalBucket/hirundo/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz
bcftools view --threads 16 -S popmap.gutturalis -R ./fst/anc-info.snps.rustica-gutturalis.txt -O z -o ./input/anc-info.snps.rustica-gutturalis_gutturalis.vcf.gz /media/drewschield/VernalBucket/hirundo/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz
bcftools view --threads 16 -S popmap.rustica-gutturalis -R ./fst/anc-info.snps.rustica-gutturalis.txt -O z -o ./input/anc-info.snps.rustica-gutturalis_hybrids.vcf.gz /media/drewschield/VernalBucket/hirundo/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz
bcftools view --threads 16 -S popmap.tytleri -R ./fst/anc-info.snps.tytleri-gutturalis.txt -O z -o ./input/anc-info.snps.tytleri-gutturalis_tytleri.vcf.gz /media/drewschield/VernalBucket/hirundo/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz
bcftools view --threads 16 -S popmap.gutturalis -R ./fst/anc-info.snps.tytleri-gutturalis.txt -O z -o ./input/anc-info.snps.tytleri-gutturalis_gutturalis.vcf.gz /media/drewschield/VernalBucket/hirundo/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz
bcftools view --threads 16 -S popmap.tytleri-gutturalis -R ./fst/anc-info.snps.tytleri-gutturalis.txt -O z -o ./input/anc-info.snps.tytleri-gutturalis_hybrids.vcf.gz /media/drewschield/VernalBucket/hirundo/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz
We'll run the analysis in R using the script ~/hirundo_speciation_genomics/R/introgress.R
.
We'll infer two-population demographic histories and the relative timing of secondary contact between parental populations using modifications to dadi
described in Rougemont et al. 2017 and available via GitHub. Preliminary analyses revealed that our sample sizes from the whole genome dataset were insufficient to resolve demographic histories following very recent divergence (consistent with simulation-based results from Robinson et al. 2014). To capitalize on a larger sample size, we'll reanalyze RADseq data from Scordato et al. 2017 and Scordato et al. 2020, which were demutiplexed and quality trimmed as described in Schield et al. 2021.
cd ./analysis
mkdir dadi
cd dadi
mkdir fastq_filtered
mkdir bam
mkdir gvcf
mkdir vcf
mkdir log
mkdir dadi_rustica-tytleri
mkdir dadi_rustica-gutturalis
mkdir dadi_tytleri-gutturalis
Distribute copies of the Rougemont et al. dadi
modifications into each of the dadi_...
subdirectories.
git clone https://github.com/QuentinRougemont/DemographicInference.git
mv DemographicInference dadi_rustica-tytleri
git clone https://github.com/QuentinRougemont/DemographicInference.git
mv DemographicInference dadi_rustica-gutturalis
git clone https://github.com/QuentinRougemont/DemographicInference.git
mv DemographicInference dadi_tytleri-gutturalis
The filtered fastq data for the RADseq data from previous studies need to be in ./fastq_filtered
. We already have a gVCF for the H. smithii outgroup, which we will incorporate later.
-
Format
sample.filter.popmap
. -
Format
sample.filter.list
.
cut -f1 sample.filter.popmap > sample.filter.list
- Run
runBWAmem.sh
to map to the reference.
sh runBWAmem.sh > ./log/runBWAmem.log
- Format
./bam/bam.list
file.
- Temporarily increase the limit on number of open files to accommodate bam files.
ulimit -n 1000
- Call variants using bcftools
mpileup
andcall
.
cd ./bam
bcftools mpileup --threads 16 -Ou -B -a "DP,AD" -b bam.list -f ~/hirundo_speciation_genomics/Hirundo_rustica_bHirRus1.final.fasta | bcftools call --threads 16 -c -v -V indels -f "GQ" -Oz -o ../vcf/hirundo_rustica.rad.snps.tmp.vcf.gz
tabix -p vcf ../hirundo_rustica.rad.snps.tmp.vcf.gz
cd ..
Note: the '-a "DP,AD"' flag is very important; mpileup
does not output individual per-base depth statistics by default, and we will use a DP filter downstream to recode and cull missing data among samples, and ultimately to filter SNPs with high proportions of missing data.
-
Format
scaff.list
with list of scaffolds to retain. -
Run bcftools
+setGT
to set low depth SNPs as missing.
bcftools +setGT ./vcf/hirundo_rustica.rad.snps.tmp.vcf.gz -- -t q -n . -i 'FMT/DP<5' | bcftools view --threads 16 -O z -o ./vcf/hirundo_rustica.rad.snps.tmp.missing.vcf.gz
tabix -p vcf ./vcf/hirundo_rustica.rad.snps.tmp.missing.vcf.gz
- Run bcftools
filter
to filter to ordered autosomes and remove SNPs with > 20% missing data.
bcftools view --threads 16 -R scaff.list -m2 -M2 -U -v snps -i 'F_MISSING<0.2' -O z -o ./vcf/hirundo_rustica.rad.snps.auto.miss02.vcf.gz ./vcf/hirundo_rustica.rad.snps.tmp.missing.vcf.gz
tabix -p vcf ./vcf/hirundo_rustica.rad.snps.auto.miss02.vcf.gz
- Extract SNP data for H. smithii.
bcftools view --threads 16 -s RS_5 -O z -o ./vcf/hirundo_smithii.auto.snps.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto.snps.vcf.gz
tabix -p vcf ./vcf/hirundo_smithii.auto.snps.vcf.gz
-
Format list of paths to VCFs to merge
vcf.merge.list
. -
Run bcftools
merge
to combine ingroup and outgroup VCFs and filter to retain SNPs with < 20% missing data.
bcftools merge --threads 16 -l vcf.merge.list | bcftools view --threads 16 -m2 -M2 -U -v snps -i 'F_MISSING<0.2' -O z -o ./vcf/hirundo_rustica+smithii.rad.snps.tmp.vcf.gz
- Write SNP positions where H. smithii has missing data to a file.
bcftools view -H -s RS_5 ./vcf/hirundo_rustica+smithii.rad.snps.tmp.vcf.gz | grep -v "0/1" | grep -v "0/0" | grep -v "1/1" | cut -d$'\t' -f1,2 > snp.remove.list
- Remove SNPs with missing genotypes for outgroup H. smithii and thin to retain a single SNP per RAD locus.
vcftools --gzvcf ./vcf/hirundo_rustica+smithii.rad.snps.tmp.vcf.gz --exclude-positions snp.remove.list --thin 10000 --recode --stdout | bgzip -c > ./vcf/hirundo_rustica+smithii.rad.snps.dadi.vcf.gz
- Write a simplified VCF (to be read into polarization script).
(bcftools view -h ./vcf/hirundo_rustica+smithii.rad.snps.dadi.vcf.gz; bcftools query -f "%CHROM\\t%POS\\t%ID\\t%REF\\t%ALT\\t%QUAL\\t%FILTER\\t%INFO\\tGT\\t[%GT\\t]\\n" ./vcf/hirundo_rustica+smithii.rad.snps.dadi.vcf.gz) | cat | bcftools view -m2 -M2 -v snps -O z -o ./vcf/hirundo_rustica+smithii.rad.snps.dadi.fix.vcf.gz
- Polarize VCF by outgroup using the script written by K. Ullrich available here.
python polarizeVCFbyOutgroup.py -vcf ./vcf/hirundo_rustica+smithii.rad.snps.dadi.fix.vcf.gz -out ./vcf/hirundo_rustica+smithii.rad.snps.dadi.polarized.vcf.gz -ind 1 -add
We'll convert the SNP data in the polarized VCF to an unfolded jSFS for downstream analysis using easySFS
popmap.rustica-tytleri
popmap.rustica-gutturalis
popmap.tytleri-gutturalis
conda activate easySFS
./easySFS/easySFS.py -i ./vcf/hirundo_rustica+smithii.rad.snps.dadi.polarized.vcf.gz -p popmap.rustica-tytleri --unfolded -a --preview
./easySFS/easySFS.py -i ./vcf/hirundo_rustica+smithii.rad.snps.dadi.polarized.vcf.gz -p popmap.rustica-gutturalis --unfolded -a --preview
./easySFS/easySFS.py -i ./vcf/hirundo_rustica+smithii.rad.snps.dadi.polarized.vcf.gz -p popmap.tytleri-gutturalis --unfolded -a --preview
A sample size of 96 gives a reasonably high number of segregating sites per analysis.
./easySFS/easySFS.py -i ./vcf/hirundo_rustica+smithii.rad.snps.dadi.polarized.vcf.gz -p popmap.rustica-tytleri --unfolded -a -o sfs_rustica-tytleri --prefix sfs.rustica-tytleri --proj 96,96 -f
./easySFS/easySFS.py -i ./vcf/hirundo_rustica+smithii.rad.snps.dadi.polarized.vcf.gz -p popmap.rustica-gutturalis --unfolded -a -o sfs_rustica-gutturalis --prefix sfs.rustica-gutturalis --proj 96,96 -f
./easySFS/easySFS.py -i ./vcf/hirundo_rustica+smithii.rad.snps.dadi.polarized.vcf.gz -p popmap.tytleri-gutturalis --unfolded -a -o sfs_tytleri-gutturalis --prefix sfs.tytleri-gutturalis --proj 96,96 -f
conda deactivate
cp ./sfs_rustica-tytleri/dadi/rustica-tytleri.sfs ./dadi_rustica-tytleri/03-data/
cp ./sfs_rustica-gutturalis/dadi/rustica-gutturalis.sfs ./dadi_rustica-gutturalis/03-data/
cp ./sfs_tytleri-gutturalis/dadi/tytleri-gutturalis.sfs ./dadi_tytleri-gutturalis/03-data/
Here, we'll set up a virtual environment with the necessary dependencies for dadi to run based on the workflow and scripts from Rougemont et al. 2017.
Dependencies:
- Python 2.7
- scipy 0.13 or older
- numpy
- matplotlib
- pylab
- parallel
- libgfortran3
- mkl
Note, this was not the most straightforward to get set up. Specific details below should help avoid unnecessary hangups.
mkdir dadi-install
cd dadi-install
virtualenv -p /usr/bin/python2.7 dadi-env
source dadi-env/bin/activate
cd ..
Run deactivate
to exit virtual environment.
cd dadi_rustica-tytleri/01-scripts
pip install scipy-0.13.3-cp27-cp27mu-linux_x86_64.whl
cd ..
pip install numpy
pip install matplotlib==2.0.0
sudo apt-get install parallel
sudo apt-get install libgfortran3
pip install mkl
cd dadi-install/dadi-env/lib/
mv libmkl_rt.so.2 libmkl_rt.so
cd ../../../
We need to run this during each new terminal instance prior to running analysis:
export LD_LIBRARY_PATH=/data3/hirundo/analysis/dadi/dadi-install/dadi-env/lib/:$LD_LIBRARY_PATH
chmod 777 ./dadi_rustica-tytleri/01-scripts/*.sh
chmod 777 ./dadi_rustica-gutturalis/01-scripts/*.sh
chmod 777 ./dadi_tytleri-gutturalis/01-scripts/*.sh
cd ./analysis/dadi
source dadi-env/bin/activate
export LD_LIBRARY_PATH=/data3/hirundo/analysis/dadi/dadi-install/dadi-env/lib/:$LD_LIBRARY_PATH
cd dadi_rustica-tytleri
./01-scripts/01-run_model_iteration_v2.sh 1 ./03-data/rustica-tytleri.sfs SI unfolded 80
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-tytleri.sfs SI unfolded 80
Both scripts should run without errors and produce output files in dadi_rustica-tytleri
.
sh ./01-scripts/01-run_model_iteration_v2.sh 55 ./03-data/rustica-tytleri.sfs SI unfolded 80
sh ./01-scripts/01-run_model_iteration_v2.sh 55 ./03-data/rustica-tytleri.sfs AM unfolded 80
sh ./01-scripts/01-run_model_iteration_v2.sh 55 ./03-data/rustica-tytleri.sfs IM unfolded 80
sh ./01-scripts/01-run_model_iteration_v2.sh 55 ./03-data/rustica-tytleri.sfs SC unfolded 80
sh ./01-scripts/01-run_model_iteration_v2.sh 55 ./03-data/rustica-tytleri.sfs AM2m unfolded 80
sh ./01-scripts/01-run_model_iteration_v2.sh 55 ./03-data/rustica-tytleri.sfs IM2m unfolded 80
sh ./01-scripts/01-run_model_iteration_v2.sh 55 ./03-data/rustica-tytleri.sfs SC2m unfolded 80
The 'SC' model fails with "ValueError: need more than 7 values to unpack". If we look at the models detailed in 02-modifs_v2/unfolded/modeledemo_new_models.py
and there are two 'SC' models. The second includes more than 7 parameters (we will comment this out below).
- Make subdirectory to tinker with scripts.
mkdir script_edits
cd script_edits
cp 02-modifs_v2/unfolded/modeledemo_new_models.py .
cp 01-scripts/00.run_dadi_parallel_v2.sh .
-
Edit
modeledemo_new_models.py
to comment out second 'SC' model. -
Edit
00.run_dadi_parallel_v2.sh
to perform 20 iterations. -
Distribute edited scripts to analysis directories.
cp modeledemo_new_models.py ../dadi_rustica-tytleri/02-modifs_v2/unfolded/
cp modeledemo_new_models.py ../dadi_rustica-gutturalis/02-modifs_v2/unfolded/
cp modeledemo_new_models.py ../dadi_tytleri-gutturalis/02-modifs_v2/unfolded/
cp 00.run_dadi_parallel_v2.sh ../dadi_rustica-tytleri/01-scripts/
cp 00.run_dadi_parallel_v2.sh ../dadi_rustica-gutturalis/01-scripts/
cp 00.run_dadi_parallel_v2.sh ../dadi_tytleri-gutturalis/01-scripts/
We now have an environment set up to run demographic models.
A note on grid size: the dadi
manual suggests setting a grid size to be somewhat larger than the largest dimension of the SFS analyzed (this point is reiterated here by R. Gutenkunst. Larger grid sizes provide higher accuracy at the expense of longer runtimes. The analysis scripts here multiply the input grid size by 2.
source ./dadi-install/dadi-env/bin/activate
export LD_LIBRARY_PATH=/data3/hirundo/analysis/dadi/dadi-install/dadi-env/lib/:$LD_LIBRARY_PATH
cd dadi_rustica-tytleri
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-tytleri.sfs SI unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-tytleri.sfs AM unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-tytleri.sfs IM unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-tytleri.sfs SC unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-tytleri.sfs AM2m unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-tytleri.sfs IM2m unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-tytleri.sfs SC2m unfolded 55
cd ..
To check on a running model (e.g., 'SI'):
for i in ./10-log/SI_*; do tail $i; done
source ./dadi-install/dadi-env/bin/activate
export LD_LIBRARY_PATH=/data3/hirundo/analysis/dadi/dadi-install/dadi-env/lib/:$LD_LIBRARY_PATH
cd dadi_rustica-gutturalis
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-gutturalis.sfs SI unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-gutturalis.sfs AM unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-gutturalis.sfs IM unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-gutturalis.sfs SC unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-gutturalis.sfs AM2m unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-gutturalis.sfs IM2m unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-gutturalis.sfs SC2m unfolded 55
cd ..
source ./dadi-install/dadi-env/bin/activate
export LD_LIBRARY_PATH=/data3/hirundo/analysis/dadi/dadi-install/dadi-env/lib/:$LD_LIBRARY_PATH
cd dadi_tytleri-gutturalis
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/tytleri-gutturalis.sfs SI unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/tytleri-gutturalis.sfs AM unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/tytleri-gutturalis.sfs IM unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/tytleri-gutturalis.sfs SC unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/tytleri-gutturalis.sfs AM2m unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/tytleri-gutturalis.sfs IM2m unfolded 55
./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/tytleri-gutturalis.sfs SC2m unfolded 55
cd ..
- Write
modelLL.sh
to./analysis/dadi/script_edits
and distribute to analysis directories.
cd ./analysis/dadi/script_edits
cp modelLL.sh ../dadi_rustica-tytleri
cp modelLL.sh ../dadi_rustica-gutturalis
cp modelLL.sh ../dadi_tytleri-gutturalis
cd ..
- Run script to parse best-fitting model.
cd dadi_rustica-tytleri
sh modelLL.sh
cd ..
cd dadi_rustica-gutturalis
sh modelLL.sh
cd ..
cd dadi_tytleri-gutturalis
sh modelLL.sh
cd ..
Here, we'll run 20 iterations of the best-fit models from above, but with larger grid size to increase parameter estimation accuracy.
source ./dadi-install/dadi-env/bin/activate
export LD_LIBRARY_PATH=/data3/hirundo/analysis/dadi/dadi-install/dadi-env/lib/:$LD_LIBRARY_PATH
cd dadi_rustica-tytleri
nohup ./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-tytleri.sfs SC2m unfolded 70 &
for i in SC2m_*; do model=`echo $i | cut -d'_' -f1`; iter=`echo $i | cut -d'_' -f2`; ll=`grep 'Optimized log-likelihood' $i/$i.txt | tail -n1`; aic=`grep 'AIC' $i/$i.txt | tail -n1`; theta=`grep 'theta' $i/$i.txt | tail -n1`; echo $model $iter $ll $aic $theta | sort -k3; done
cd ..
cd dadi_rustica-gutturalis
nohup ./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/rustica-gutturalis.sfs SC2m unfolded 70 &
for i in SC2m_*; do model=`echo $i | cut -d'_' -f1`; iter=`echo $i | cut -d'_' -f2`; ll=`grep 'Optimized log-likelihood' $i/$i.txt | tail -n1`; aic=`grep 'AIC' $i/$i.txt | tail -n1`; theta=`grep 'theta' $i/$i.txt | tail -n1`; echo $model $iter $ll $aic $theta | sort -k3; done
cd ..
cd dadi_tytleri-gutturalis
nohup ./01-scripts/00.run_dadi_parallel_v2.sh ./03-data/tytleri-gutturalis.sfs SC unfolded 70 &
for i in SC_*; do model=`echo $i | cut -d'_' -f1`; iter=`echo $i | cut -d'_' -f2`; ll=`grep 'Optimized log-likelihood' $i/$i.txt | tail -n1`; aic=`grep 'AIC' $i/$i.txt | tail -n1`; theta=`grep 'theta' $i/$i.txt | tail -n1`; echo $model $iter $ll $aic $theta | sort -k3; done
cd ..
We'll map genetic associations with ventral color and tail streamer length variation using Bayesian sparse linear mixed models (BSLMM) and univariate linear mixed models (LMM) in GEMMA
.
cd ./analysis/
mkdir gemma
cd gemma
mkdir vcf
mkdir vcf_imputed
mkdir phenotype_lists
mkdir input
Format popmaps.
popmap.all
popmap.all.male
popmap.all.female
cd ~/hirundo_speciation_genomics/tmp/
wget https://github.com/genetics-statistics/GEMMA/releases/download/v0.98.4/gemma-0.98.4-linux-static-AMD64.gz
chmod +x gemma-0.98.4-linux-static-AMD64
mv gemma-0.98.4-linux-static-AMD64 gemma
sudo cp gemma /usr/local/bin/
cd ../analysis/gemma
wget https://faculty.washington.edu/browning/beagle/beagle.28Jun21.220.jar
GEMMA
requires complete genotype information, so we will impute missing genotypes using beagle
.
bcftools concat -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto.snps.miss02.maf05.ingroup.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.chrZ.snps.miss02.maf05.ingroup.vcf.gz
bcftools view --threads 16 -S popmap.all.male -O v ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz | bcftools filter -e 'F_MISSING > 0.2 || MAF <= 0.05' -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.male.vcf.gz
bcftools view --threads 16 -S popmap.all.female -O v ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz | bcftools filter -e 'F_MISSING > 0.2 || MAF <= 0.05' -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.female.vcf.gz
bcftools view --threads 16 -S popmap.all.female -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.chrW.snps.miss02.maf05.female.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.chrW.snps.miss02.maf05.ingroup.vcf.gz
bcftools concat --threads 16 -O z -o ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ+chrW.snps.miss02.maf05.female.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.female.vcf.gz ./vcf/hirundo_rustica+smithii.allsites.final.chrW.snps.miss02.maf05.female.vcf.gz
rm ./vcf/hirundo_rustica+smithii.allsites.final.chrW.snps.miss02.maf05.female.vcf.gz
rm ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.female.vcf.gz
java -Xmx96g -jar beagle.28Jun21.220.jar nthreads=16 gt=./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz out=./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.impute
java -Xmx96g -jar beagle.28Jun21.220.jar nthreads=16 gt=./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.male.vcf.gz out=./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.male.impute
java -Xmx96g -jar beagle.28Jun21.220.jar nthreads=16 gt=./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ+chrW.snps.miss02.maf05.female.vcf.gz out=./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ+chrW.snps.miss02.maf05.female.impute
tabix -p vcf ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.impute.vcf.gz
tabix -p vcf ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.male.impute.vcf.gz
tabix -p vcf ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ+chrW.snps.miss02.maf05.female.impute.vcf.gz
There are cases where samples have incomplete phenotype matrices. GEMMA
requires complete data, so we'll downsample our imputed genotypes accordingly.
Sample sizes:
- Hybrid ventral color n = 159
- Hybrid tail streamer length n = 151
- Full ventral color n = 305
- Full tail streamer length n = 300
- Male ventral color n = 157
- Male tail streamer length n = 151
- Female ventral color n = 148
- Female tail streamer length n = 149
Lists of samples in these categories are in ./phenotype_lists/
.
bcftools view --threads 16 -S ./phenotype_lists/list.hybrid.plum -O v ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.impute.vcf.gz | bcftools filter -e 'F_MISSING > 0.2 || MAF <= 0.05' -O z -o ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.hybrid_all.impute.plum.vcf.gz
bcftools view --threads 16 -S ./phenotype_lists/list.hybrid.tail -O v ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.impute.vcf.gz | bcftools filter -e 'F_MISSING > 0.2 || MAF <= 0.05' -O z -o ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.hybrid_all.impute.tail.vcf.gz
bcftools view --threads 16 -S ./phenotype_lists/list.full.plum -O v ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.impute.vcf.gz | bcftools filter -e 'F_MISSING > 0.2 || MAF <= 0.05' -O z -o ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.impute.plum.vcf.gz
bcftools view --threads 16 -S ./phenotype_lists/list.full.tail -O v ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.impute.vcf.gz | bcftools filter -e 'F_MISSING > 0.2 || MAF <= 0.05' -O z -o ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.impute.tail.vcf.gz
bcftools view --threads 16 -S ./phenotype_lists/list.male.plum -O v ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.male.impute.vcf.gz | bcftools filter -e 'F_MISSING > 0.2 || MAF <= 0.05' -O z -o ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.male.impute.plum.vcf.gz
bcftools view --threads 16 -S ./phenotype_lists/list.male.tail -O v ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.male.impute.vcf.gz | bcftools filter -e 'F_MISSING > 0.2 || MAF <= 0.05' -O z -o ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.male.impute.tail.vcf.gz
bcftools view --threads 16 -S ./phenotype_lists/list.female.plum -O v ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ+chrW.snps.miss02.maf05.female.impute.vcf.gz | bcftools filter -e 'F_MISSING > 0.2 || MAF <= 0.05' -O z -o ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ+chrW.snps.miss02.maf05.female.impute.plum.vcf.gz
bcftools view --threads 16 -S ./phenotype_lists/list.female.tail -O v ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ+chrW.snps.miss02.maf05.female.impute.vcf.gz | bcftools filter -e 'F_MISSING > 0.2 || MAF <= 0.05' -O z -o ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ+chrW.snps.miss02.maf05.female.impute.tail.vcf.gz
plink --vcf ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.hybrid_all.impute.plum.vcf.gz --make-bed --out ./input/gwas.hybrid.plum --allow-extra-chr
plink --vcf ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.hybrid_all.impute.tail.vcf.gz --make-bed --out ./input/gwas.hybrid.tail --allow-extra-chr
plink --vcf ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.impute.plum.vcf.gz --make-bed --out ./input/gwas.full.plum --allow-extra-chr
plink --vcf ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.impute.tail.vcf.gz --make-bed --out ./input/gwas.full.tail --allow-extra-chr
plink --vcf ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.male.impute.plum.vcf.gz --make-bed --out ./input/gwas.male.plum --allow-extra-chr
plink --vcf ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.male.impute.tail.vcf.gz --make-bed --out ./input/gwas.male.tail --allow-extra-chr
plink --vcf ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ+chrW.snps.miss02.maf05.female.impute.plum.vcf.gz --make-bed --out ./input/gwas.female.plum --allow-extra-chr
plink --vcf ./vcf_imputed/hirundo_rustica+smithii.allsites.final.auto+chrZ+chrW.snps.miss02.maf05.female.impute.tail.vcf.gz --make-bed --out ./input/gwas.female.tail --allow-extra-chr
Here, we'll edit the .fam files generated by plink
to include the phenotype data.
cd input
mkdir original_fam
cp *.fam original_fam
cd ..
Note: we've added randomized phenotypes to the .fam files for the full dataset.
Dataset | File | n1 | n2 | n3 | n4 | n5 |
---|---|---|---|---|---|---|
Hybrid | plum | breast | ||||
Hybrid | tail | tail | ||||
Full | plum | throat | breast | belly | vent | breast random |
Full | tail | tail | tail random | |||
Male | plum | throat | breast | belly | vent | |
Male | tail | tail | ||||
Female | plum | throat | breast | belly | vent | |
Female | tail | tail |
gemma -bfile ./input/gwas.hybrid.plum -gk 1 -miss 1 -maf 0 -r2 1 -hwe 0 -o ./gemma.hybrid.plum
gemma -bfile ./input/gwas.hybrid.tail -gk 1 -miss 1 -maf 0 -r2 1 -hwe 0 -o ./gemma.hybrid.tail
gemma -bfile ./input/gwas.full.plum -gk 1 -miss 1 -maf 0 -r2 1 -hwe 0 -o ./gemma.full.plum
gemma -bfile ./input/gwas.full.tail -gk 1 -miss 1 -maf 0 -r2 1 -hwe 0 -o ./gemma.full.tail
gemma -bfile ./input/gwas.male.plum -gk 1 -miss 1 -maf 0 -r2 1 -hwe 0 -o ./gemma.male.plum
gemma -bfile ./input/gwas.male.tail -gk 1 -miss 1 -maf 0 -r2 1 -hwe 0 -o ./gemma.male.tail
gemma -bfile ./input/gwas.female.plum -gk 1 -miss 1 -maf 0 -r2 1 -hwe 0 -o ./gemma.female.plum
gemma -bfile ./input/gwas.female.tail -gk 1 -miss 1 -maf 0 -r2 1 -hwe 0 -o ./gemma.female.tail
The output files are written to the auto-generated ./output/
subdirectory.
To characterize the overall genetic architecture and proportions of variance explained by all SNPs (PVE) and the proportion of variance explained by sparse effects (alleles of large effect; PGE), we'll run Bayesian sparse linear mixed models in GEMMA
. We'll run 10 independent chains per analysis with 5 million generations of burn-in and 25 million sampling generations.
- Run
runBSLMM.sh
to perform BSLMM analysis.
cd output
mkdir log
cd ..
sh runBSLMM.sh ./input/gwas.hybrid.plum ./output/gemma.hybrid.plum.cXX.txt 5000000 25000000 1 hybrid-breast-bright > ./output/log/bslmm.hybrid-breast-bright.log
sh runBSLMM.sh ./input/gwas.hybrid.tail ./output/gemma.hybrid.tail.cXX.txt 5000000 25000000 1 hybrid-tail-streamer > ./output/log/bslmm.hybrid-tail-streamer.log
- Summarize results using R scripts
processBSLMM_full-breast-bright.R
andprocessBSLMM_full-tail-streamer.R
, adopted from the script from K. Delmore and E. de Greef used in de Greef et al. 2023. They take as input a specified file name convention, then parse and combine results from the multiple runs, outputting summary tables for hyperparameters, as well as specific tables for the parameters, including sets of SNPs passing PIP thresholds and an output of sparse effects for all SNPs (i.e., means of all runs).
Rscript processBSLMM_hybrid-breast-bright.R
Rscript processBSLMM_hybrid-tail-streamer.R
To summarize the posterior distributions of hyperparameters and per-SNP posterior inclusion probabilities, run ~/hirundo_speciation_genomics/R/gemma_BSLMM.R
.
Run LMMs for hybrid, full (including randomized phenotypes), male, and female datasets for both traits
gemma -bfile ./input/gwas.hybrid.plum -k ./output/gemma.hybrid.plum.cXX.txt -lmm 4 -n 1 -o gwas_full_lmm.hybrid-breast-bright
gemma -bfile ./input/gwas.hybrid.tail -k ./output/gemma.hybrid.tail.cXX.txt -lmm 4 -n 1 -o gwas_full_lmm.hybrid-tail-streamer
gemma -bfile ./input/gwas.full.tail -k ./output/gemma.full.tail.cXX.txt -lmm 4 -n 1 -o gwas_full_lmm.full-tail-streamer
gemma -bfile ./input/gwas.full.plum -k ./output/gemma.full.plum.cXX.txt -lmm 4 -n 2 -o gwas_full_lmm.full-breast-bright
gemma -bfile ./input/gwas.full.tail -k ./output/gemma.full.tail.cXX.txt -lmm 4 -n 2 -o gwas_full_lmm.full-tail-streamer-random
gemma -bfile ./input/gwas.full.plum -k ./output/gemma.full.plum.cXX.txt -lmm 4 -n 5 -o gwas_full_lmm.full-breast-bright-random
gemma -bfile ./input/gwas.male.tail -k ./output/gemma.male.tail.cXX.txt -lmm 4 -n 1 -o gwas_full_lmm.male-tail-streamer
gemma -bfile ./input/gwas.male.plum -k ./output/gemma.male.plum.cXX.txt -lmm 4 -n 2 -o gwas_full_lmm.male-breast-bright
gemma -bfile ./input/gwas.female.tail -k ./output/gemma.female.tail.cXX.txt -lmm 4 -n 1 -o gwas_full_lmm.female-tail-streamer
gemma -bfile ./input/gwas.female.plum -k ./output/gemma.female.plum.cXX.txt -lmm 4 -n 2 -o gwas_full_lmm.female-breast-bright
The output files for LMMs are large, so we can prune these down considerably by filtering out SNPs with a -log10(Wald P-value) < 2 (primarily for plotting purposes).
mkdir lmm_full_prune
cd output
awk '{if (-log($13)/log(10)>=2) print $0}' ./gwas_full_lmm.hybrid-breast-bright.assoc.txt | grep -v 'NW_' > ../lmm_full_prune/gwas_full_lmm.hybrid-breast-bright.assoc.prune.txt
awk '{if (-log($13)/log(10)>=2) print $0}' ./gwas_full_lmm.hybrid-tail-streamer.assoc.txt | grep -v 'NW_' > ../lmm_full_prune/gwas_full_lmm.hybrid-tail-streamer.assoc.prune.txt
awk '{if (-log($13)/log(10)>=2) print $0}' ./gwas_full_lmm.full-tail-streamer.assoc.txt | grep -v 'NW_' > ../lmm_full_prune/gwas_full_lmm.full-tail-streamer.assoc.prune.txt
awk '{if (-log($13)/log(10)>=2) print $0}' ./gwas_full_lmm.full-breast-bright.assoc.txt | grep -v 'NW_' > ../lmm_full_prune/gwas_full_lmm.full-breast-bright.assoc.prune.txt
awk '{if (-log($13)/log(10)>=2) print $0}' ./gwas_full_lmm.full-tail-streamer-random.assoc.txt | grep -v 'NW_' > ../lmm_full_prune/gwas_full_lmm.full-tail-streamer-random.assoc.prune.txt
awk '{if (-log($13)/log(10)>=2) print $0}' ./gwas_full_lmm.full-breast-bright-random.assoc.txt | grep -v 'NW_' > ../lmm_full_prune/gwas_full_lmm.full-breast-bright-random.assoc.prune.txt
awk '{if (-log($13)/log(10)>=2) print $0}' ./gwas_full_lmm.male-tail-streamer.assoc.txt | grep -v 'NW_' > ../lmm_full_prune/gwas_full_lmm.male-tail-streamer.assoc.prune.txt
awk '{if (-log($13)/log(10)>=2) print $0}' ./gwas_full_lmm.male-breast-bright.assoc.txt | grep -v 'NW_' > ../lmm_full_prune/gwas_full_lmm.male-breast-bright.assoc.prune.txt
awk '{if (-log($13)/log(10)>=2) print $0}' ./gwas_full_lmm.female-tail-streamer.assoc.txt | grep -v 'NW_' > ../lmm_full_prune/gwas_full_lmm.female-tail-streamer.assoc.prune.txt
awk '{if (-log($13)/log(10)>=2) print $0}' ./gwas_full_lmm.female-breast-bright.assoc.txt | grep -v 'NW_' > ../lmm_full_prune/gwas_full_lmm.female-breast-bright.assoc.prune.txt
cd ..
To enable searching of genome features associated with significant SNPs, we'll parse SNPs with a Wald P-value < 0.05 after Bonferroni correction.
Table of SNPs in each model:
Trait | Dataset | SNPs |
---|---|---|
ventral color | hybrid | 9033285 |
ventral color | full | 9311628 |
ventral color | male | 8851265 |
ventral color | female | 8757718 |
tail streamer | hybrid | 8870504 |
tail streamer | full | 9246603 |
tail streamer | male | 8743581 |
tail streamer | female | 8773176 |
mkdir significant
cd output
awk '{if ($13 < (0.05/9033285)) print $0}' ./gwas_full_lmm.hybrid-breast-bright.assoc.txt | grep -v 'NW_' > ../significant/gwas_full_lmm.hybrid-breast-bright.assoc.sig.txt
awk '{if ($13 < (0.05/8870504)) print $0}' ./gwas_full_lmm.hybrid-tail-streamer.assoc.txt | grep -v 'NW_' > ../significant/gwas_full_lmm.hybrid-tail-streamer.assoc.sig.txt
awk '{if ($13 < (0.05/9311628)) print $0}' ./gwas_full_lmm.full-tail-streamer.assoc.txt | grep -v 'NW_' > ../significant/gwas_full_lmm.full-tail-streamer.assoc.sig.txt
awk '{if ($13 < (0.05/9246603)) print $0}' ./gwas_full_lmm.full-breast-bright.assoc.txt | grep -v 'NW_' > ../significant/gwas_full_lmm.full-breast-bright.assoc.sig.txt
awk '{if ($13 < (0.05/9311628)) print $0}' ./gwas_full_lmm.full-tail-streamer-random.assoc.txt | grep -v 'NW_' > ../significant/gwas_full_lmm.full-tail-streamer-random.assoc.sig.txt
awk '{if ($13 < (0.05/9246603)) print $0}' ./gwas_full_lmm.full-breast-bright-random.assoc.txt | grep -v 'NW_' > ../significant/gwas_full_lmm.full-breast-bright-random.assoc.sig.txt
awk '{if ($13 < (0.05/8851265)) print $0}' ./gwas_full_lmm.male-tail-streamer.assoc.txt | grep -v 'NW_' > ../significant/gwas_full_lmm.male-tail-streamer.assoc.sig.txt
awk '{if ($13 < (0.05/8743581)) print $0}' ./gwas_full_lmm.male-breast-bright.assoc.txt | grep -v 'NW_' > ../significant/gwas_full_lmm.male-breast-bright.assoc.sig.txt
awk '{if ($13 < (0.05/8757718)) print $0}' ./gwas_full_lmm.female-tail-streamer.assoc.txt | grep -v 'NW_' > ../significant/gwas_full_lmm.female-tail-streamer.assoc.sig.txt
awk '{if ($13 < (0.05/8773176)) print $0}' ./gwas_full_lmm.female-breast-bright.assoc.txt | grep -v 'NW_' > ../significant/gwas_full_lmm.female-breast-bright.assoc.sig.txt
cd ..
mkdir annotation
mkdir significant_annotation
cd annotation
- Make 'gene' and 'exon' GFF files from the full genomic GFF annotation in
~/hirundo_speciation_genomics/genome_annotation
.
$awk '{if ($3 == "gene") print $0}' ~/hirundo_speciation_genomics/genome_annotation/GCF_015227805.1_bHirRus1.pri.v2_genomic.gff > ~/hirundo_speciation_genomics/genome_annotation/GCF_015227805.1_bHirRus1.pri.v2_genomic.gene.gff
$awk '{if ($3 == "exon") print $0}' ~/hirundo_speciation_genomics/genome_annotation/GCF_015227805.1_bHirRus1.pri.v2_genomic.gff | bedtools sort -i - > ~/hirundo_speciation_genomics/genome_annotation/GCF_015227805.1_bHirRus1.pri.v2_genomic.exon.gff
- Convert significant SNP tables to BED format.
cd ./significant
for i in *.sig.txt; do iname=${i%.txt}; tail -n+2 $i | awk 'BEGIN{OFS="\t"}{print $1,$3-1,$3,$13}' | bedtools sort -i - > $iname.bed; done
for i in *-random.assoc.sig.txt; do iname=${i%.txt}; tail -n+2 $i | awk 'BEGIN{OFS="\t"}{print $1,$3-1,$3,$13}' | bedtools sort -i - > $iname.bed; done
- Run bedtools
intersect
to find overlaps between significant SNPs and genes.
for i in *.sig.bed; do iname=${i%.bed}; echo -e "chrom-snp\tstart-snp\tend-snp\tWaldP\tchrom-gene\tstart-gene\tend-gene\tstrand\tfeature" > ../significant_annotation/$iname.gene-intersect.txt; bedtools intersect -wo -a $i -b ../annotation/GCF_015227805.1_bHirRus1.pri.v2_genomic.gene.sort.gff | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$4,$5,$8,$9,$11,$13}' >> ../significant_annotation/$iname.gene-intersect.txt; done
for i in *-random.assoc.sig.bed; do iname=${i%.bed}; echo -e "chrom-snp\tstart-snp\tend-snp\tWaldP\tchrom-gene\tstart-gene\tend-gene\tstrand\tfeature" > ../significant_annotation/$iname.gene-intersect.txt; bedtools intersect -wo -a $i -b ../annotation/GCF_015227805.1_bHirRus1.pri.v2_genomic.gene.sort.gff | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$4,$5,$8,$9,$11,$13}' >> ../significant_annotation/$iname.gene-intersect.txt; done
- Run bedtools
intersect
to find genes within 50 kb of significant SNPs.
for i in *.sig.bed; do iname=${i%.bed}; echo -e "chrom-snp\tstart-snp\tend-snp\tWaldP\tchrom-gene\tstart-gene\tend-gene\tstrand\tfeature" > ../significant_annotation/$iname.gene-intersect+50kb.txt; awk '{OFS="\t"}{print $1,$2-50000,$3+50000,$4}' $i | awk '{OFS="\t"}{print($1,$2<0?0:$2,$3,$4)}' | bedtools intersect -wo -a - -b ../annotation/GCF_015227805.1_bHirRus1.pri.v2_genomic.gene.sort.gff | awk 'BEGIN{OFS="\t"}{print $1,$2+50000,$3-50000,$4,$5,$8,$9,$11,$13}' >> ../significant_annotation/$iname.gene-intersect+50kb.txt; done
for i in *-random.assoc.sig.bed; do iname=${i%.bed}; echo -e "chrom-snp\tstart-snp\tend-snp\tWaldP\tchrom-gene\tstart-gene\tend-gene\tstrand\tfeature" > ../significant_annotation/$iname.gene-intersect+50kb.txt; awk '{OFS="\t"}{print $1,$2-50000,$3+50000,$4}' $i | awk '{OFS="\t"}{print($1,$2<0?0:$2,$3,$4)}' | bedtools intersect -wo -a - -b ../annotation/GCF_015227805.1_bHirRus1.pri.v2_genomic.gene.sort.gff | awk 'BEGIN{OFS="\t"}{print $1,$2+50000,$3-50000,$4,$5,$8,$9,$11,$13}' >> ../significant_annotation/$iname.gene-intersect+50kb.txt; done
- Annotation of genic, coding versus noncoding significant SNPs.
tail -n+2 gwas_full_lmm.hybrid-breast-bright.assoc.sig.gene-intersect.txt | awk 'BEGIN{OFS="\t"}{print $1,$2,$3}' | bedtools closest -d -t first -a - -b ~/hirundo_speciation_genomics/genome_annotation/GCF_015227805.1_bHirRus1.pri.v2_genomic.exon.gff > gwas_full_lmm.hybrid-breast-bright.assoc.sig.gene-intersect.exon-distance.txt
tail -n+2 gwas_full_lmm.hybrid-tail-streamer.assoc.sig.gene-intersect.txt | awk 'BEGIN{OFS="\t"}{print $1,$2,$3}' | bedtools closest -d -t first -a - -b ~/hirundo_speciation_genomics/genome_annotation/GCF_015227805.1_bHirRus1.pri.v2_genomic.exon.gff > gwas_full_lmm.hybrid-tail-streamer.assoc.sig.gene-intersect.exon-distance.txt
mkdir lmm_full_chrom
for gwas in ./output/gwas_full_lmm*; do iname=${gwas%.txt}; fix=`echo $iname | cut -d'/' -f3`; for chrom in chr1A-NC_053453.1 chr2-NC_053450.1 chrZ-NC_053488.1; do scaff=`echo $chrom | cut -d'-' -f2`; head -n1 $gwas > ./lmm_full_chrom/$fix.$chrom.txt; grep -w $scaff $gwas >> ./lmm_full_chrom/$fix.$chrom.txt; done; done
rm ./lmm_full_chrom/*.log.*
To summarize and plot the results, run ~/hirundo_speciation_genomics/R/gemma_LMM.R
, which also uses the chr_rename.txt
file.
We will estimate recombination rate variation across the genome using pyrho
, which makes use of a population size history inferred using SMC++
.
cd ./analysis/
mkdir smc++
mkdir pyrho
We will use information about population history (i.e., population size at epoch times) to inform recombination rate analysis. SMC++ can infer population history from multiple samples and unphased genotypes. It is also capable of inferring population split times when two populations are analyzed together. Here, we will install SMC++, run it on test data to ensure the build is working properly on the system, then perform analysis on the barn swallows, with the goal of providing a population history to downstream recombination rate inference.
SMC++
relies on a few specific dependencies. We'll install everything together in a virtual environment.
- Set up install directory
cd ~/hirundo_speciation_genomics/tmp/
mkdir smc++-install
cd smc++-install
- Install Python 3.8 and libraries
sudo apt install python3.8
sudo apt-get install python3.8-dev
- Create and activate virtual environment for
SMC++
.
virtualenv -p /usr/bin/python3.8 smc++-env
source smc++-env/bin/activate
- Install library requirements.
sudo apt-get install -y python3-dev libgmp-dev libmpfr-dev libgsl0-dev
- Install
SMC++
.
pip install git+https://github.com/popgenmethods/smcpp
- Test that the build worked in the virtual environment.
smc++ vcf2smc -h
- Clone repository with example data.
cd ./analysis/smc++
mkdir test_example
cd test_example
mkdir out
mkdir analysis
git clone https://github.com/popgenmethods/smcpp
- Convert VCF to SMC format.
source ~/hirundo_speciation_genomics/tmp/smc++-env/bin/activate
smc++ vcf2smc ../smcpp/example/example.vcf.gz out/example.smc.gz 1 Pop1:msp_0,msp_1
- Fit the model using
estimate
.
smc++ estimate -o analysis/ 1.25e-8 out/example.smc.gz
- Visualize the results using
plot
.
smc++ plot plot.pdf analysis/model.final.json -c
We need a representative population for downstream recombination rate estimation. To avoid issues related to population substructure, we'll analyze rustica from Karasuk, Russia (n = 10). We'll perform analysis using information from all autosomes.
cd ./analysis/smc++
mkdir out
mkdir analysis
mkdir vcf
- Format
popmap.rustica.karasuk
. - Extract random 5 'distinguished' samples to iterate SMC++ conversion over.
shuf -n 5 popmap.rustica.karasuk > distinguished.list
- Format
chromosome-scaffold.auto.table.txt
with conversion between chromosome names and scaffold IDs for autosomes. - Generate input VCF.
bcftools view --threads 16 -S popmap.rustica.karasuk -c 2:minor -m2 -M2 -U -v snps -O z -o ./vcf/rustica.allsites.final.auto.snps.miss02.mac2.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto.snps.miss02.vcf.gz
tabix -p vcf ./vcf/rustica.allsites.final.auto.snps.miss02.mac2.vcf.gz
- Extract VCF for each autosome.
while read i; do scaff=`echo "$i" | cut -f 1`; chrom=`echo "$i" | cut -f 2`; bcftools view --threads 16 -r $scaff -S popmap.rustica.karasuk -O z -o ./vcf/rustica.karasuk.allsites.final.auto.snps.miss02.mac2.$chrom.vcf.gz ./vcf/rustica.allsites.final.auto.snps.miss02.mac2.vcf.gz; done < ./chromosome-scaffold.auto.table.txt
for vcf in ./vcf/*.vcf.gz; do tabix -C -p vcf $vcf; done
- Convert VCFs to SMC input format.
source ~/hirundo_speciation_genomics/tmp/smc++-env/bin/activate
while read i; do scaff=`echo "$i" | cut -f 1`; chrom=`echo "$i" | cut -f 2`; for indv in `cat distinguished.list`; do smc++ vcf2smc -c 50000 ./vcf/rustica.karasuk.allsites.final.auto.snps.miss02.mac2.$chrom.vcf.gz out/rustica.$chrom.$indv.smc.gz $scaff Pop1:HRVN96101,HRVN96107,HRVN96108,HRVN96103,HRVN96104,HRVN96105,HRVN96106,HRVN96300,HRVN96102,HRVN96298 -d $indv $indv; done; done < ./chromosome-scaffold.auto.table.txt
- Fit the model using
estimate
.
smc++ estimate --timepoints 1000 200000 -o analysis/ 2.3e-9 out/rustica.*.smc.gz
- Plot/output results table.
smc++ plot rustica-SMC.pdf analysis/model.final.json -g 1 -c
This writes rustica-SMC.csv
and rustica-SMC.pdf
. The .csv file can be used in downstream pyrho
analysis.
Here, we will install pyrho
, run it on test data to ensure the build is working properly on the system, then perform analysis on the barn swallows.
pyrho
relies on several specific dependencies. We'll install everything within a virtual environment.
- Set up install directory.
cd ~/hirundo_speciation_genomics/tmp/
mkdir pyrho-install
cd pyrho-install
- Create and activate new virtual environment for
pyrho
.
virtualenv -p /usr/bin/python3 pyrho-env
source pyrho-env/bin/activate
- Install
ldpop
dependency.
git clone https://github.com/popgenmethods/ldpop.git ldpop
pip install ldpop/
- Install
cython
.
pip install cython
- Install msprime and libraries.
sudo apt-get install python-dev libgsl0-dev
python3 -m pip install msprime --no-binary msprime
- Install
pyrho
.
git clone https://github.com/popgenmethods/pyrho.git pyrho
pip install pyrho/
- Check that install was successful.
python -m pytest pyrho/tests/tests.py
cd ./analysis/pyrho/
mkdir test_example
cd test_example
mkdir out
mkdir analysis
git clone https://github.com/popgenmethods/pyrho.git pyrho
- Precompute a lookup table.
pyrho make_table -n 20 -N 25 --mu 1.25e-8 --logfile . --outfile ACB_n_20_N_40_lookuptable.hdf --approx --smcpp_file ACB_pop_sizes.csv --decimate_rel_tol 0.1
- Run
hyperparam
to find hyperparameters that are a good fit to input demography.
pyrho hyperparam -n 20 --mu 1.25e-8 --blockpenalty 50,100 --windowsize 25,50 --logfile . --tablefile ACB_n_20_N_40_lookuptable.hdf --num_sims 3 --smcpp_file ACB_pop_sizes.csv --outfile ACB_hyperparam_results.txt
- Run
optimize
to estimate fine-scale recombination map.
pyrho optimize --tablefile ACB_n_20_N_40_lookuptable.hdf --vcffile ACB_chr_1_subset.vcf.gz --outfile ACB_chr_1_subset.rmap --blockpenalty 50 --windowsize 50 --logfile .
Now we're set up to run pyrho on each of the ordered chromosomes in the genome. We'll extract chromosome-specific VCFs for rustica to analyze.
cd ./analysis/pyrho
mkdir out
mkdir analysis
mkdir vcf
- Format
chromosome-scaffold.auto.table.txt
. - Format
chromosome.list
(including Z chromosome). - Extract chromosome-specific VCFs.
while read i; do scaff=`echo "$i" | cut -f 1`; chrom=`echo "$i" | cut -f 2`; bcftools view --threads 16 -r $scaff -O z -o ./vcf/rustica.allsites.final.$chrom.snps.miss02.mac2.vcf.gz ../smc++/vcf/rustica.allsites.final.auto.snps.miss02.mac2.vcf.gz; done < ./chromosome-scaffold.auto.table.txt
bcftools view --threads 16 -S popmap.rustica.karasuk -c 2 -O z -o ./vcf/rustica.allsites.final.chrZ.snps.miss02.mac2.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.chrZ.snps.miss02.vcf.gz
- Generate lookup tables.
source ~/hirundo_speciation_genomics/tmp/pyrho-install/pyrho-env/bin/activate
pyrho make_table --numthreads 24 -n 62 -N 78 --mu 2.3e-9 --logfile . --outfile ./lookup/rustica_n_62_N_78_lookuptable.hdf --approx --smcpp_file ../smc++/rustica-SMC.csv
- Find hyperparameter settings that fit the demography.
pyrho hyperparam --numthreads 24 -n 62 --mu 2.3e-9 --smcpp_file ../smc++/rustica-SMC.csv --blockpenalty 20,25,50,100 --windowsize 25,50 --logfile . --tablefile ./lookup/rustica_n_62_N_78_lookuptable.hdf --num_sims 3 --outfile ./hyperparam/rustica_hyperparam_results.txt
A window size of 50 and block penalty of 25 look reasonable.
6. Run optimize
to estimate fine-scale recombination rates.
for chrom in `cat chromosome.list`; do pyrho optimize --numthreads 24 --tablefile ./lookup/rustica_n_62_N_78_lookuptable.hdf --vcffile ./vcf/rustica.allsites.final.$chrom.snps.miss02.mac2.vcf.gz --outfile ./results/rustica.$chrom.rmap --blockpenalty 25 --windowsize 50 --ploidy 2 --logfile .; done
- Format window BED files.
grep 'NC' ~/hirundo_speciation_genomics/Hirundo_rustica_bHirRus1.final.auto.genome | bedtools makewindows -g - -w 1000000 -s 100000 > ./Hirundo_rustica_bHirRus1.final.1Mb-100kb.bed; grep 'NC' ~/hirundo_speciation_genomics/Hirundo_rustica_bHirRus1.final.chrZ.genome | bedtools makewindows -g - -w 1000000 -s 100000 >> ./Hirundo_rustica_bHirRus1.final.1Mb-100kb.bed
grep 'NC' ~/hirundo_speciation_genomics/Hirundo_rustica_bHirRus1.final.auto.genome | bedtools makewindows -g - -w 1000000 > ./Hirundo_rustica_bHirRus1.final.1Mb.bed; grep 'NC' ~/hirundo_speciation_genomics/Hirundo_rustica_bHirRus1.final.chrZ.genome | bedtools makewindows -g - -w 1000000 >> ./Hirundo_rustica_bHirRus1.final.1Mb.bed
grep 'NC' ~/hirundo_speciation_genomics/Hirundo_rustica_bHirRus1.final.auto.genome | bedtools makewindows -g - -w 100000 > ./Hirundo_rustica_bHirRus1.final.100kb.bed; grep 'NC' ~/hirundo_speciation_genomics/Hirundo_rustica_bHirRus1.final.chrZ.genome | bedtools makewindows -g - -w 100000 >> ./Hirundo_rustica_bHirRus1.final.100kb.bed
- Format
chromosome-scaffold.table.txt
(including Z chromosome). - Concatenate the recombination map.
while read i; do scaff=`echo "$i" | cut -f 1`; chrom=`echo "$i" | cut -f 2`; awk -v var=$scaff 'BEGIN{OFS="\t"}{print var,$1,$2,$3}' ./results/rustica.$chrom.rmap >> ./rustica.all.rmap; done < ./chromosome-scaffold.table.txt
- Calculate mean recombination rate in windows.
echo -e "chrom\tstart\tend\trate" > rustica.rmap.1Mb-100kb.txt; bedtools map -a ./Hirundo_rustica_bHirRus1.final.1Mb-100kb.bed -b ./rustica.all.rmap -o mean -c 4 >> rustica.rmap.1Mb-100kb.txt
echo -e "chrom\tstart\tend\trate" > rustica.rmap.1Mb.txt; bedtools map -a ./Hirundo_rustica_bHirRus1.final.1Mb.bed -b ./rustica.all.rmap -o mean -c 4 >> rustica.rmap.1Mb.txt
echo -e "chrom\tstart\tend\trate" > rustica.rmap.100kb.txt; bedtools map -a ./Hirundo_rustica_bHirRus1.final.100kb.bed -b ./rustica.all.rmap -o mean -c 4 >> rustica.rmap.100kb.txt
- Format sliding window BED files for specific chromosomes.
echo -e 'NC_053453.1\t76187387' | bedtools makewindows -g - -w 100000 -s 10000 > window.100kb-10kb.chr1A-NC_053453.1.bed
echo -e 'NC_053450.1\t156035725' | bedtools makewindows -g - -w 100000 -s 10000 > window.100kb-10kb.chr2-NC_053450.1.bed
echo -e 'NC_053488.1\t90132487' | bedtools makewindows -g - -w 100000 -s 10000 > window.100kb-10kb.chrZ-NC_053488.1.bed
echo -e 'NC_053453.1\t76187387' | bedtools makewindows -g - -w 50000 -s 5000 > window.50kb-5kb.chr1A-NC_053453.1.bed
echo -e 'NC_053450.1\t156035725' | bedtools makewindows -g - -w 50000 -s 5000 > window.50kb-5kb.chr2-NC_053450.1.bed
echo -e 'NC_053488.1\t90132487' | bedtools makewindows -g - -w 50000 -s 5000 > window.50kb-5kb.chrZ-NC_053488.1.bed
echo -e 'NC_053453.1\t76187387' | bedtools makewindows -g - -w 10000 -s 1000 > window.10kb-1kb.chr1A-NC_053453.1.bed
echo -e 'NC_053450.1\t156035725' | bedtools makewindows -g - -w 10000 -s 1000 > window.10kb-1kb.chr2-NC_053450.1.bed
echo -e 'NC_053488.1\t90132487' | bedtools makewindows -g - -w 10000 -s 1000 > window.10kb-1kb.chrZ-NC_053488.1.bed
- Calculate mean recombination rate in sliding windows on specific chromosomes.
bedtools sort -i rustica.all.rmap > rustica.all.sort.rmap
echo -e "chrom\tstart\tend\trate" > rustica.rmap.chr1A-NC_053453.1.100kb-10kb.txt; bedtools map -a window.100kb-10kb.chr1A-NC_053453.1.bed -b rustica.all.sort.rmap -o mean -c 4 >> rustica.rmap.chr1A-NC_053453.1.100kb-10kb.txt
echo -e "chrom\tstart\tend\trate" > rustica.rmap.chr2-NC_053450.1.100kb-10kb.txt; bedtools map -a window.100kb-10kb.chr2-NC_053450.1.bed -b rustica.all.sort.rmap -o mean -c 4 >> rustica.rmap.chr2-NC_053450.1.100kb-10kb.txt
echo -e "chrom\tstart\tend\trate" > rustica.rmap.chrZ-NC_053488.1.100kb-10kb.txt; bedtools map -a window.100kb-10kb.chrZ-NC_053488.1.bed -b rustica.all.sort.rmap -o mean -c 4 >> rustica.rmap.chrZ-NC_053488.1.100kb-10kb.txt
echo -e "chrom\tstart\tend\trate" > rustica.rmap.chr1A-NC_053453.1.50kb-5kb.txt; bedtools map -a window.50kb-5kb.chr1A-NC_053453.1.bed -b rustica.all.sort.rmap -o mean -c 4 >> rustica.rmap.chr1A-NC_053453.1.50kb-5kb.txt
echo -e "chrom\tstart\tend\trate" > rustica.rmap.chr2-NC_053450.1.50kb-5kb.txt; bedtools map -a window.50kb-5kb.chr2-NC_053450.1.bed -b rustica.all.sort.rmap -o mean -c 4 >> rustica.rmap.chr2-NC_053450.1.50kb-5kb.txt
echo -e "chrom\tstart\tend\trate" > rustica.rmap.chrZ-NC_053488.1.50kb-5kb.txt; bedtools map -a window.50kb-5kb.chrZ-NC_053488.1.bed -b rustica.all.sort.rmap -o mean -c 4 >> rustica.rmap.chrZ-NC_053488.1.50kb-5kb.txt
echo -e "chrom\tstart\tend\trate" > rustica.rmap.chr1A-NC_053453.1.10kb-1kb.txt; bedtools map -a window.10kb-1kb.chr1A-NC_053453.1.bed -b rustica.all.sort.rmap -o mean -c 4 >> rustica.rmap.chr1A-NC_053453.1.10kb-1kb.txt
echo -e "chrom\tstart\tend\trate" > rustica.rmap.chr2-NC_053450.1.10kb-1kb.txt; bedtools map -a window.10kb-1kb.chr2-NC_053450.1.bed -b rustica.all.sort.rmap -o mean -c 4 >> rustica.rmap.chr2-NC_053450.1.10kb-1kb.txt
echo -e "chrom\tstart\tend\trate" > rustica.rmap.chrZ-NC_053488.1.10kb-1kb.txt; bedtools map -a window.10kb-1kb.chrZ-NC_053488.1.bed -b rustica.all.sort.rmap -o mean -c 4 >> rustica.rmap.chrZ-NC_053488.1.10kb-1kb.txt
Run ~/hirundo_speciation_genomics/R/pyrho.R
to summarize and plot variation in genome-wide recombination rate.
We'll use pixy
to calculate π, Fst, and dxy within and between populations.
cd ./analysis/
mkdir pixy
cd pixy
mkdir results
mkdir results_chrom-specific
A new version of pixy
was released, with improved run-times and handling of variant data (including bypassing large intermediate zarr files during processing). Runs can also be parallelized. htslib
is now a dependency.
We have previously set up pixy
in its own conda environment with Python 3.6.
conda activate pixy
conda remove pixy
conda install -c conda-forge pixy
conda install -c bioconda htslib
pixy
uses a population map as a companion input file for summary statistic calculations, which is a two-column tab-delimited file with sample ID and population ID.
Population abbreviations are:
RU = rustica GU = gutturalis TY = tytleri RT = rustica-tytleri RG = rustica-gutturalis TG = tytleri-gutturalis SA = savignii TR = transitiva ER = erythrogaster
Population maps are in popmap.pixy
and popmap.pixy.subspecies
. Maps for only female samples (for analysis of the W chromosome) are in popmap.pixy.female
and popmap.pixy.subspecies.female
.
We'll run pixy
with the chromosome-specific all-sites VCFs in ~/hirundo_speciation_genomics/vcf/chrom-specific
as input.
- Format
scaffold.auto+chrZ.list
. - Run
pixyloop.sh
to calculate statistics in windows of various lengths.
conda activate pixy
sh pixyloop.sh scaffold.auto+chrZ.list
- Run
pixyloop_subspecies.sh
to calculate statistics in windows with the expanded 'subspecies' popmap.
sh pixyloop_subspecies.sh scaffold.auto+chrZ.list
- Format
scaffold.chrW.list
. - Run
pixyloop_chrW.sh
.
sh pixyloop_chrW.sh scaffold.chrW.list
- Run
pixyloop_chrW_subspecies.sh
.
sh pixyloop_chrW_subspecies.sh scaffold.chrW.list
Here we'll run analyses with sliding windows and intermediate step sizes on chromosomes 1A, 2, and the Z chromosome.
Chromosome | Scaffold | Length |
---|---|---|
1A | NC_053453.1 | 76187387 |
2 | NC_053450.1 | 156035725 |
Z | NC_053488.1 | 90132487 |
- Format sliding window files for analysis using bedtools
makewindows
.
echo -e 'NC_053453.1\t76187387' | bedtools makewindows -g - -w 100000 -s 10000 > window.100kb-10kb.chr1A-NC_053453.1.bed
echo -e 'NC_053450.1\t156035725' | bedtools makewindows -g - -w 100000 -s 10000 > window.100kb-10kb.chr2-NC_053450.1.bed
echo -e 'NC_053488.1\t90132487' | bedtools makewindows -g - -w 100000 -s 10000 > window.100kb-10kb.chrZ-NC_053488.1.bed
echo -e 'NC_053453.1\t76187387' | bedtools makewindows -g - -w 50000 -s 5000 > window.50kb-5kb.chr1A-NC_053453.1.bed
echo -e 'NC_053450.1\t156035725' | bedtools makewindows -g - -w 50000 -s 5000 > window.50kb-5kb.chr2-NC_053450.1.bed
echo -e 'NC_053488.1\t90132487' | bedtools makewindows -g - -w 50000 -s 5000 > window.50kb-5kb.chrZ-NC_053488.1.bed
echo -e 'NC_053453.1\t76187387' | bedtools makewindows -g - -w 10000 -s 1000 > window.10kb-1kb.chr1A-NC_053453.1.bed
echo -e 'NC_053450.1\t156035725' | bedtools makewindows -g - -w 10000 -s 1000 > window.10kb-1kb.chr2-NC_053450.1.bed
echo -e 'NC_053488.1\t90132487' | bedtools makewindows -g - -w 10000 -s 1000 > window.10kb-1kb.chrZ-NC_053488.1.bed
- Run sliding window analyses on each chromosome.
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chr1A.vcf.gz --populations popmap.pixy --bed_file window.100kb-10kb.chr1A-NC_053453.1.bed --output_folder results_chrom-specific --output_prefix pixy_chr1A-NC_053453.1_100kb-10kb > pixy_chr1A-NC_053453.1_100kb-10kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chr2.vcf.gz --populations popmap.pixy --bed_file window.100kb-10kb.chr2-NC_053450.1.bed --output_folder results_chrom-specific --output_prefix pixy_chr2-NC_053450.1_100kb-10kb > pixy_chr2-NC_053450.1_100kb-10kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chrZ.vcf.gz --populations popmap.pixy --bed_file window.100kb-10kb.chrZ-NC_053488.1.bed --output_folder results_chrom-specific --output_prefix pixy_chrZ-NC_053488.1_100kb-10kb > pixy_chrZ-NC_053488.1_100kb-10kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chr1A.vcf.gz --populations popmap.pixy --bed_file window.50kb-5kb.chr1A-NC_053453.1.bed --output_folder results_chrom-specific --output_prefix pixy_chr1A-NC_053453.1_50kb-5kb > pixy_chr1A-NC_053453.1_50kb-5kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chr2.vcf.gz --populations popmap.pixy --bed_file window.50kb-5kb.chr2-NC_053450.1.bed --output_folder results_chrom-specific --output_prefix pixy_chr2-NC_053450.1_50kb-5kb > pixy_chr2-NC_053450.1_50kb-5kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chrZ.vcf.gz --populations popmap.pixy --bed_file window.50kb-5kb.chrZ-NC_053488.1.bed --output_folder results_chrom-specific --output_prefix pixy_chrZ-NC_053488.1_50kb-5kb > pixy_chrZ-NC_053488.1_50kb-5kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chr1A.vcf.gz --populations popmap.pixy --bed_file window.10kb-1kb.chr1A-NC_053453.1.bed --output_folder results_chrom-specific --output_prefix pixy_chr1A-NC_053453.1_10kb-1kb > pixy_chr1A-NC_053453.1_10kb-1kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chr2.vcf.gz --populations popmap.pixy --bed_file window.10kb-1kb.chr2-NC_053450.1.bed --output_folder results_chrom-specific --output_prefix pixy_chr2-NC_053450.1_10kb-1kb > pixy_chr2-NC_053450.1_10kb-1kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chrZ.vcf.gz --populations popmap.pixy --bed_file window.10kb-1kb.chrZ-NC_053488.1.bed --output_folder results_chrom-specific --output_prefix pixy_chrZ-NC_053488.1_10kb-1kb > pixy_chrZ-NC_053488.1_10kb-1kb.log
- Run sliding window analyses on each chromosome between subspecies.
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chr1A.vcf.gz --populations popmap.pixy.subspecies --bed_file window.100kb-10kb.chr1A-NC_053453.1.bed --output_folder results_chrom-specific --output_prefix pixy_subspecies_chr1A-NC_053453.1_100kb-10kb > pixy_subspecies_chr1A-NC_053453.1_100kb-10kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chr2.vcf.gz --populations popmap.pixy.subspecies --bed_file window.100kb-10kb.chr2-NC_053450.1.bed --output_folder results_chrom-specific --output_prefix pixy_subspecies_chr2-NC_053450.1_100kb-10kb > pixy_subspecies_chr2-NC_053450.1_100kb-10kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chrZ.vcf.gz --populations popmap.pixy.subspecies --bed_file window.100kb-10kb.chrZ-NC_053488.1.bed --output_folder results_chrom-specific --output_prefix pixy_subspecies_chrZ-NC_053488.1_100kb-10kb > pixy_subspecies_chrZ-NC_053488.1_100kb-10kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chr1A.vcf.gz --populations popmap.pixy.subspecies --bed_file window.50kb-5kb.chr1A-NC_053453.1.bed --output_folder results_chrom-specific --output_prefix pixy_subspecies_chr1A-NC_053453.1_50kb-5kb > pixy_subspecies_chr1A-NC_053453.1_50kb-5kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chr2.vcf.gz --populations popmap.pixy.subspecies --bed_file window.50kb-5kb.chr2-NC_053450.1.bed --output_folder results_chrom-specific --output_prefix pixy_subspecies_chr2-NC_053450.1_50kb-5kb > pixy_subspecies_chr2-NC_053450.1_50kb-5kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chrZ.vcf.gz --populations popmap.pixy.subspecies --bed_file window.50kb-5kb.chrZ-NC_053488.1.bed --output_folder results_chrom-specific --output_prefix pixy_subspecies_chrZ-NC_053488.1_50kb-5kb > pixy_subspecies_chrZ-NC_053488.1_50kb-5kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chr1A.vcf.gz --populations popmap.pixy.subspecies --bed_file window.10kb-1kb.chr1A-NC_053453.1.bed --output_folder results_chrom-specific --output_prefix pixy_subspecies_chr1A-NC_053453.1_10kb-1kb > pixy_subspecies_chr1A-NC_053453.1_10kb-1kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chr2.vcf.gz --populations popmap.pixy.subspecies --bed_file window.10kb-1kb.chr2-NC_053450.1.bed --output_folder results_chrom-specific --output_prefix pixy_subspecies_chr2-NC_053450.1_10kb-1kb > pixy_subspecies_chr2-NC_053450.1_10kb-1kb.log
pixy --n_cores 6 --stats pi dxy fst --vcf ~/hirundo_speciation_genomics/vcf/chrom-specific/hirundo_rustica+smithii.allsites.final.chrZ.vcf.gz --populations popmap.pixy.subspecies --bed_file window.10kb-1kb.chrZ-NC_053488.1.bed --output_folder results_chrom-specific --output_prefix pixy_subspecies_chrZ-NC_053488.1_10kb-1kb > pixy_subspecies_chrZ-NC_053488.1_10kb-1kb.log
We'll concatenate windowed outputs for all of the chromosomes, which can be parsed further by population.
- Format complete scaffold list.
cat scaffold.auto+chrZ.list scaffold.chrW.list > scaffold.all.list
- Format 'ordered' scaffold list (no unplaced scaffolds).
grep -v '-un' scaffold.all.list > scaffold.order.list
- Run
concatenatePixy.sh
to combine results at different window resolutions.
sh concatenatePixy.sh
- Run
concatenatePixyOrder.sh
to combine results for ordered chromosomes.
sh concatenatePixyOrder.sh
- Run
concatenatePixyOrder_subspecies.sh
.
sh concatenatePixyOrder_subspecies.sh
To say whether regions of the genome strongly associated with traits have experienced divergent selection in the parental populations, we first need to explore the overlap between association peaks and Fst values by answering whether significant SNPs have higher Fst than non-significant SNPs and/or genome background Fst distributions.
cd ./analysis/gemma
mkdir fst
awk 'BEGIN{OFS="\t"}{if ($1=="RU" && $2=="TY") print $3,$4-1,$5,$6}' ../pixy/pixy.all.order.fst.10kb.txt > ./fst/pixy.ruty.order.fst.10kb.bed
awk 'BEGIN{OFS="\t"}{if ($1=="RU" && $2=="GU") print $3,$4-1,$5,$6}' ../pixy/pixy.all.order.fst.10kb.txt > ./fst/pixy.rugu.order.fst.10kb.bed
awk 'BEGIN{OFS="\t"}{if ($1=="GU" && $2=="TY") print $3,$4-1,$5,$6}' ../pixy/pixy.all.order.fst.10kb.txt > ./fst/pixy.guty.order.fst.10kb.bed
for pop in ruty rugu guty; do echo -e "chrom\tsnp-start\tsnp-end\twald-p\twindow-start\twindow-end\tfst" > ./fst/gwas_full_lmm.full-breast-bright.assoc.sig.fst-${pop}.txt; bedtools intersect -a ./significant/gwas_full_lmm.full-breast-bright.assoc.sig.bed -b ./fst/pixy.$pop.order.fst.10kb.bed -wao | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$4,$6,$7,$8}' | awk -F"\t" '!seen[$5, $6, $7]++' - >> ./fst/gwas_full_lmm.full-breast-bright.assoc.sig.fst-${pop}.txt; done
for pop in ruty rugu guty; do echo -e "chrom\tsnp-start\tsnp-end\twald-p\twindow-start\twindow-end\tfst" > ./fst/gwas_full_lmm.full-tail-streamer.assoc.sig.fst-${pop}.txt; bedtools intersect -a ./significant/gwas_full_lmm.full-tail-streamer.assoc.sig.bed -b ./fst/pixy.$pop.order.fst.10kb.bed -wao | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$4,$6,$7,$8}' | awk -F"\t" '!seen[$5, $6, $7]++' - >> ./fst/gwas_full_lmm.full-tail-streamer.assoc.sig.fst-${pop}.txt; done
for pop in ruty rugu guty; do echo -e "chrom\tsnp-start\tsnp-end\twald-p\twindow-start\twindow-end\tfst" > ./fst/gwas_full_lmm.full-breast-bright.assoc.nosig.fst-${pop}.txt; awk 'BEGIN{OFS="\t"}{if (-log($13)/log(10)<5) print $1,$3-1,$3,$13}' ./output/gwas_full_lmm.full-breast-bright.assoc.txt | grep -v 'NW_' | bedtools intersect -a - -b ./fst/pixy.$pop.order.fst.10kb.bed -wao | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$4,$6,$7,$8}' | awk -F"\t" '!seen[$5, $6, $7]++' - >> ./fst/gwas_full_lmm.full-breast-bright.assoc.nosig.fst-${pop}.txt; done
for pop in ruty rugu guty; do echo -e "chrom\tsnp-start\tsnp-end\twald-p\twindow-start\twindow-end\tfst" > ./fst/gwas_full_lmm.full-tail-streamer.assoc.nosig.fst-${pop}.txt; awk 'BEGIN{OFS="\t"}{if (-log($13)/log(10)<5) print $1,$3-1,$3,$13}' ./output/gwas_full_lmm.full-tail-streamer.assoc.txt | grep -v 'NW_' | bedtools intersect -a - -b ./fst/pixy.$pop.order.fst.10kb.bed -wao | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$4,$6,$7,$8}' | awk -F"\t" '!seen[$5, $6, $7]++' - >> ./fst/gwas_full_lmm.full-tail-streamer.assoc.nosig.fst-${pop}.txt; done
Run ~/hirundo_speciation_genomics/R/pixy.R
to summarize, plot, and statistically compare π, Fst, dxy, and recombination rate.
Run ~/hirundo_speciation_genomics/R/candidate_plotting_popgen_stats.R
to visualize population genetic summary statistics in candidate trait loci, along with additional statistics to detect signatures of selection (see below).
Run code blocks at the end of ~/hirundo_speciation_genomics/R/gemma_LMM.R
to summarize, plot, and statistically compare Fst distributions between significant and non-significant GWA SNPs.
It may be useful to further clarify which population(s) have been the targets of selection using population branch statistics to examine cases when one population has a much longer locus/region-specific branch length than the others, following Yi et al. 2010: PBS = (T12 + T13 - T23)/2, where T = -log(1-Fst) for a given pair of populations.
We have Fst data required to calculate PBS in sliding windows from pixy
. Run ~/hirundo_speciation_genomics/R/pbs.R
to perform calculations (also in ~/hirundo_speciation_genomics/R/candidate_plotting_popgen_stats.R
).
We'll look for fluctuations in the allele frequency spectrum using Tajima's D, comparing Tajima and Watterson's θ. We'll estimate Tajima's D using VCF-kit
, which has the ability to perform sliding window scans of the statistic. We do not want to limit analyses to SNPs with minor allele frequency cutoffs, since this will bias Watterson's estimator.
cd ~/hirundo_speciation_genomics/tmp/
mkdir vcf-kit-install
cd vcf-kit-install
virtualenv -p /usr/bin/python3.8 vcf-kit-env
source ./vcf-kit-env/bin/activate
pip install VCF-kit
cd ./analysis/
mkdir tajima
cd tajima
mkdir vcf
mkdir results
mkdir log
popmap.rustica
popmap.tytleri
popmap.gutturalis
chrom.list
bcftools view --threads 8 -S popmap.rustica -O z -o ./vcf/hirundo_rustica.parental.rustica.snps.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto.snps.miss02.vcf.gz
bcftools view --threads 8 -S popmap.tytleri -O z -o ./vcf/hirundo_rustica.parental.tytleri.snps.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto.snps.miss02.vcf.gz
bcftools view --threads 8 -S popmap.gutturalis -O z -o ./vcf/hirundo_rustica.parental.gutturalis.snps.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto.snps.miss02.vcf.gz
bcftools view --threads 8 -S popmap.rustica -O z -o ./vcf/hirundo_rustica.parental.rustica.snps.chrZ.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.chrZ.snps.miss02.vcf.gz
bcftools view --threads 8 -S popmap.tytleri -O z -o ./vcf/hirundo_rustica.parental.tytleri.snps.chrZ.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.chrZ.snps.miss02.vcf.gz
bcftools view --threads 8 -S popmap.gutturalis -O z -o ./vcf/hirundo_rustica.parental.gutturalis.snps.chrZ.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.chrZ.snps.miss02.vcf.gz
Run runTajima.sh
to calculate Tajima's D genome-wide.
source ~/hirundo_speciation_genomics/tmp/vcf-kit-install/vcf-kit-env/bin/activate
sh runTajima.sh
bcftools view --threads 16 -r NC_053453.1 -O z ./vcf/hirundo_rustica.parental.rustica.snps.vcf.gz | vk tajima 50000 5000 - > ./results/tajima.rustica.chr1A.50kb-5kb.txt
bcftools view --threads 16 -r NC_053453.1 -O z ./vcf/hirundo_rustica.parental.tytleri.snps.vcf.gz | vk tajima 50000 5000 - > ./results/tajima.tytleri.chr1A.50kb-5kb.txt
bcftools view --threads 16 -r NC_053453.1 -O z ./vcf/hirundo_rustica.parental.gutturalis.snps.vcf.gz | vk tajima 50000 5000 - > ./results/tajima.gutturalis.chr1A.50kb-5kb.txt
bcftools view --threads 16 -r NC_053488.1 -O z ./vcf/hirundo_rustica.parental.rustica.snps.chrZ.vcf.gz | vk tajima 50000 5000 - > ./results/tajima.rustica.chrZ.50kb-5kb.txt
bcftools view --threads 16 -r NC_053488.1 -O z ./vcf/hirundo_rustica.parental.tytleri.snps.chrZ.vcf.gz | vk tajima 50000 5000 - > ./results/tajima.tytleri.chrZ.50kb-5kb.txt
bcftools view --threads 16 -r NC_053488.1 -O z ./vcf/hirundo_rustica.parental.gutturalis.snps.chrZ.vcf.gz | vk tajima 50000 5000 - > ./results/tajima.gutturalis.chrZ.50kb-5kb.txt
bcftools view --threads 16 -r NC_053450.1 -O z ./vcf/hirundo_rustica.parental.rustica.snps.vcf.gz | vk tajima 50000 5000 - > ./results/tajima.rustica.chr2.50kb-5kb.txt
bcftools view --threads 16 -r NC_053450.1 -O z ./vcf/hirundo_rustica.parental.tytleri.snps.vcf.gz | vk tajima 50000 5000 - > ./results/tajima.tytleri.chr2.50kb-5kb.txt
bcftools view --threads 16 -r NC_053450.1 -O z ./vcf/hirundo_rustica.parental.gutturalis.snps.vcf.gz | vk tajima 50000 5000 - > ./results/tajima.gutturalis.chr2.50kb-5kb.txt
We'll use haplotype diversity statistics to identify regions with patterns consistent with selection. First, we will phase variants for the parental populations. We'll then use the R package rehh
to quantify haplotype statistics.
cd ./analysis/
mkdir rehh
cd rehh
mkdir shapeit
cd shapeit
mkdir input
mkdir pirs
mkdir log
mkdir results
We'll perform statistical read-backed phasing using SHAPEIT2
.
wget https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.v2.r904.glibcv2.17.linux.tar.gz
tar -xf shapeit.v2.r904.glibcv2.17.linux.tar.gz
rm shapeit.v2.r904.glibcv2.17.linux.tar.gz
mv shapeit.v2.904.3.10.0-693.11.6.el7.x86_64 shapeit2
wget https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/files/extractPIRs.v1.r68.x86_64.tgz
tar -xf extractPIRs.v1.r68.x86_64.tgz
rm extractPIRs.v1.r68.x86_64.tgz
mv extractPIRs.v1.r68.x86_64 extractPIRs
- Format population maps.
popmap.rustica
popmap.tytleri
popmap.gutturalis
- Concatenate popmaps.
cat popmap.rustica popmap.tytleri popmap.gutturalis > popmap.all
- Format
chromosome-scaffold.ordered.table.txt
andchromosome-scaffold.table.txt
. - Run
parseParentalVCF.sh
to extract chromosome-specific VCFs for each parental population.
sh parseParentalVCF.sh chromosome-scaffold.ordered.table.txt
gunzip ./input/*.vcf.gz
- Format bam list input files for
extractPIRs
.
sh makeBamlist.sh
rm bamlist.chr*-un*
- Format
chrom.list
. - Run
runExtractPIRs.sh
to extract phase informative reads.
sh runExtractPIRs.sh chrom.list
We'll run SHAPEIT2
using the settings:
- states = 1000
- burn = 200
- prune = 210
- main = 1000
- force
- Run
runShapeitAssemble.sh
to assemble haplotypes.
sh runShapeitAssemble.sh chrom.list
- Run
runShapeitConvert.sh
to convert haplotype output to VCF.
sh runShapeitConvert.sh chrom.list
- Run
runCompressIndexVCFs.sh
to compress and index results.
sh runCompressIndexVCFs.sh chrom.list
Run parsePhasedVCFs.sh
to extract parental population-specific VCFs.
sh parsePhasedVCFs.sh chrom.list
We now have input data that we can analyze using rehh
.
cd ./analysis/rehh/
Run runScans.sh
to call the R script rehhScans.R
on each chromosome, perform haplotype scans, and calculate iHS and xp-EHH.
sh runScans.sh
Concatenate results.
for pop in rustica tytleri gutturalis; do head -n1 ./results/iHS_chr1_${pop}.txt > ./results/iHS_all_${pop}.txt; for chrom in `cat chrom.list`; do tail -n+2 ./results/iHS_${chrom}_${pop}.txt >> ./results/iHS_all_${pop}.txt; done; done
for pop in rustica-tytleri rustica-gutturalis tytleri-gutturalis; do head -n1 ./results/XP-EHH_chr1_${pop}.txt > ./results/XP-EHH_all_${pop}.txt; for chrom in `cat chrom.list`; do tail -n+2 ./results/XP-EHH_${chrom}_${pop}.txt >> ./results/XP-EHH_all_${pop}.txt; done; done
Genotype x phenotype and population genetic analyses revealed a set of candidate genomic regions that are both strongly associated with ventral plumage coloration or tail streamer lengths and that also exhibit local signatures of divergent selection between parental populations. To examine further evidence of divergent selection and the possibility that these regions promote reproductive isolation between parental populations, we'll investigate geographic clines based on hybrid indices derived from trait loci versus the genome background.
cd ./analysis/
mkdir hzar
cd hzar
mkdir input
We'll first fit geographic cline models to ancestry estimates at loci associated with ventral color and tail streamer length.
Regions containing trait loci:
- KITLG region of Chr 1A; NC_053453.1:43850000-44600000
- PLXNC1 region of Chr 1A; NC_053453.1:46450000-46600000
- SPEF2/PRLR region of Z chr; NC_053488.1:18800000-19100000
- SLC45A2 region of Z chr; NC_053488.1:19300000-19400000
- BNC2 region of Z chr; NC_053488.1:33000000-33500000
- GNAQ region of Z chr; NC_053488.1:35750000-36250000
- ROR2 region of Z chr; NC_053488.1:44050000-44200000
- APC/CAMK4 region of Z chr; NC_053488.1:46000000-47000000
- ICE1/lncRNA region of Chr 2; NC_053450.1:97800000-98200000
- PDE1C region of Chr 2; NC_053450.1:100000000-101000000
- Retrieve popmaps from
introgress
analysis.
cp ../introgress/popmap.* .
- Extract SNPs for each hybrid zone and set of parental populations with minor allele frequency >= 0.1.
bcftools view --threads 16 -q 0.1:minor -S popmap.rustica-hybrids-tytleri -O z -o ./tmp.snps.maf1.rustica-tytleri_hybrids.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz
bcftools view --threads 16 -q 0.1:minor -S popmap.rustica-hybrids-gutturalis -O z -o ./tmp.snps.maf1.rustica-gutturalis_hybrids.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz
bcftools view --threads 16 -q 0.1:minor -S popmap.tytleri-hybrids-gutturalis -O z -o ./tmp.snps.maf1.tytleri-gutturalis_hybrids.vcf.gz ~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz
- Run scripts to extract SNPs in focal regions for each hybrid zone.
sh extractSNPs_trait_rustica-tytleri.sh
sh extractSNPs_trait_rustica-gutturalis.sh
sh extractSNPs_trait_tytleri-gutturalis.sh
Run these analyses in ~/hirundo_speciation_genomics/R/hzar_trait_rustica-tytleri.R
, ~/hirundo_speciation_genomics/R/hzar_trait_rustica-gutturalis.R
, and ~/hirundo_speciation_genomics/R/hzar_trait_tytleri-gutturalis.R
.
Run these analyses in ~/hirundo_speciation_genomics/R/hzar_trait_rustica-tytleri.R
, ~/hirundo_speciation_genomics/R/hzar_trait_rustica-gutturalis.R
, and ~/hirundo_speciation_genomics/R/hzar_trait_tytleri-gutturalis.R
.
We want to fit geographic clines to a background distribution of loci that makes biological sense, allowing for variation in evolutionary processes and their effects across loci. Inspired by helpful suggestions from Sean Stankowski (and his analyses of Mimulus), we'll fit clines to a sample of loci from across the genome, then summarize genome-wide variation in cline parameters (thanks, Sean!).
cd ./analysis/hzar/
mkdir input_background_introgress
mkdir input_background_hzar
mkdir results_introgress
mkdir results_hzar
mkdir log
Format metadata tables for each hybrid zone:
data.rustica-tytleri.txt
data.rustica-gutturalis.txt
data.tytleri-gutturalis.txt
- Format
candidate.bed
with regions of trait loci. - Make lists of all SNPs per hybrid zone (to be queried below).
bcftools query -f '%CHROM\t%POS\n' tmp.snps.maf1.rustica-tytleri_hybrids.vcf.gz | grep 'NC_' > all.rustica-tytleri.snps.maf1.txt
bcftools query -f '%CHROM\t%POS\n' tmp.snps.maf1.rustica-gutturalis_hybrids.vcf.gz | grep 'NC_' > all.rustica-gutturalis.snps.maf1.txt
bcftools query -f '%CHROM\t%POS\n' tmp.snps.maf1.tytleri-gutturalis_hybrids.vcf.gz | grep 'NC_' > all.tytleri-gutturalis.snps.maf1.txt
- Make lists of 1000 randomly sampled focal SNPs (minor allele frequency >= 0.1).
bcftools query -f '%CHROM\t%POS\n' tmp.snps.maf1.rustica-tytleri_hybrids.vcf.gz | grep 'NC_' | awk '{OFS="\t"}{print $1,$2-1,$2}' | bedtools intersect -v -wb -a - -b candidate.bed | shuf -n 1000 | awk '{OFS="\t"}{print $1,$3}' > background.rustica-tytleri.snps.maf1.txt
bcftools query -f '%CHROM\t%POS\n' tmp.snps.maf1.rustica-gutturalis_hybrids.vcf.gz | grep 'NC_' | awk '{OFS="\t"}{print $1,$2-1,$2}' | bedtools intersect -v -wb -a - -b candidate.bed | shuf -n 1000 | awk '{OFS="\t"}{print $1,$3}' > background.rustica-gutturalis.snps.maf1.txt
bcftools query -f '%CHROM\t%POS\n' tmp.snps.maf1.tytleri-gutturalis_hybrids.vcf.gz | grep 'NC_' | awk '{OFS="\t"}{print $1,$2-1,$2}' | bedtools intersect -v -wb -a - -b candidate.bed | shuf -n 1000 | awk '{OFS="\t"}{print $1,$3}' > background.tytleri-gutturalis.snps.maf1.txt
- Run
makeSNPbackground.sh
to extract loci consisting of 100 SNPs around each focal SNP.
sh makeSNPbackground.sh rustica-tytleri
sh makeSNPbackground.sh rustica-gutturalis
sh makeSNPbackground.sh tytleri-gutturalis
- Index input VCFs.
for i in ./*.vcf.gz; do tabix -p vcf $i; done
- Run
parseBackgroundVCFs.sh
to extract SNP data for background loci.
sh parseBackgroundVCFs.sh
- Check that all VCFs consist of loci with 100 SNPs.
for vcf in ./input_background_introgress/*.vcf.gz; do bcftools query -f '%POS\n' $vcf | wc -l; done
Now we'll run analysis on each of the input locus VCFs to prepare inputs for HZAR.
Run wrapper scripts runIntrogress_rustica-tytleri.sh
, runIntrogress_rustica-gutturalis.sh
, and runIntrogress_tytleri-gutturalis.sh
, to perform analysis in introgress
per locus by calling introgress_rustica-tytleri.R
, introgress_rustica-gutturalis.R
, and introgress_tytleri-gutturalis.R
, respectively.
sh runIntrogress_rustica-tytleri.sh
sh runIntrogress_rustica-gutturalis.sh
sh runIntrogress_tytleri-gutturalis.sh
This outputs the results to ./results_introgress/
.
Run makeInputHZAR.sh
to format input data for HZAR, along with a locus lookup table.
./makeInputHZAR.sh rustica-tytleri
./makeInputHZAR.sh rustica-gutturalis
./makeInputHZAR.sh tytleri-gutturalis
Note: we call this as a bash
script instead of with sh
for compatibility with internal paste
command.
Run wrapper scripts to call hzar_background_rustica-tytleri.R
, hzar_background_rustica-gutturalis.R
, and hzar_background_tytleri-gutturalis.R
per locus.
sh runHZAR_rustica-tytleri.sh
sh runHZAR_rustica-gutturalis.sh
sh runHZAR_tytleri-gutturalis.sh
These scripts output Rdata results files for each locus to ./hzar_results
.
Run ~/hirundo_speciation_genomics/R/hzar_summary.R
and /R/hzar_plotting
.
We'll examine introgression of trait loci compared to background loci also in the context of genome-wide ancestry using Bayesian genomic clines. We'll use a modification of the original bgc
software to perform analysis on hybrid index estimates per locus, accounting for female hemizygosity on the Z chromosome.
cd ./analysis/
mkdir bgc
cd bgc
mkdir input_locus
- Format popmaps, including sample, population, and sex data.
popmap.rustica-tytleri.txt
popmap.rustica-gutturalis.txt
popmap.tytleri-gutturalis.txt
- Retrieve background locus tables from
hzar
analysis directory.
cp ../hzar/*.locus-table.txt .
Run makeLocusInput.sh
to format input data per background locus.
./makeLocusInput.sh rustica-tytleri
./makeLocusInput.sh rustica-gutturalis
./makeLocusInput.sh tytleri-gutturalis
Inputs for the trait loci can be formatted using the popmaps and locus-specific data in ../hzar/data.*.csv
files.
We want to know whether alleles in outlier regions associated with the divergent traits in barn swallows and which resist gene flow are statistically associated (i.e., in linkage disequilibrium) with each other, and whether this genetic coupling has been maintained by selection over time in hybrid zones. We expect that this signature will pop up in the hybrid populations, where different recombinants of the trait loci are being tested by selection, which has favored certain combinations of alleles at the different regions, producing long range and/or interchromosomal LD.
A problem that this presents is that LD is automatically generated as a consequence of admixture in hybrid zones, and will be especially prevalent in genomic regions that are highly-differentiated between parental populations. So, evidence for coupling between the trait loci requires an appropriate background for comparison, since allele frequency differences are higher in these regions than elsewhere in the genome.
The solution is to compare LD in hybrids between SNPs in candidate regions in bins of allele frequency differences between parental populations to LD between random SNPs within the same allele frequency differences. This allows a fair evaluation of whether LD between the candidate regions is truly high relative to a background with matched allele frequency differences, providing evidence for selection for coupling in trait loci.
cd ./analysis/
mkdir ld
cd ld
mkdir vcf
mkdir afd
mkdir r2
mkdir r2_pairwise
- Format population lists.
grep 'RU' ../pixy/popmap.pixy | cut -f1 > popmap.rustica
grep 'TY' ../pixy/popmap.pixy | cut -f1 > popmap.tytleri
grep 'GU' ../pixy/popmap.pixy | cut -f1 > popmap.gutturalis
grep 'RT' ../pixy/popmap.pixy | cut -f1 > popmap.rustica-tytleri
grep 'RG' ../pixy/popmap.pixy | cut -f1 > popmap.rustica-gutturalis
grep 'TG' ../pixy/popmap.pixy | cut -f1 > popmap.tytleri-gutturalis
- Format trait locus BED file
candidate.bed
.
java -Xmx96g -jar ../gemma/beagle.28Jun21.220.jar nthreads=24 gt=~/hirundo_speciation_genomics/vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.vcf.gz out=./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased impute=false
tabix -p vcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz
We'll use these results to parse sets of SNPs within bins of allele frequency differences.
vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rustica --freq --out ./afd/allele-freq.all.rustica
vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.tytleri --freq --out ./afd/allele-freq.all.tytleri
vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.gutturalis --freq --out ./afd/allele-freq.all.gutturalis
We'll fix the header line and data entries to split the allele from the frequency for each allele.
echo -e "CHROM\tPOS\tN_ALLELES\tN_CHR\tALLELE1\tFREQ1\tALLELE2\tFREQ2" > ./afd/allele-freq.all.rustica.txt; tr ':' '\t' < ./afd/allele-freq.all.rustica.frq | tail -n+2 >> ./afd/allele-freq.all.rustica.txt
echo -e "CHROM\tPOS\tN_ALLELES\tN_CHR\tALLELE1\tFREQ1\tALLELE2\tFREQ2" > ./afd/allele-freq.all.tytleri.txt; tr ':' '\t' < ./afd/allele-freq.all.tytleri.frq | tail -n+2 >> ./afd/allele-freq.all.tytleri.txt
echo -e "CHROM\tPOS\tN_ALLELES\tN_CHR\tALLELE1\tFREQ1\tALLELE2\tFREQ2" > ./afd/allele-freq.all.gutturalis.txt; tr ':' '\t' < ./afd/allele-freq.all.gutturalis.frq | tail -n+2 >> ./afd/allele-freq.all.gutturalis.txt
Run ~/hirundo_speciation_genomics/R/ld_allele-freq-diff.R
. This outputs text files with allele frequency differences between pairs of populations.
- Parse data in trait loci.
echo -e "CHROM\tPOS\tAFD" > ./afd/allele-freq-diff.candidate.rustica-tytleri.txt; tail -n+2 ./afd/allele-freq-diff.all.rustica-tytleri.txt | awk '{OFS="\t"}{print $1,$2-1,$2,$3}' | bedtools intersect -a - -b candidate.bed | awk '{OFS="\t"}{print $1,$3,$4}' >> ./afd/allele-freq-diff.candidate.rustica-tytleri.txt
echo -e "CHROM\tPOS\tAFD" > ./afd/allele-freq-diff.candidate.rustica-gutturalis.txt; tail -n+2 ./afd/allele-freq-diff.all.rustica-gutturalis.txt | awk '{OFS="\t"}{print $1,$2-1,$2,$3}' | bedtools intersect -a - -b candidate.bed | awk '{OFS="\t"}{print $1,$3,$4}' >> ./afd/allele-freq-diff.candidate.rustica-gutturalis.txt
echo -e "CHROM\tPOS\tAFD" > ./afd/allele-freq-diff.candidate.tytleri-gutturalis.txt; tail -n+2 ./afd/allele-freq-diff.all.tytleri-gutturalis.txt | awk '{OFS="\t"}{print $1,$2-1,$2,$3}' | bedtools intersect -a - -b candidate.bed | awk '{OFS="\t"}{print $1,$3,$4}' >> ./afd/allele-freq-diff.candidate.tytleri-gutturalis.txt
- Parse data in the background.
echo -e "CHROM\tPOS\tAFD" > ./afd/allele-freq-diff.background.rustica-tytleri.txt; tail -n+2 ./afd/allele-freq-diff.all.rustica-tytleri.txt | awk '{OFS="\t"}{print $1,$2-1,$2,$3}' | bedtools intersect -v -a - -b candidate.bed | awk '{OFS="\t"}{print $1,$3,$4}' >> ./afd/allele-freq-diff.background.rustica-tytleri.txt
echo -e "CHROM\tPOS\tAFD" > ./afd/allele-freq-diff.background.rustica-gutturalis.txt; tail -n+2 ./afd/allele-freq-diff.all.rustica-gutturalis.txt | awk '{OFS="\t"}{print $1,$2-1,$2,$3}' | bedtools intersect -v -a - -b candidate.bed | awk '{OFS="\t"}{print $1,$3,$4}' >> ./afd/allele-freq-diff.background.rustica-gutturalis.txt
echo -e "CHROM\tPOS\tAFD" > ./afd/allele-freq-diff.background.tytleri-gutturalis.txt; tail -n+2 ./afd/allele-freq-diff.all.tytleri-gutturalis.txt | awk '{OFS="\t"}{print $1,$2-1,$2,$3}' | bedtools intersect -v -a - -b candidate.bed | awk '{OFS="\t"}{print $1,$3,$4}' >> ./afd/allele-freq-diff.background.tytleri-gutturalis.txt
This is also done in ~/hirundo_speciation_genomics/R/ld_allele-freq-diff.R
. This outputs data for matched candidate & background SNPs in allele frequency difference bins with 0.05 increments. These are in the ./afd/
subdirectory.
We'll measure haplotype r2 between SNPs in matched bins of allele frequency differences for trait loci and the genome background.
Note: r2 is influenced by sample size, so we will randomly sample 20 individuals per population for analysis.
shuf -n 20 popmap.rustica > popmap.rand.rustica
shuf -n 20 popmap.tytleri > popmap.rand.tytleri
shuf -n 20 popmap.gutturalis > popmap.rand.gutturalis
shuf -n 20 popmap.rustica-tytleri > popmap.rand.rustica-tytleri
shuf -n 20 popmap.rustica-gutturalis > popmap.rand.rustica-gutturalis
shuf -n 20 popmap.tytleri-gutturalis > popmap.rand.tytleri-gutturalis
- In trait loci bins.
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.candidate.rustica-tytleri.bin$bin.txt | cut -f1,2 > tmp.rt-cand.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.rustica-tytleri --positions tmp.rt-cand.position.txt --interchrom-hap-r2 --out ./r2/r2.candidate.AFD-bin$bin.rustica-tytleri; done; rm tmp.rt-cand.position.txt
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.candidate.rustica-gutturalis.bin$bin.txt | cut -f1,2 > tmp.rg-cand.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.rustica-gutturalis --positions tmp.rg-cand.position.txt --interchrom-hap-r2 --out ./r2/r2.candidate.AFD-bin$bin.rustica-gutturalis; done; rm tmp.rg-cand.position.txt
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.candidate.tytleri-gutturalis.bin$bin.txt | cut -f1,2 > tmp.tg-cand.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.tytleri-gutturalis --positions tmp.tg-cand.position.txt --interchrom-hap-r2 --out ./r2/r2.candidate.AFD-bin$bin.tytleri-gutturalis; done; rm tmp.tg-cand.position.txt
- In background bins.
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.background.rustica-tytleri.bin$bin.txt | cut -f1,2 > tmp.rt-back.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.rustica-tytleri --positions tmp.rt-back.position.txt --interchrom-hap-r2 --out ./r2/r2.background.AFD-bin$bin.rustica-tytleri; done; rm tmp.rt-back.position.txt
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.background.rustica-gutturalis.bin$bin.txt | cut -f1,2 > tmp.rg-back.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.rustica-gutturalis --positions tmp.rg-back.position.txt --interchrom-hap-r2 --out ./r2/r2.background.AFD-bin$bin.rustica-gutturalis; done; rm tmp.rg-back.position.txt
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.background.tytleri-gutturalis.bin$bin.txt | cut -f1,2 > tmp.tg-back.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.tytleri-gutturalis --positions tmp.tg-back.position.txt --interchrom-hap-r2 --out ./r2/r2.background.AFD-bin$bin.tytleri-gutturalis; done; rm tmp.tg-back.position.txt
- In trait loci bins.
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.candidate.rustica-tytleri.bin$bin.txt | cut -f1,2 > tmp.ru-cand.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.rustica --positions tmp.ru-cand.position.txt --interchrom-hap-r2 --out ./r2/r2.candidate.AFD-bin$bin.rustica; done; rm tmp.ru-cand.position.txt
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.candidate.rustica-tytleri.bin$bin.txt | cut -f1,2 > tmp.ty-cand.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.tytleri --positions tmp.ty-cand.position.txt --interchrom-hap-r2 --out ./r2/r2.candidate.AFD-bin$bin.tytleri; done; rm tmp.ty-cand.position.txt
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.candidate.rustica-gutturalis.bin$bin.txt | cut -f1,2 > tmp.gu-cand.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.gutturalis --positions tmp.gu-cand.position.txt --interchrom-hap-r2 --out ./r2/r2.candidate.AFD-bin$bin.gutturalis; done; rm tmp.gu-cand.position.txt
- In background bins.
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.background.rustica-tytleri.bin$bin.txt | cut -f1,2 > tmp.ru-back.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.rustica --positions tmp.ru-back.position.txt --interchrom-hap-r2 --out ./r2/r2.background.AFD-bin$bin.rustica; done; rm tmp.ru-back.position.txt
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.background.rustica-tytleri.bin$bin.txt | cut -f1,2 > tmp.ty-back.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.tytleri --positions tmp.ty-back.position.txt --interchrom-hap-r2 --out ./r2/r2.background.AFD-bin$bin.tytleri; done; rm tmp.ty-back.position.txt
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.background.rustica-gutturalis.bin$bin.txt | cut -f1,2 > tmp.gu-back.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.gutturalis --positions tmp.gu-back.position.txt --interchrom-hap-r2 --out ./r2/r2.background.AFD-bin$bin.gutturalis; done; rm tmp.gu-back.position.txt
Here, we want to approximate initial admixture LD in hybrids in the candidate regions, for comparison to the actual hybrids. Because we've sampled a random 20 individuals per pop, we'll sample 10 individuals from each pair of populations to make a simulated 'F1' pop.
- Format simulated hybrid population lists.
shuf -n 10 popmap.rustica > popmap.sim.rustica-tytleri; shuf -n 10 popmap.tytleri >> popmap.sim.rustica-tytleri
shuf -n 10 popmap.rustica > popmap.sim.rustica-gutturalis; shuf -n 10 popmap.gutturalis >> popmap.sim.rustica-gutturalis
shuf -n 10 popmap.tytleri > popmap.sim.tytleri-gutturalis; shuf -n 10 popmap.gutturalis >> popmap.sim.tytleri-gutturalis
- Calculate interchromosomal r2 in trait loci.
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.candidate.rustica-tytleri.bin$bin.txt | cut -f1,2 > tmp.sim.rt-cand.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.sim.rustica-tytleri --positions tmp.sim.rt-cand.position.txt --interchrom-hap-r2 --out ./r2/r2.candidate.AFD-bin$bin.sim.rustica-tytleri; done; rm tmp.sim.rt-cand.position.txt
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.candidate.rustica-gutturalis.bin$bin.txt | cut -f1,2 > tmp.sim.rg-cand.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.sim.rustica-gutturalis --positions tmp.sim.rg-cand.position.txt --interchrom-hap-r2 --out ./r2/r2.candidate.AFD-bin$bin.sim.rustica-gutturalis; done; rm tmp.sim.rg-cand.position.txt
for bin in 2 25 3 35 4 45 5 55 6; do tail -n+2 ./afd/allele-freq-diff.candidate.tytleri-gutturalis.bin$bin.txt | cut -f1,2 > tmp.sim.tg-cand.position.txt; vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.sim.tytleri-gutturalis --positions tmp.sim.tg-cand.position.txt --interchrom-hap-r2 --out ./r2/r2.candidate.AFD-bin$bin.sim.tytleri-gutturalis; done; rm tmp.sim.tg-cand.position.txt
To compare distributions of interchromosomal r2, perform statistical summaries, and plot results, run ~/hirundo_speciation_genomics/R/ld_plotting_summary.R
.
To evaluate whether certain pairs of trait loci are in higher LD than others, we'll calculate inter- and intrachromosomal haplotype r2 between all pairs of trait loci in each hybrid zone across SNPs with parental allele frequency differences between 0.3 and 0.6.
- Create temporary headerless files for allele frequency difference bins.
for i in 3 35 4 45 5 55 6; do tail -n +2 ./afd/allele-freq-diff.candidate.rustica-tytleri.bin${i}.txt > ./tmp.allele-freq-diff.candidate.rustica-tytleri.bin${i}.txt; done
for i in 3 35 4 45 5 55 6; do tail -n +2 ./afd/allele-freq-diff.candidate.rustica-gutturalis.bin${i}.txt > ./tmp.allele-freq-diff.candidate.rustica-gutturalis.bin${i}.txt; done
for i in 3 35 4 45 5 55; do tail -n +2 ./afd/allele-freq-diff.candidate.tytleri-gutturalis.bin${i}.txt > ./tmp.allele-freq-diff.candidate.tytleri-gutturalis.bin${i}.txt; done
- Concatenate the bins and clean up.
cat tmp.allele-freq-diff.candidate.rustica-tytleri.bin*.txt | cut -f1,2 > candidate.afd_3-6.rustica-tytleri.txt
cat tmp.allele-freq-diff.candidate.rustica-gutturalis.bin*.txt | cut -f1,2 > candidate.afd_3-6.rustica-gutturalis.txt
cat tmp.allele-freq-diff.candidate.tytleri-gutturalis.bin*.txt | cut -f1,2 > candidate.afd_3-6.tytleri-gutturalis.txt
rm tmp.allele-freq-diff.*
vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.rustica-tytleri --positions candidate.afd_3-6.rustica-tytleri.txt --hap-r2 --out ./r2_pairwise/r2.candidate.afd.rustica-tytleri
vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.rustica-tytleri --positions candidate.afd_3-6.rustica-tytleri.txt --interchrom-hap-r2 --out ./r2_pairwise/r2.candidate.afd.rustica-tytleri
vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.rustica-gutturalis --positions candidate.afd_3-6.rustica-gutturalis.txt --hap-r2 --out ./r2_pairwise/r2.candidate.afd.rustica-gutturalis
vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.rustica-gutturalis --positions candidate.afd_3-6.rustica-gutturalis.txt --interchrom-hap-r2 --out ./r2_pairwise/r2.candidate.afd.rustica-gutturalis
vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.tytleri-gutturalis --positions candidate.afd_3-6.tytleri-gutturalis.txt --hap-r2 --out ./r2_pairwise/r2.candidate.afd.tytleri-gutturalis
vcftools --gzvcf ./vcf/hirundo_rustica+smithii.allsites.final.auto+chrZ.snps.miss02.maf05.ingroup.phased.vcf.gz --keep popmap.rand.tytleri-gutturalis --positions candidate.afd_3-6.tytleri-gutturalis.txt --interchrom-hap-r2 --out ./r2_pairwise/r2.candidate.afd.tytleri-gutturalis
To compare distributions of interchromosomal r2, perform statistical summaries, and plot results, run ~/hirundo_speciation_genomics/R/ld_pairwise_candidate.R
.
For comparison to interchromosomal patterns, and as a benchmark for the relationship between linkage disequilibrium and physical linkage, we'll examine the decay of LD with distance between SNPs on the same chromosome. We'll examine LD decay in each of the hybrid zones and parental populations, and also make comparisons between genome-wide patterns and decay on the Z chromosome, specifically.
cd ./analysis/
mkdir ld_decay
cd ld_decay
mkdir r2
mkdir summary
cp ../ld/popmap.*
- Format 100 kb genome-wide windows.
cat ~/hirundo_speciation_genomics/Hirundo_rustica_bHirRus1.final.auto.genome ~/hirundo_speciation_genomics/Hirundo_rustica_bHirRus1.final.chrZ.genome | grep -v 'NW_' | bedtools makewindows -g - -w 100000 > Hirundo_rustica_bHirRus1.final.window_100kb.bed
- Randomly sample 100 autosomal windows.
echo -e "chrom\tchromStart\tchromEnd" > Hirundo_rustica_bHirRus1.final.window_100kb.rand.bed; grep -v 'NC_053488.1' Hirundo_rustica_bHirRus1.final.window_100kb.bed | shuf -n 100 | bedtools sort -i - >> Hirundo_rustica_bHirRus1.final.window_100kb.auto.rand.bed
Run runHaplotypeR2.sh
to calculate r2 between SNPs within 25 kb on autosomes and the Z chromosome.
sh runHaplotypeR2.sh
Run runSummarizeR2.sh
to summarize values by physical distance.
sh runSummarizeR2.sh
Run ld_decay_summary.R
. This script reads in the summary output above and produces a table of summary statistics in distance bins (the number and size of these are set as the test.range
variable).
We'll run runDecay.sh
to call the R script for autosomes and the Z chromosome for each population.
sh runDecay.sh
Run ~/hirundo_speciation_genomics/R/ld_decay_plotting.R
to summarize/plot the results.
The B10K barn swallow reference genome includes chromosome-length scaffolds, but which are not assigned to the passerine karyotype. Here, we'll establish synteny between scaffolds and chromosome-assigned scaffolds in the zebra finch reference genome, and reformat the barn swallow scaffolds according to these assignments.
mkdir reference
cd reference
mkdir reference_B10K
mkdir reference_taeniopygia
cd reference_B10K
mkdir ncbi_version
cd ncbi_version
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/227/805/GCF_015227805.1_bHirRus1.pri.v2/GCF_015227805.1_bHirRus1.pri.v2_genomic.fna.gz
gunzip GCF_015227805.1_bHirRus1.pri.v2_genomic.fna.gz
cd ../..
cd reference_taeniopygia
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/008/822/105/GCF_008822105.2_bTaeGut2.pat.W.v2/GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fna.gz
gunzip GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fna.gz
cd ..
cd reference_taeniopygia
Run fasta_seq_length.py
to calculate scaffold lengths.
python fasta_seq_length.py GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fna > GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.scaffold_lengths.txt
Make list of chromosome-assigned scaffolds GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.chrom_scaffolds.list
.
Extract chromosome-assigned scaffolds using scaffold_list_extractor.py
.
python scaffold_list_extractor.py GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.fna GCF_008822105.2_bTaeGut2.pat.W.v2_genomic.chrom_scaffolds.list taeniopygia_guttata_ChromAssigned.fasta
cd ..
We'll use MashMap to generate alignments between the reference genomes.
cd reference_B10K
mkdir mashmap_results
cd mashmap_results
mashmap -t 4 -r ../../reference_taeniopygia/taeniopygia_guttata_ChromAssigned.fasta -q ../ncbi_version/GCF_015227805.1_bHirRus1.pri.v2_genomic.fna -f one-to-one -s 50000 --pi 90 -o B10K_hirundo_2_zebra_finch.txt
cd ..
Format B10K_scaffold-chromosome_table.txt
.
Format taeniopygia_scaffold-chromosome_table.txt
Based on MashMap output, the related scaffolds are in B10K_barn_swallow_reformat_table.txt
.
Run reformat_B10K_scaffolds.py
to reformat reference.
python reformat_B10K_scaffolds.py B10K_barn_swallow_reformat_table.txt ncbi_version/GCF_015227805.1_bHirRus1.pri.v2_genomic.fna Hirundo_rustica_bHirRus1.final.fasta
cd ..
This outputs the original B10K scaffolds, but with correctly-assigned chromosome IDs, and in order of correct chromosome assignments. The chromosome-assigned scaffolds are first, followed by unplaced scaffolds.
Here, we'll retrieve the reference to the main working directory and prepare indexes for various analyses (e.g., bwa, GATK, samtools).
cd ~/hirundo_speciation_genomics
mv reference/reference_B10K/Hirundo_rustica_bHirRus1.final.fasta .
bwa index Hirundo_rustica_bHirRus1.final.fasta
./gatk-4.0.8.1/gatk CreateSequenceDictionary -R Hirundo_rustica_bHirRus1.final.fasta
samtools faidx Hirundo_rustica_bHirRus1.final.fasta
mkdir genome_annotation
cd genome_annotation
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/227/805/GCF_015227805.1_bHirRus1.pri.v2/GCF_015227805.1_bHirRus1.pri.v2_genomic.gff.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/227/805/GCF_015227805.1_bHirRus1.pri.v2/GCF_015227805.1_bHirRus1.pri.v2_protein.faa.gz
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/015/227/805/GCF_015227805.1_bHirRus1.pri.v2/GCF_015227805.1_bHirRus1.pri.v2_rna.fna.gz
gunzip *.gz
cd ..
We want to avoid analyzing genotypes in repetitive regions of the genome. We'll generate a repeat annotation for filtering using repeatmasker
.
The dependencies are rmblast
, tandem repeats finder
, and h5py
.
rmblast:
cd ./tmp/
wget http://www.repeatmasker.org/rmblast-2.11.0+-x64-linux.tar.gz
tar -xf rmblast-2.11.0+-x64-linux.tar.gz
sudo cp -r rmblast-2.11.0 /usr/local/
Download tandem repeats finder.
sudo cp trf409.linux64 /usr/local/
h5py:
conda create --name h5py python=3.5
conda deactivate
conda activate h5py
conda install h5py
Set up and unpack distribution.
cd ./tmp/
wget https://www.repeatmasker.org/RepeatMasker/RepeatMasker-4.1.2-p1.tar.gz
tar -xf RepeatMasker-4.1.2-p1.tar.gz
sudo cp -r RepeatMasker /usr/local/
cd /usr/local/RepeatMasker
Download expanded repeat library.
sudo wget https://www.dfam.org/releases/Dfam_3.2/families/Dfam.h5.gz
sudo gunzip Dfam.h5.gz
sudo mv Dfam.h5 ./Libraries
Configure.
sudo perl ./configure
These paths need to be entered during the configuration:
/usr/local/trf409.linux64
/usr/local/rmblast-2.11.0/bin
Run repeatmasker.
cd /data3/hirundo/genome_annotation/
mkdir repeatmasker
cd repeatmasker
/usr/local/RepeatMasker/RepeatMasker -pa 32 -xsmall -species aves ../../Hirundo_rustica_bHirRus1.final.fasta -dir .
Convert .out file to BED format.
tail -n+4 Hirundo_rustica_bHirRus1.final.fasta.out | awk 'BEGIN{OFS="\t"}{print $5,$6-1,$7}' | bedtools sort -i - > Hirundo_rustica_bHirRus1.final.fasta.repeat.bed