Bioinformatic pipeline for ABBA-BABA analyses of genomic allele frequency data of Sepsis cynipsea and S. neocynipsea in Europe

(1) Trimming of raw reads, which are deposited on the SRA archive under PRJNA612154 was carried out with trimmomatic.

java -jar trimmomatic-0.36.jar \
PE -threads 20 \
input_R1.fastq.gz \
input_R2.fastq.gz \
output_R1.fastq.gz \
output_R1_un.fastq.gz \
output_R2.fastq.gz \
output_R2_un.fastq.gz \
ILLUMINACLIP:AdapterSeq_new.fa:2:30:7:5:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:40

(2) Mapping of the trimmed reads was carried out with bwa mem following the pipeline in Kapun et al. 2020 as described in here using the reference genome of S. thoracica, which can be found here.

(3) After merging the BAM files with samtools mpileup as described here, we called SNPs using the heuristic SNP caller PoolSNP using the following parameters.

sh PoolSNP.sh  \
mpileup=/ABBA_BABA.mpileup.gz \
reference=seto_01_genome.fasta \
names=PhC,PtC,SoC,ZuC,MoC,GeN,HoN,SoN,MoN,IZuC,IZuN,Sor  \
max-cov=0.9 \
min-cov=10 \
min-count=20 \
min-freq=0.01 \
miss-frac=0 \
jobs=10 \
base-quality=15 \
output=ABBA_BABA \
badsites=0

(4) Then, we converted the VCF generated by PoolSNP to the sync file format using a script from here, removed all sites with missing information and calculated allele frequencies for the overall major allele using a script of our new ABBA-BABA test implementation, which can be found here. The allele frequency file that was generated here can be found on Datadryad in here

gunzip -c ABBA_BABA.vcf.gz  | parallel \
  -k \
  --pipe \
  -j20 \
  --no-notice \
  --cat python /github/capoony/DrosEU_pipeline/VCF2sync.py \
  --input {} \
  | grep -v ".:.:.:.:.:." \
  | gzip > ABBA_BABA-filtered.sync.gz
gunzip -c ABBA_BABA-filtered.sync.gz \
  | parallel \
  -k \
  --pipe \
  -j 20 \
  --no-notice \
  --cat python /github/nhmvienna/ABBABABA-4AF/SYNC2AF.py \
  --sync {} \
  | gzip > Sepsid_ABBA-BABA.af.gz

(5) Finally, we conducted ABBA-BABA tests using the approaches of Durand et al. (2011) and Soraggi et al. (2018) using our new python implementation, which can be found here here.

Note that the column positions that describe the populations used as P1,P2,P3 and P4 and are passed to the script via the parameter --Order are 0-based, i.e. 3 = 4th column. The first three columns in the input file are Chromosome, Position and allelic state, respectively, and all populations are arranged as PhC,PtC,SoC,ZuC,MoC,GeN,HoN,SoN,MoN,IZuC,IZuN,Sor. Thus, the order 3,5,10,14 corresponds to: P1=PhC, P2=SoC, P3=SoN and P4=Sor. In the example below, D will be calculated in windows comprised of 500 consecutive SNPs. Please check the ABBABABA-4AF github page for more details.

python3 /github/nhmvienna/ABBABABA-4AF/ABBABABA-4AF.py \
--AF Sepsid_ABBA-BABA.af.gz \
--SNPs 500 \
--Order 3,5,10,14 \
--Output output-file