/Intro2PopGenomics

Leroy & Rougemont - “learning-by-doing” introduction to population genomics - scripts

Primary LanguageShell

Introduction to population genomics

This github directory hosts all scripts used to perform the analyses shown in in the book chapter of Leroy & Rougemont (2019).

Leroy, T & Rougemont, Q. 2019. Introduction to population genetics analysis methods. In: Molecular Plant Taxonomy (Springer)

Please send an email to us for questions regarding these scripts or for full-text requests.

Foreword

Table of contents:

The github repository follow the same organisation as in the book chapter.
Individual data (African rice analyses, 3.2)
3.2.2 : From raw data to VCF
3.2.3 - Population structure
3.2.4 - Nucleotide diversity
3.2.5 - Historical population sizes
3.2.6 - Deleterious mutation load
3.2.7 - Fst & genome scans
Pool-seq data (Sessile oak analyses, 3.3)
3.3.1 - Pool-seq vs. individual data
3.3.3 - From raw data to allele counts
3.3.4 - Population splits & mixtures
3.3.5 - Fst estimates
3.3.6 - Genome Scan for Selection
3.3.7 - Genotype-Environment associations

Important note

Our scripts are not standalone executables. Quite the contrary, these scripts (deliberately) require some simple edits to adjust to your data. Editing script is probably the best way to learn how a script works, to detect and correct the errors and, more broadly, to start learning how to code. So please take some time to read the scripts and, ideally, to briefly look at the software user manuals. Change the paths to files and programs to adjust the scripts to your computer architecture.

Overview

Individual data - Application to the African Rice data

3.2.2 : From raw data to VCF


1/Import sequencing data (./3.2.2/1-Import_RawData/)
Softwares needed: wget (ftp-transfert)
Here a list of all ftp addresses
e.g.wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR200/000/ERR2008850/ERR2008850_1.fastq.gz
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR200/000/ERR2008850/ERR2008850_2.fastq.gz

2/Read trimming (./3.2.2/2-Trimming/)
Softwares needed: Trimmomatic
java -Xmx4g -jar ./trimmomatic-0.33.jar PE -threads 1 -phred33 "[file]_1.fastq.gz [file]_2.fastq.gz [file]_1_cleaned.fastq.gz [file]_1_cleaned_unpaired.fastq.gz [file]_2_cleaned.fastq.gz [file]_2_cleaned_unpaired.fastq.gz ILLUMINACLIP:./adapters/TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

3/Downloading & indexing a reference genome (./3.2.2/3-Download_Index_references/)
Softwares needed: wget (ftp-transfert), BWA, Samtools & Picard (indexing) 
Asian rice genome: wget ftp://ftp.ensemblgenomes.org/pub/plants/release-42/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna_sm.toplevel.fa.gz
BWA: bwa index -a bwtsw [fasta]
Samtools: samtools faidx [fasta]
Picard: java -Xmx4g  -jar picard.jar  CreateSequenceDictionary REFERENCE=[fasta]  OUTPUT=[fasta].dict 

4/Mapping & Individual Calling (./3.2.2/4-PipelineMappingCalling/)
Softwares needed: BWA (mapping), Samtools (filtering), Picard (removing duplicates) & GATK (creating gVCF files)
bash 1_mapping.sh [RawData_Directory] [File_Trim_Paired_1] [File_Trim_Paired_2] [File_Trim_Unpaired_1] [File_Trim_Unpaired_2] [Reference_Genome] [Number_of_CPU_to_use]
bash 2_snpindel_callingGVCF.sh [ID] [Reference_Genome] [output_directory] [Number_of_CPU_to_use] 

5/Joint Genotyping (./3.2.2/5-Joint_genotyping/)
 bash 3_intervals_jointgenotyping.sh [reference_genome] [nb_cpus]  

6/Filtering variants (./3.2.2/6-Variant_Filtration/)
 ./VariantFiltrationVCF.py -q 2.0 -s 60.0 -m 40.0 -n -2.0 -r -2.0 -w 45000 -f [VCF] > [filtered_VCF] 

3.2.3 - Population structure


Perform Principal Component Analysis (./3.2.3/PCA/)
R packages needed: SNPRelate (& gdsfmt)
Details shown in script_PCA_from_vcf.R (Import & convert the vcf file, compute PCA & generate plots) 

Infer ancestry proportions (./3.2.3/fastStructure/)
Software needed: Plink & fastStructure
Generate an input file using Plink:
 ./plink --allow-extra-chr -allow-extra-chr --make-bed --noweb --out [VCF].bed --vcf [VCF] 
FastStructure inferences:
./structure.py -K [nb_of_clusters] --input [VCF].bed --output [VCF].bed.K[nb_of_clusters] --full --cv [cross-validation steps] --format bed
./chooseK.py --input=[VCF].bed.K 

3.2.4 - Nucleotide diversity


Generate Fasta files from VCF (./3.2.4/Scripts_generate_fasta_sequences_from_vcf/)
Scripts were developed by Leroy et al. 2019 bioRxiv 505610, ver. 4 peer-reviewed and recommended by PCI Evolutionary Biology (paper: here, scripts: here)
python ../VCF2Fasta_fast.py -q [threshold_base_quality] -m [minimum_coverage_position] -M [maximum_coverage_position] -f [VCF] > [outputdir]/Outputs_VCF2Fasta.txt

Compute theta, pi & Tajima's D (./3.2.4/Scripts_compute_pi_D/)
Software needed: seq_stat
bash script_compute_pi_slidwin.sh [referencegenome].scafflength [VCF] [ouputdir] [size_of_sliding_window] [output_prefix]
(scafflength = file containing the length of each scaffold as computed by ./3.2.2/5-Joint_genotyping/script_scaff_length.py):  

Generate a circlize plot (./3.2.4/Rscript_plot_Circlize/) 
R package needed: circlize (more details: here) 
Details shown in script_circlize_GC_pi_ROD_Africanrice.R (Import the file containing the summary statistics & generate a circlize plot)
(the file containing the summary statistics was made available at ./3.2.4/Results_pi/Genomic_pi_ROD_010419.withoutNA.txt)

3.2.5 - Historical population sizes
Software needed: smc++


Convert the vcf to the smc++ input format (./3.2.5/1_vcf2smcpp.sh)
smc++ vcf2smc --cores [nb_cpu] [input_vcf_file] [output_smc_data_files] [chr] [pop1:Ind1,Ind2,Ind3..]

Perform the inference (./3.2.5/2_analysis_smc.sh)
smc++ cv --cores [nb_cpu] --out [out] --Nmax [Ne_max] --knots [spline_knots_for_smoothing] [mutation_rate] [smc_data_files]

Generate a plot (./3.2.5/3_smcpp_plot.sh)
smc++ plot [outfile.pdf] -g 1 -c [infile_model.final.json]
(-c produces a CSV-formatted table: this file is also available ./3.2.5/Rscript_plot/plot_generation.csv)
It is also possible to use the newly generated .csv file to generate the plot under the R environment, e.g. using the R package ggplot2 : see ./3.2.5/Rscript_plot/script_generateplot_smcpp_220519.R for details

3.2.6 - Deleterious mutation load


Counts the number of derived alleles

1/Generate VCF for the outgroup species (./3.2.6/Scripts_derived_alleles/download_trimming_mapping_data_other_species)
Download the raw data for the outgroup species -> generate joint VCF / outgroup species
Same steps than in the section "3.2.2 : From raw data to VCF"
    
2/Detect the ancestral state & compute allele frequencies (./3.2.6/Scripts_derived_alleles/script_vcfoutgroup_to_ancestral_derived.sh) 

In a nutshell:
Parse the 3 joint vcf (the focal species & the 2 newly obtained vcf corresponding to the 2 outgroup species) 
./script_parser_vcf.py [VCF_focal_species_ONLY_PASS_variants] [VCF outgroup1] [VCF outgroup2]> [Merged_VCF_file]

Then use awk '{print $X"    "Y...}' (where X and Y correspond to the columns in the [Merged_VCF_file]) to parse the data to obtain the following file format:
chr    pos focal_All1   focal_All2  outgroup1_all1  outgroup1_all2 outgroup2_all1   outgroup2_all2
1	1248	G	A	G	.	G	.

Detect the ancestral allele & compute allele frequencies
./script_ancestral_derived_counts.py [this_infile] > [file_with_derived_counts]
./script_compute_derivedallelefreq.py [file_with_derived_counts] > [files_with_derivedallfreq]

Deleterious variant prediction (./3.2.6/Scripts_provean/) 
Softwares needed: blast+ & provean
Follow each step in the order indicated (from script 01 to script 08)

3.2.7 - Fst & genome scans


Compute Fst from vcf (./3.2.7/script_compute_Fst/script_vcftools_Fst.sh) 
Softwares needed: vcftools  
./vcftools --vcf [VCF] --weir-fst-pop [IDs_sp1] --weir-fst-pop [IDs_sp2] --fst-window-size [window_size_in_bp] --fst-window-step [size_window_step_in_bp]
where [IDs_sp1], [IDs_sp2] , ... correspond to a file containing a list of all individuals for sp1, sp2, ... (require the same ID as in the vcf)

To compute Fst on non-overlapping sliding windows (preferred), use the same value for both [window_size_in_bp] and [size_window_step_in_bp]
To compute per-SNP Fst values, set [window_size_in_bp] = 1 and [size_window_step_in_bp] = 1

To generate a circlize plot of the Fst values:
R package needed: circlize (more details: here) 
see ./3.2.7/Fst/Rscript_plot_circlize/script_circlize_Fst_africanrice.R

Perform outlier detection (./3.2.7/pcadapt/Rscript_pcadapt/script_pcadapt.R) 
R package needed: pcadapt (more details: here) 
To import the data, compute PCA, perform scans & generate Manhattan plots, see ./3.2.7/pcadapt/Rscript_pcadapt/script_pcadapt.R

Pool-seq data - Application to the sessile oak data

3.3.1 - Pool-seq vs. individual data


Excel file needed: PIFs  
A simulation comparing the precision in allele frequency estimation for two sequencing strategies: pool-seq & individual sequencing (see here for details).
Here a strategy based on a growing number of individuals sequenced at a pool coverage of 100X is compared to a strategy assuming 20 individuals sequenced separately at 20X. 
Results are shown in ./3.3.1/PIFs_simulation/Poolseq_vs_individual_100xpoolvs20xNbindividuals.sed. 
See also the R script ./3.3.1/PIFs_simulation/script_R_comp_power_poolseq_individual_sequencing.R. 

3.3.3 - From raw data to allele counts


1/ Downloading sequencing reads, trimming & reference genome indexing (see section 3.2.2 above)
Softwares needed: wget (ftp-transfert) & Trimmomatic (read trimming)
See here for a list of all sequencing data available (1 run accession = 1 lane, 4 lanes/pool). A correspondence table between SRA accession IDs and Population IDs as indicated in Leroy et al. 2019 is available (./3.3.3/IDs_correspondence_table.txt/)
The oak reference genome (PM1N) can be downloaded from oakgenome.

2/ Mapping, sorting & removing duplicates (see ./3.3.3/4-Mapping/mapping_O16.sh)
Softwares needed: bowtie2, Samtools (filtering & merging bams), Picard (sorting & removing duplicates)
To run bowtie2 with the sensitive end-to-end mode (see here for more details): 
bowtie2 -p [nb_CPUs_to_use] -k [search_alignments] -q --sensitive -x $ref -1 [file_containing_reads_end1] -2 [file_containing_reads_end2] -t --un-gz [output_file_reads_not_aligned] | samtools view -Shb /dev/stdin > [outfile.bam]
picard-tools/SortSam.jar INPUT= [outfile.bam] OUTPUT=[outfile.bam].pisorted SORT_ORDER=coordinate
picard-tools/MarkDuplicates.jar INPUT=[outfile.bam].pisorted OUTPUT=[outfile.bam].pisorted.dedup METRICS_FILE=[output_file_metrics] VALIDATION_STRINGENCY=LENIENT MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 REMOVE_DUPLICATES=true
samtools merge [total_bam_pop1] [bam_lane1_pop1] [bam_lane2_pop1] [bam_lane3_pop1] [bam_lane4_pop1] 

3/ Pileup & synchronized pileup  (see ./3.3.3/5-Pileup_sync/script_samtools_mpileup.sh)
Softwares needed: Samtools & Popoolation2
samtools mpileup -f [fasta_genome] [total_bam_pop1] [total_bam_pop2] [total_bam_pop3] ... > [outfile_pileup]
java -Xmx4g -jar mpileup2sync.jar --input [outfile_pileup] --output [outfile_pileup].sync --fastq-type sanger --min-qual [minimum_base_quality] --threads [nb_CPUs_to_use]

4/ Generate allele count for each position and SNP
Software needed: Popoolation2
popoolation/snp-frequency-diff.pl --input [outfile_pileup].sync --output-prefix [prefix_for_output_files] --min-count [minimum_number_of_alt_alleles] --min-coverage [min_coverage_per_pool] --max-coverage [max_coverage_per_pool]
./allele_count_withMAF.sh [previous_popoolation_prefix_for_output_files].rc [output-prefix] [tmpfile]
This second script is just to give an example. You can also the R script poolfstat to generate allele counts from the synchronized file, as described in the Rscript ./3.3.6/script_baypass/generate_baypass_inputfile.R (see section 3.3.6 below)

3.3.4 - Population splits & mixtures


Perform TreeMix simulations (./3.3.4/Script_TreeMix/treemix.sh)
Software needed: TreeMix
for i in {1..10}; do 
	./treemix-1.13/src/treemix -i [infile] -k 1000 -m $i -o [outfile].m.$i
done

Compute explained variance, draw phylogenetic trees & show residuals (./3.3.4/Script_TreeMix/treemix.sh)
R scripts needed: TreeMix-associated R scripts & ape
See ./3.3.4/Rscript_TreeMix/script_R_treemix.R for details.

3.3.5 - Fst estimates


Compute fixation index (Fst) from synchronized pileup using poolfstat(./3.3.5/)
R scripts needed: poolfstat, reshape2 & ggplot2
To import data from the popoolation2 synchronized mpileup format, compute pairwise population population Fst matrix, per-SNP pairwise or among-population Fst values or generate plots, see ./3.3.5/script_poolfstat_Fst_Hivert.R for details.

3.3.6 - Genome Scan for Selection


Generate an infile (./3.3.6/script_baypass/script_baypass.sh)
R scripts needed: poolfstat
From the popoolation2 synchronized mpileup format, the popsync2pooldata & pooldata2genobaypass functions are probably the easiest way to generate the baypass input file (see ./3.3.6/script_baypass/generate_baypass_inputfile.R)

Perform genome scans to detect candidate SNPs under selection (./3.3.6/script_baypass/script_baypass.sh)
Software needed: BayPass
i_baypass -npop [nb_pops] -gfile [INFILE] -poolsizefile [FILE_with_popsizes] [+ parameters related to Markov chains, read the BayPass's manual here">here]

Identify outlier loci (./3.3.6/Rscript_XtX/script_baypass_XtX_plots.R
R scripts needed: ggplot2
See ./3.3.6/Rscript_XtX/script_baypass_XtX_plots.R
Generate pseudo-observed datasets (PODS) and Manhattan plots of XtX values highlighting outliers 
For more information on how to perform neutral simulations to calibrate the XtX, see also here">here]

3.3.7 - Genotype-Environment associations


Detect SNPs with clinal variation along environment or phenotypic gradients (./3.3.7/script_baypass_GEA-GPA.sh)
Software needed: BayPass
i_baypass -npop [nb_pops] -gfile [INFILE] -efile [FILE_with_evironmental] -poolsizefile [FILE_with_popsizes] -scalecov [+ parameters related to Markov chains, read the BayPass's manual here">here]
It is important to use the "-scalecov" option to scale all covariables.


Identify Genotype-environment (GEA) or Genotype-Phenotype associations (GPA/pGWAS) (./3.3.7/script_baypass_XtX_BFplots.R)
R scripts needed: ggplot2
See ./3.3.7/script_baypass_XtX_BFplots.R
This script generates plots of XtX vs. Bayes Factors (BF) and Manhattan plots of BF values highlighting SNPs with strong support for associations with environmental or phenotypic variables