/workshop-popgenome

Tutorial about R package PopGenome

Resources and Preparation

Resources:

PDFs:

To work on iceberg, copy necessary files from fastdata

mkdir PopGenome
cd PopGenome
cp /usr/local/extras/Genomics/workshops/February2019/PopGenome/PopGenome_data.zip ./
unzip PopGenome_data.zip
rm PopGenome_data.zip

Introduction

PopGenome is an R package (https://cran.r-project.org/web/packages/PopGenome/index.html) for analyses of population genomic data. For this tutorial, please make sure that your R working directory is set correctly and you have all the packages installed, e.g. install.packages("PopGenome"). To get an overview about the file contents, inspect files with a text Editor (e.g. Notepad+) or via the command line (e.g. more, nano) before starting to work on them.

Start R and load PopGenome library

# if you are on the login node run qrsh
qrsh
# Load R version
module load  apps/R/3.3.1
# Start R
R
# Install library
install.packages("PopGenome")
# Loading module 
library(PopGenome)

Fasta Files

The file fasta_file.txt contains the coding sequence for one locus/gene from different Arabidopsis thaliana individuals (accessions) and the outgroup sequence from Arabidopsis lyrata in the folder fasta. Fasta is a common file format to store sequence information, for more information: https://en.wikipedia.org/wiki/FASTA_format

# This is in terminal, not in the R shell
# to finish the R shell type
q()
# view file content
more fasta/fasta_file.txt

Reading Fasta Files

# Loading module if you have not done yet
library(PopGenome)
# Reading data, read fasta from folder 
GENOME.class <- readData("fasta") 
get.sum.data(GENOME.class)
get.individuals(GENOME.class)

1.What statistics can one obtain from get.sum.data function?

2.Folders fasta_a, fasta_b, fasta_c contain modified alignments. Identify the differences between the datasets. Why PopGenome fails to load fasta files?

Obtaining summary statistics from alignments

Note: Since calculation of certain population genetic parameters can be computational intense, they have to be executed separately beforehand. For this modules have to be run. Note that module Fst has to be executed with F_st. The statistic Tajima’s D is part of the module neutrality not Fst

# Available statistics and examples 
show.slots(GENOME.class) 
# Run necessary module 
GENOME.class <- F_ST.stats(GENOME.class) 
GENOME.class <- neutrality.stats(GENOME.class) 
GENOME.class@n.sites 
GENOME.class@Pi
GENOME.class@Tajima.D

3.What different modules are available? (show.slots)

4.What module is necessary to be executed in order to obtain Wall.B?

5.How could one obtain a per site estimate of Pi? (look carefully at solutions to question 1)

Obtaining statistics for regions

# Available region data and statistics 
GENOME.class@region.data
GENOME.class@region.stats 
# Examples
GENOME.class@region.data@biallelic.sites[[1]][1:10]
GENOME.class@region.data@transitions[[1]][1:10]

6.How many sites have gaps?

7.How many singletons are in the dataset? (see also An_introduction_to_the_PopGenome_package.pdf, section 3.1)

8.What is the difference between region.data and region.stats? (see also Whole_genome_analyses_using_VCF_files.pdf, section 11 and 12)

Define outgroups and populations

Note: If one ore more outgroup sequences are defined, PopGenome will only consider SNPs where the outgroup is monomorphic; the monomorphic nucleotide is then automatically defined as the major allele (encoded by 0).

# Without defining populations 
get.individuals(GENOME.class)
GENOME.class <- neutrality.stats(GENOME.class,detail=TRUE)
get.neutrality(GENOME.class)[[1]] 
# Define populations with lists
GENOME.class <- set.populations(GENOME.class,list(
c("CON","KAS-1","RUB-1","PER-1","RI-0","MR-0","TUL-0"),
c("MH-0","YO-0","ITA-0","CVI-0","COL-2","LA-0","NC-1") )) 
# Check whether grouping is set correctly 
GENOME.class@region.data@populations
GENOME.class@region.data@populations2 
GENOME.class@region.data@outgroup
# Recalculate statistics for populations 
GENOME.class <-neutrality.stats(GENOME.class,detail=TRUE) 
GENOME.class@Tajima.D 
# Each population 
get.neutrality(GENOME.class)[[1]]
get.neutrality(GENOME.class)[[2]] 
# Set an outgroup 
GENOME.class <-set.outgroup(GENOME.class,c("Alyrata"))
GENOME.class@region.data@outgroup 
GENOME.class <- neutrality.stats(GENOME.class,detail=TRUE)
get.neutrality(GENOME.class)[[1]] 
get.neutrality(GENOME.class)[[2]]

9.Name implemented statistics that require an outgroup, e.g. that are calculated after defining the outgroup.

10.What do you have to pay attention to when applying the McDonald-Kreitman test? (see Whole_genome_analyses_using_VCF_files.pdf)

Analysing VCF files for whole genome data

The files LGE22.gff, LGE22.vcf, LGE22.fa in subfolders in the folder great_tit contain information about a part of chromosome 22 of the passerine bird Parus major (great tit). The fasta file contains the sequence information, the vcf file varianat information of great tit individuals and the gff file inforation about annotated regions in this chromosome.

more great_tit/fasta/LGE22.fasta
more great_tit/gff/LGE22.gff
more great_tit/vcf2/LGE22.vcf

Loading VCF files

There are two ways to read in VCF files, either a folder of VCF files with readData or a single VCF file with readVCF. To read a VCF file using readVCF it needs to be compressed with bgzip and indexed with tabix. The tabix files need to be placed in the same folder as the vcf file.

# What parameters need to be defined 
GENOME.class <-readVCF("great_tit/vcf/LGE22.vcf.gz", 6000,"chrLGE22_Parus_Major_build_1.0.2",1,773534)
GENOME.class@region.names 
GENOME.class <- neutrality.stats(GENOME.class, FAST=TRUE) 
get.sum.data(GENOME.class) 
GENOME.class@region.data

11.What parameters need to be defined to readVCF? (see PopGenome.pdf)

12.What is the overall diversity (theta and pi) of chromosome LGE22?

Loading VCF files with annotation

GENOME2.class <- readData("great_tit/vcf2",format="VCF", gffpath="great_tit/gff") 
get.sum.data(GENOME2.class)
GENOME2.class@region.data 
GENOME2.class <- set.synnonsyn(GENOME2.class, ref.chr="great_tit/fasta/LGE22.fasta")
GENOME2.class@region.data@synonymous
GENOME2.class@region.data@CodingSNPS 
GENOME2.class.syn <- neutrality.stats(GENOME2.class,subsites="syn")
GENOME2.class.syn@Tajima.D 
GENOME2.class.syn@theta_Watterson

13.What is theta Watterson and Tajima’s D of chromosome LGE22 for synonymous and nonsynonymous sites?

Analysing RADseq data using VCF

In the folder rad the file variants.vcf includes RAD sequenced data (https://en.wikipedia.org/wiki/Restriction_site_associated_DNA_markers) of two species. Information about the species can be found in the files ind_species1.txt, ind_species2.txt. The assembled data can be found in rad_assembly.fa

# view file content
more rad/variants.vcf
more rad/ind_species1.txt
more rad/ind_species2.txt
# not included in the zip because of file size
more rad/rad_assembly.fa  

File preparations

First, variants will be split into scaffolds (for computational reasons). Calculations will then be conducted on a smaller subset.

# SPLIT VCF FILE
VCF_split_into_scaffolds("rad/variants.vcf","rad_split_vcf") 
# READ IN DATA, smaller subset
GENOME.class <- readData("rad_split_vcf_small",format="VCF") 
pop1<-as.character(read.table("rad/ind_species1.txt")[[1]]) 
pop2<-as.character(read.table("rad/ind_species2.txt")[[1]]) 
GENOME.class<- set.populations(GENOME.class,list(pop1,pop2),diploid=TRUE) 
# CHECK
GENOME.class@populations

Obtaining statistics from multiple VCFs derived from RADseq

# NEUTRALITY STATISTICS 
GENOME.class <- neutrality.stats(GENOME.class, FAST=TRUE) 
get.neutrality(GENOME.class)[[1]] 
GENOME.class@Tajima.D 
# FST 
GENOME.class <- F_ST.stats(GENOME.class)
get.F_ST(GENOME.class)[[1]] 
GENOME.class@nucleotide.F_ST 
# DIVERSITY
GENOME.class <- diversity.stats(GENOME.class)
get.diversity(GENOME.class) 
GENOME.class@nuc.diversity.within 
# SFS
GENOME.class <-detail.stats(GENOME.class,site.spectrum=TRUE,site.FST=TRUE) 
results <- get.detail(GENOME.class) 
GENOME.class@region.stats@minor.allele.freqs

14.Plot a site-frequency-spectrum for each population.

# Concatenate loci
CON <- concatenate.regions(GENOME.class) 
CON <- detail.stats(CON,site.spectrum=TRUE,site.FST=TRUE) 
results <-get.detail(CON) 
allele_Freqs <- CON@region.stats@minor.allele.freqs[[1]] 
freq.table <- list()
freq.table[[1]] <- table(allele_Freqs) 
sfs <- data.frame(freq.table)

library(ggplot2) 
ggplot(sfs, aes(x=allele_Freqs, y=Freq)) + geom_bar(stat =identity’)

Additional aspects

More information are available in three pdfs accompanied by the program (see folder pdf): An introduction to the PopGenome package: Sliding window analysis, reading SNP data files, coalescent simulations; Whole genome analyses using PopGenome and VCF files: Details about reading tabixed VCF files, examples, graphical output, parallel read-in, pre-filtering VCF files; Package PopGenome: Documentation about all implemented functions with examples

Including features from GFF files to Fasta files

If no gff-file was specified when the data was read in, it is assumed that the alignment is in the correct reading frame (starting at a first codon position). The GFF folder contains GFF-files for each alignment stored in the FASTA folder. The GFF files should have the same names (without any extensions like .fas or .gff) as the corresponding FASTA files to ensure that sequence and annotation are matched correctly.

Handling missing data and differences between readData and readVCF

PopGenome can use missing data, e.g. positions with gaps. Also note the differences between readData and readVCF for your own analysis (for further details see Whole_genome_analyses_using_VCF_files.pdf).