PhD course in analyses of genotyping and next-generation sequencing data in medical and population genetics 2015/2016
Tuesday - Introduction to NGS data:
- Lecture 4: Allele frequencies, genotypes and SNPs.
- Computer Exercises IV
Friday – Detecting selection and genetic load:
- Lecture 9: Detecting natural selection.
- Computer Exercise IX
The data has been already downloaded and it is provided in /ricco/data/matteo/Data
.
These instructions, including all relevant files and scripts, can be found at /home/matteo/Copenhagen
.
You do not have to do this, but to download and access all the material in this web page locally you should use git.
Uncomment it before running it.
# git clone https://github.com/mfumagalli/Copenhagen.git
# cd Copenhagen
# git pull # to be sure you have the latest version, if so you should see "Already up-to-date."
As an illustration, we will use 80 BAM files of human samples (of African, European, East Asian, and Native American descent), a reference genome, and putative ancestral sequence. To make things more interesting, we have downsampled our data to an average mean depth of 2X!.
We will also use VCF files for 120 individuals from the same populations. The human data represents a small genomic region (1MB on chromosome 2) extracted from the 1000 Genomes Project data set. More information on this project can be found here, including their last publication available here.
All data is publicly available.
For your curiosity, a pipeline to retrieve such data is provided here.
Again, you do not have to run this, but if you want to download all data locally you need to have 'samtools', 'bgzip' and 'Rscript' installed in your /usr/bin anrun the following command inside the Copenhagen
folder in the git package (uncomment it before).
# bash Files/data.sh
Data
and Results
folder will be created automatically inside the Copenhagen
folder.
Data will be saved (but not pushed to git main repository) in Data
folder.
Additional scripts are be provided in the Scripts/
folder.
Please set the path for all programs and data we will be using.
# mandatory
ANGSD=/home/ricco/github/angsd
NGSTOOLS=/home/ricco/github/ngsTools
NGSDIST=$NGSTOOLS/ngsDist
MS=/home/ricco/bin/msdir/ms
SS=/home/ricco/github/selscan/bin/linux
# optional
NGSADMIX=/home/ricco/github/angsd/misc/NGSadmix
FASTME=/home/ricco/bin/fastme-2.1.5/binaries/fastme-2.1.5-linux64
You also need to provide the location of data and sequences:
DIR=/home/matteo/Copenhagen
DATA=/ricco/data/matteo/Data
REF=$DATA/ref.fa.gz
ANC=$DATA/anc.fa.gz
Finally, create a folder where you will put all the results.
mkdir Results
MOTIVATION
Detecting signatures of natural selection in the genome has the twofold meaning of (i) understanding which adaptive processes shaped genetic variation and (ii) identifying putative functional variants. In case of humans, biological pathways enriched with selection signatures include pigmentation, immune-system regulation and metabolic processes. The latter may be related to human adaptation to different diet regimes, depending on local food availability (e.g. the case of lactase persistence in dairy-practicing populations).
The human Ectodysplasin A receptor gene, or EDAR, is part of the EDA signalling pathway which specifies prenatally the location, size and shape of ectodermal appendages (such as hair follicles, teeth and glands). EDAR is a textbook example of positive selection in East Asians (Sabeti et al. 2007 Nature) with tested phenotypic effects (using transgenic mice).
Recently, a genome-wide association study found the same functional variant in EDAR associated to several human facial traits (era shape, chin protusion, ...) in Native American populations (Adhikari et al. Nat Commun 2016).
HYPOTHESIS
- Is the functional allele in East Asian at high frequency in other human populations (e.g. Native Americans)?
- Can we identify signatures of natural selection on EDAR in Native Americans?
- Is selection targeting the same functional variant?
CHALLENGES
- Admixed population
- Low-depth sequencing data
- ...
PLAN OF ACTION
Goal day 1:
- Estimate allele frequencies for tested variant for African, European, East Asian and Native American samples from low-depth sequencing data Optional:
- Investigate population structure of American samples related to Europeans and Africans
- Select individuals with high Native American ancestry
Goal day 2:
- Perfom a sliding windows scan based on allele frequency differentiation
- Assess statistical significance of selection signatures through simulations
- Test for extended haplotype homozygosity on high-depth sequencing data
Write the paper!
- Basics of data handling and filtering
- Maximum likelihood and Bayesian estimation
- Genotype likelihoods
- Allele frequencies, SNPs and genotypes calling
- Brief notes on experimental design
- EM algorithm - Imputation - Phasing (Anders)
Practical 2.30-4pm
- Basic data filtering
- Estimation of allele frequencies and SNP calling
- Genotype calling
- Example: estimation of allele frequencies from low-depth sequencing data: the case of EDAR genetic variation in Native Americans
- Detecting adaptive evolution in the high-throughput sequencing era
- The effect of selection on the genome
- Methods to detect selection signals
- The problem of assessing significance
- Bias introduced by NGS data
- Summary statistics from low-depth data
- Selection scan based on genetic differentiation from low-depth data
- Assessing significance through simulations
- Selection test based on haplotype diversity
- Example: detection of natural selection from low-depth sequencing data and haplotype data: the case of EDAR genetic variation in Native Americans