Workflow_WGStoStrains
Reproducible workflow in R to process vcf files and characterize SNP patterns from WGS data
Objective
Process raw variant calling files to characterize SNP patterns from WGS data
Data required:
- Variant calling files (genotypes, depth, minor allele read count, quality, etc)
- Meta data such as individual ID, state, collection date, sample type, etc.
- Reference genome in fasta format
- Gene annotation files
Steps in the workflow:
Vcf files from running the WGS pipeline https://bitbucket.org/jgarbe/gopher-pipelines/src/default/ on https://www.msi.umn.edu.
1. Creat BSgenome packages from reference genome
Follow this tutorial to create the R package for each reference genome https://www.bioconductor.org/packages//2.7/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf
2. Process vcf files:
- filter out INDELs and regions prone to sequencing error
- filter out singleton SNPs
- annotate and identify SNPs with functional consequences
3. Characterize SNP patterns:
- Looking at all samples indepedently: this provides overall summaries at a single time point:
- rare and common SNPs
- SNPs shared among different geographical grouping variables such as State (figure below left)
- SNPs presented in different type of samples, such as tissue, fecal, or blood samples (figure below right)
- Group samples by individuals: this provides temporal summaries for all time points:
- SNP patterns within an individual over time (figure below left)
- SNP patterns between herds or states over time (figure below right)
4. Infer strains and ancestral relations:
- Phylogenetic tree
- Minimum spanning tree (figure below left)
- Cluster sample to strains:
- Identify mixture samples and their proportions
- Investigate how strains evolve over time (figure below)