Workflow_WGStoStrains

Reproducible workflow in R to process vcf files and characterize SNP patterns from WGS data

Objective

Process raw variant call format (VCF) files to characterize SNP patterns from whole-genome sequencing (WGS) data.

Data required:

  • VCF files (genotypes, depth, minor allele read count, quality, etc.)
  • Metadata such as individual ID, state, collection date, sample type, etc.
  • Reference genome in fasta format
  • Gene annotation files

Steps in the workflow:

The input VCF files are produced by running the WGS pipeline https://bitbucket.org/jgarbe/gopher-pipelines/src/default/ on the Minnesota Supercomputing Institute systems (https://www.msi.umn.edu).

1. Create BSgenome packages from the reference genome

Follow this tutorial to create an R package for each reference genome: https://www.bioconductor.org/packages//2.7/bioc/vignettes/BSgenome/inst/doc/BSgenomeForge.pdf

2. Process vcf files:

  • filter out INDELs and regions prone to sequencing error
  • filter out singleton SNPs
  • annotate and identify SNPs with functional consequences
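
The filtering bullets above can be sketched in base R on a toy variant table. This is a minimal illustration, not the workflow's actual code: the objects `variants`, `geno`, and `masked` are hypothetical, and it assumes that a "singleton SNP" means a SNP carried by only one sample and that error-prone regions are supplied as start/end coordinates. Annotation of functional consequences is not covered here.

```r
# Toy variant table; in practice these columns would be parsed from a VCF file.
variants <- data.frame(
  pos = c(101, 250, 377, 512, 640, 800),
  ref = c("A", "G", "CT", "T", "C", "G"),
  alt = c("G", "GA", "C", "C", "T", "A"),
  stringsAsFactors = FALSE
)

# Genotype matrix: rows = variants, columns = samples, 1 = alternate allele present.
geno <- matrix(
  c(1, 1, 0, 1,
    0, 1, 1, 0,
    1, 0, 0, 0,
    0, 0, 1, 0,
    1, 1, 1, 0,
    0, 1, 0, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(NULL, c("s1", "s2", "s3", "s4"))
)

# Hypothetical regions prone to sequencing error (assumed format: start/end).
masked <- data.frame(start = 600, end = 700)

# A SNP has a single-base reference and a single-base alternate allele;
# anything else is treated as an INDEL and dropped.
is_snp <- nchar(variants$ref) == 1 & nchar(variants$alt) == 1

# Flag variants whose position falls inside any masked region.
in_mask <- vapply(
  variants$pos,
  function(p) any(p >= masked$start & p <= masked$end),
  logical(1)
)

# Assumption: a singleton is a variant carried by exactly one sample.
n_carriers <- rowSums(geno > 0)

keep <- is_snp & !in_mask & n_carriers > 1
filtered_variants <- variants[keep, ]
filtered_geno <- geno[keep, , drop = FALSE]
```

Keeping the three conditions as separate logical vectors makes it easy to report how many variants each filter removes before combining them.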

3. Characterize SNP patterns:

  • Looking at all samples independently: this provides overall summaries at a single time point:
    • rare and common SNPs
    • SNPs shared among different geographical grouping variables such as State (figure below left)
    • SNPs present in different types of samples, such as tissue, fecal, or blood samples (figure below right)

  • Grouping samples by individual: this provides temporal summaries across all time points:
    • SNP patterns within an individual over time (figure below left)
    • SNP patterns between herds or states over time (figure below right)
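
The two kinds of summaries described above could look roughly like the following base R sketch. Everything here is illustrative: the matrix `geno`, the `meta` table, and the 0.2 frequency cutoff for "rare" are assumptions made for the example, not values taken from the workflow.

```r
# Toy presence/absence matrix: rows = SNPs, columns = samples.
geno <- matrix(
  c(1, 1, 1, 1, 1, 1,
    1, 1, 0, 0, 0, 0,
    0, 0, 0, 0, 1, 0,
    0, 1, 1, 1, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("snp", 1:4), paste0("s", 1:6))
)

# Hypothetical sample metadata.
meta <- data.frame(
  sample     = paste0("s", 1:6),
  individual = c("cowA", "cowA", "cowB", "cowB", "cowC", "cowC"),
  state      = c("MN", "MN", "MN", "MN", "WI", "WI"),
  date       = as.Date(c("2019-01-10", "2019-06-10", "2019-01-12",
                         "2019-06-12", "2019-01-15", "2019-06-15")),
  stringsAsFactors = FALSE
)

# All samples independently: classify SNPs as rare or common.
# The 0.2 cutoff is an arbitrary choice for this sketch.
freq <- rowMeans(geno)
snp_class <- ifelse(freq < 0.2, "rare", "common")

# A SNP counts as present in a state if any sample from that state carries it.
by_state <- sapply(
  split(seq_len(ncol(geno)), meta$state),
  function(idx) rowSums(geno[, idx, drop = FALSE]) > 0
)
shared <- rownames(by_state)[rowSums(by_state) == ncol(by_state)]

# Grouped by individual: SNPs absent at the first time point
# but present at the last one.
gained <- lapply(split(meta, meta$individual), function(m) {
  m <- m[order(m$date), ]
  first <- geno[, m$sample[1]]
  last  <- geno[, m$sample[nrow(m)]]
  rownames(geno)[first == 0 & last == 1]
})
```

The same `split()` pattern works for any other grouping column in the metadata, such as sample type or herd.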

4. Infer strains and ancestral relations:

  • Phylogenetic tree
  • Minimum spanning tree (figure below left)

  • Cluster samples into strains:
    • Identify mixture samples and their proportions
    • Investigate how strains evolve over time (figure below)
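
As a rough illustration of the tree-building and clustering ideas in this step, the base R sketch below computes pairwise SNP differences between samples, groups samples by hierarchical clustering, and builds a minimum spanning tree with Prim's algorithm. The toy `profiles` matrix, the choice of average linkage, `k = 2`, and the helper `mst_edges()` are all assumptions for the example; the workflow may use different methods, and mixture detection is not shown.

```r
# Toy SNP profiles: rows = samples, columns = SNPs (1 = alternate allele present).
profiles <- matrix(
  c(1, 1, 0, 0, 0, 0,
    1, 1, 1, 0, 0, 0,
    0, 0, 0, 1, 1, 0,
    0, 0, 0, 1, 1, 1,
    1, 1, 1, 0, 0, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("s", 1:5), paste0("snp", 1:6))
)

# For 0/1 data the Manhattan distance equals the number of differing SNPs.
d <- dist(profiles, method = "manhattan")

# Hierarchical clustering as a simple stand-in for strain assignment.
tree <- hclust(d, method = "average")
strain <- cutree(tree, k = 2)

# Hypothetical helper: minimum spanning tree via Prim's algorithm.
mst_edges <- function(dmat) {
  n <- nrow(dmat)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  from <- character(0)
  to <- character(0)
  weight <- numeric(0)
  while (!all(in_tree)) {
    # Distances from nodes already in the tree to nodes not yet in it.
    sub <- dmat[in_tree, !in_tree, drop = FALSE]
    idx <- which(sub == min(sub), arr.ind = TRUE)[1, ]
    new_node <- colnames(sub)[idx[2]]
    from <- c(from, rownames(sub)[idx[1]])
    to <- c(to, new_node)
    weight <- c(weight, sub[idx[1], idx[2]])
    in_tree[rownames(dmat) == new_node] <- TRUE
  }
  data.frame(from = from, to = to, weight = weight, stringsAsFactors = FALSE)
}

mst <- mst_edges(as.matrix(d))
```

The resulting `mst` edge list has one row per tree edge, with the edge weight giving the number of SNP differences between the two connected samples.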

5. Convert t-SNE to a network:

Iterative Networks
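
The source does not say how the t-SNE output is turned into a network. One plausible reading, shown here purely as an assumption, is to connect each sample to its k nearest neighbours in the embedding. The sketch below uses hypothetical 2-D coordinates in `coords` in place of a real t-SNE result, and `k = 2` is an arbitrary choice.

```r
# Hypothetical 2-D embedding coordinates (stand-in for t-SNE output).
coords <- matrix(
  c(0.0, 0.0,
    0.1, 0.2,
    0.2, 0.1,
    5.0, 5.0,
    5.1, 5.2,
    5.2, 4.9),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("s", 1:6), c("tsne1", "tsne2"))
)

k <- 2  # number of nearest neighbours per sample (assumed)

# Pairwise Euclidean distances; exclude self-matches.
dmat <- as.matrix(dist(coords))
diag(dmat) <- Inf

# Directed edge list: each sample points to its k nearest neighbours.
edges <- do.call(rbind, lapply(rownames(dmat), function(s) {
  nn <- names(sort(dmat[s, ]))[seq_len(k)]
  data.frame(from = s, to = nn, stringsAsFactors = FALSE)
}))

# Collapse to undirected edges by dropping reversed duplicates.
key <- ifelse(edges$from < edges$to,
              paste(edges$from, edges$to),
              paste(edges$to, edges$from))
edges <- edges[!duplicated(key), ]
```

The `edges` table can then be passed to a graph package for layout and plotting. Note that distances in a t-SNE embedding depend on the perplexity and random seed, so a network built this way should be interpreted with care.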