stackr: an R package for the analysis of GBS/RADseq data

This is the development page of the stackr, if you want to help, see contributions section

Most genomic analysis look for patterns and trends with various statistics. Bias, noise and outliers can have bounded influence on estimators and interfere with polymorphism discovery. Avoid bad data exploration and control the impact of filters on your downstream genetic analysis. Use stackr to: import, explore, manipulate, visualize, filter, impute and export your GBS/RADseq data.

Installation

To try out the dev version of stackr, copy/paste the code below:

if (!require("devtools")) install.packages("devtools") # to install
devtools::install_github("thierrygosselin/stackr", build_vignettes = TRUE)  # to install WITH vignettes
library(stackr) # to load

Caracteristics	Description
Import	Various supported genomic file formats in `tidy_genomic_format` and/or their separate module: VCF (Danecek et al., 2011) PLINK tped/tfam (Purcell et al., 2007) genind (Jombart et al., 2010; Jombart and Ahmed, 2011) genlight (Jombart et al., 2010; Jombart and Ahmed, 2011), also in `tidy_genlight` strataG gtypes (Archer et al., 2016) Genepop (Raymond and Rousset, 1995; Rousset, 2008), also in `tidy_genepop` STACKS haplotype file (Catchen et al., 2011, 2013) hierfstat (Goudet, 2005), also in `tidy_fstat` Dataframes of genotypes in wide or long/tidy format, also in `tidy_wide`
Output	Export genomic data out of stackr using `genomic_converter` or the separate modules (12 formats): `write_vcf`: VCF (Danecek et al., 2011) `write_plink`: PLINK tped/tfam (Purcell et al., 2007) `write_genind`: adegenet genind and genlight (Jombart et al., 2010; Jombart and Ahmed, 2011) `write_genlight`: genlight (Jombart et al., 2010; Jombart and Ahmed, 2011) `write_gtypes`: strataG gtypes `write_genepop`: Genepop (Raymond and Rousset, 1995; Rousset, 2008) STACKS haplotype file (Catchen et al., 2011, 2013) `write_betadiv`: betadiv (Lamy, 2015) `vcf2dadi`: δaδi (Gutenkunst et al., 2009) `write_structure`: structure (Pritchard et al., 2000) `write_arlequin`: Arlequin (Excoffier et al. 2005) `write_hierfstat`: hierfstat (Goudet, 2005) Dataframes of genotypes in wide or long/tidy format
Conversion function	`genomic_converter` import/export genomic formats mentioned above. The function is also integrated with usefull filters, blacklist and whitelist.
Outliers detection	`detect_duplicate_genomes`: Detect and remove duplicate individuals from your dataset `detect_mixed_genomes`: Detect and remove potentially mixed individuals `summary_haplotype` and `filter_snp_number`: Discard of outlier markers with de novo assembly artifact (e.g. markers with an extreme number of SNP per haplotype or with irregular number of alleles)
Pattern of missingness	`missing_visualization`: Visualize patterns of missing data. Find patterns associated with different variables of your study (lanes, chips, sequencers, populations, sample sites, reads/samples, homozygosity, etc)
Filters	Alleles, genotypes, markers, individuals and populations can be filtered and/or selected in several ways: `filter_coverage`: Discard markers based on read depth (coverage) of alleles and genotypes `filter_genotype_likelihood`: Discard markers based on genotype likelihood `filter_individual`: Discard markers based on proportion of genotyped individuals `filter_population`: Discard markers based on proportion of genotyped populations `filter_het`: Discard markers based on the observed heterozygosity (Het obs) `filter_fis`: Discard markers based on the inbreeding coefficient (Fis) `filter_hw`: Discard markers based on the Hardy-Weinberg Equilibrium expectations (HWE) `filter_maf`: Discard markers based on local or global (overall) minor allele frequency
Imputations	Map-independent imputations of missing genotypes. Using Random Forest or the most frequent category. Imputations can be conducted overall samples or by populations. Imputations are integrated in several functions and in a separate module: `stackr_imputations_module`
ggplot2-based plotting	Visualize distribution of important metric and statistics and create publication-ready figures
Parallel	Codes designed and optimized for fast computations running imputations, iterations, etc. in parallel

Prerequisite - Suggestions - Troubleshooting

Parallel computing: Follow the steps in this vignette to install an OpenMP enabled randomForestSRC package to do imputations in parallel.
Installation problem: see this vignette
Windows users: 1. Install Rtools. 2. To have stackr run in parallel, use parallelsugar. Easy to install and use (instructions).
For a better experience in stackr and in R in general, I recommend using RStudio. The R GUI is unstable with functions using parallel (more info).

Below, the combination of packages and how I install/load them :

if (!require("pacman")) install.packages("pacman")
pacman::p_load(devtools, tidyverse, data.table, parallel, lazyeval, randomForestSRC)
devtools::install_github("thierrygosselin/stackr", build_vignettes = TRUE)
library("stackr")

Vignettes, R Notebooks and examples

Vignettes (in development, check periodically for updates):

Vignettes can also be accessed inside R with:

browseVignettes("stackr") # To browse vignettes
vignette("vignette_vcf2dadi") # To open specific vignette

R Notebooks:

Missing data visualization and analysis (html vignette) and (Rmd)

Citation:

To get the citation, inside R:

citation("stackr")

New features

Change log, version, new features and bug history now lives in the NEWS.md file

v.0.4.5

temporary fix to tidy_genomic_data to read unconventional Tassel VCF
new function ibdg_fh computes the FH measure that was previously computed in summary_haplotypes. It now works with biallelic and multiallelic data.

The FH measure is based on the excess in the observed number of homozygous genotypes within an individual relative to the mean number of homozygous genotypes expected under random mating (see function for details). The IBDg in the name is because the measure is a proxy of the realized proportion of the genome that is identical by descent by reference to the current population under hypothetical random mating.
missing_visualization now computes the FH measure and look for correlation with average missingness per individual.
tidy_stacks_haplotypes_vcf is now deprecated in favor of using tidy_genomic_data that will import haplotypic vcf files.

v.0.4.4

several updates to make functions faster.
stackr_imputations_module no longer imputes globally after imputations by populations. Instead, use common.markers or not to test impacts.
bug fix with change_alleles that was not working properly inside the imputation module.
snp_ld is not a separate module available for users. Check documentation.
missing_visualization now show the proportion of variance with plot axis text.

For previous news: NEWS.md file

Roadmap of future developments:

Until publication stackr will change rapidly (see contributions below for bug reports).
Updated filters: more efficient, interactive and visualization included: in progress
Workflow tutorial that links functions and points to specific vignettes to further explore some problems: in progress
Integration of several functions with STACKS and DArT database.
Use Shiny and ggvis when subplots or facets becomes available...
Suggestions ?

Contributions:

This package has been developed in the open, and it wouldn’t be nearly as good without your contributions. There are a number of ways you can help me make this package even better:

If you don’t understand something, please let me know.
Your feedback on what is confusing or hard to understand is valuable.
If you spot a typo, feel free to edit the underlying page and send a pull request.

New to pull request on github ? The process is very easy:

Click the edit this page on the sidebar.
Make the changes using github’s in-page editor and save.
Submit a pull request and include a brief description of your changes.
“Fixing typos” is perfectly adequate.

GBS workflow

The stackr package fits currently at the end of the GBS workflow. Below, a flow chart using STACKS and other software.

stackr workflow

Currently under construction. Come back soon!

Table 1: Quality control and filtering RAD/GBS data

Parameters	Libraries & Seq.Lanes	Alleles	Genotypes	Individuals	Markers	Sampling sites	Populations	Globally
Quality	x			x
Assembly and genotyping	x
Outliers		x	x	x	x
Missingness	x	x	x	x	x	x	x	x
Coverage		x	x		x
Genotype Likelihood			x
Prop. Genotyped				x	x	x	x	x
HET & FIS & HWE				x	x		x
MAF					x	x	x	x
Missingness	x	x	x	x	x	x	x	x

Step 1 Quality Ask yourself these questions:

DNA quality, libraries quality, sequencing lanes quality ?
Please stop thinking in terms of quantity (e.g. millions of reads returned), and start thinking about actual quality of your new data.
Use quality metrics inside available software (e.g. fastqc)

Step 2 de novo assembly and genotyping

This is conducted outside stackr
Integrated software pipelines include: STACKS, pyRAD, dDocent, AftrRAD. If you want to develop your own pipeline, there are a multitude of approaches, good luck.

Step 3 Outliers

Remove replicates (I hope you have some).
Remove de novo assembly artifact, by creating blacklist of genotypes or whitelist of markers:
- individuals with more than 2 alleles (use summary_haplotypes)
- outlier markers with extreme number of SNP/read or haplotype (use filter_snp_number)
Remove potential duplicated samples that went off your radar, try detect_duplicate_genome.
Highlight outliers individual's heterozygosity that might represent mixed samples with detect_mixed_individuals.
The metric you're using: a de novo artefact or a reliable signal of biological polymorphism?
Should the statistic you are interested in be consistent throughout the read ?
Will you consider haplotype or snp level statistics?
The consistensies of SNPs statistics among haplotype can be throughly tested by using snp.ld argument in several stackr functions.
Any other outliers with different individual's or markers metrics (reads/sample, etc) ?

Step 4 Pattern of missingness

Use missing_visualization with/without your new blacklists (e.g. of genotypes, individuals) and with/without whitelist of markers to examine patterns of missingness in you dataset before more extensive filtering (there is a vignette for this step)
The trick here is to use the strata argument to find patterns associated with different variables of your study (lanes, chips, sequencers, populations, sample sites, reads/samples, etc).
Do you see a trend between your missing pattern and reads/samples ? Heterozygosity?
Do you need more sequencing? Do you have to re-run some lanes?

Step 5-6 Coverage and Genotype Likelihood

Coverage is an individual metric. With most software you'll find allele and genotype coverage info.
Genotype likelihood is usually a metric based on coverage of the different genotypes found in all of your data.
Good allele coverage is required for reliable genotypes.
Reliable genotypes is required for reliable downstream summary statistics.
Explore filtering options in filter_coverage and filter_genotype_likelihood.

Step 7 Prop. Genotyped

Do you have enough individuals in each sampling sites (filter_individual) and enough putative populations (filter_population) for each markers ?
Use blacklist of individuals with different thresholds.
Keep different whitelist of markers.
Use common.markers argument inside most of stackr functions to test the impact of vetting loci based on shared markers.
Use imputation methods provided by stackr (inside tidy_genomic_data or genomic_converter, as a separate module: stackr_imputations_module) to assess the impact of lowering or increasing threshold that impact missing data.

Step 8 HET, Fis, HWE

Overall and/or per populations hwe, heterozygosity and Fis statistics can highlight: de novo assembly problems (oversplitting/undermerging), genotyping problems or biological problems.
These filters allows to test rapidly if departure from realistic expectations are a problem for downstream analysis ?
Choose your threshold wisely and test impact on pipeline.
Use filter_het, filter_fis, filter_hwe and look again at the individual's heterozygosity (detect_mixed_individuals) for outliers.

Step 9 MAF

Remove artifactual and uninformative markers.
Use MAF arguments inside several of stackr functions to tailor MAF to your analysis tolerance to minor allelel frequencies.
There is also a separate filter in stackr: filter_maf

Step 10 Pattern of missingness, again

Use missing_visualization with your new blacklists (e.g. of genotypes, individuals) and with your whitelist of markers to examine patterns of missingness in your dataset after filtering (there is a vignette for this step)
The trick here is to use the strata argument to find patterns associated with different variables of your study (lanes, chips, sequencers, populations, sample sites, reads/samples, etc).
Do you see a trend between your missing pattern and reads/samples ? Heterozygosity?

anne-laureferchaud/stackr-1