radiator: an R package for RADseq Data Exploration, Manipulation and Visualization

This is the development page of radiator. If you want to help, see the Contributions section.

Most genomic analyses look for patterns and trends with various statistics. Bias, noise and outliers can have bounded influence on estimators and interfere with polymorphism discovery. Avoid bad data exploration and control the impact of filters on your downstream genetic analysis. Use radiator to import, explore, manipulate, visualize, filter, impute and export your GBS/RADseq data.

radiator is designed and optimized for fast computations using the Genomic Data Structure (GDS) file format and data science packages in the tidyverse. radiator handles VCF files with millions of SNPs and files of several GB.

Installation

To try out the dev version of radiator, copy/paste the code below:

if (!require("devtools")) install.packages("devtools") # to install
devtools::install_github("thierrygosselin/radiator")
library(radiator)

Learning radiator

See if radiator has the right tools for you:

1. Prepare a strata file

It's a tab separated file, e.g. radiator.strata.tsv.
A minimum of 2 columns: INDIVIDUALS and STRATA is required.
The STRATA column identifies the stratification of individuals, i.e. the hierarchical groupings: populations, sampling sites or any grouping you want.
It's like a stacks population map file, but with a header.

To make sure it's going to work properly, try reading it in R with:

strata <- radiator::read_strata("my.strata.tsv")
names(strata)
# more details with `??radiator::read_strata`
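
If you do not have a strata file yet, one way to build it is from a data frame. The following is a minimal sketch, not part of radiator: the sample IDs, group names and temporary file path are hypothetical, and only base R functions (`write.table`, `read.delim`) are used.

```r
# Hypothetical sample IDs and groupings, for illustration only.
strata <- data.frame(
  INDIVIDUALS = c("ind_001", "ind_002", "ind_003", "ind_004"),
  STRATA = c("pop_A", "pop_A", "pop_B", "pop_B"),
  stringsAsFactors = FALSE
)

# Write a tab-separated file with a header row, no row names and no quotes.
strata_file <- file.path(tempdir(), "radiator.strata.tsv")
write.table(strata, file = strata_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back with base R to check the layout before handing it to radiator.
check <- read.delim(strata_file, stringsAsFactors = FALSE)
stopifnot(identical(names(check), c("INDIVIDUALS", "STRATA")))
table(check$STRATA)
```

The resulting file path can then be passed to `radiator::read_strata` as shown above.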

2. Filter your RADseq data: ONE FUNCTION TO RULE THEM ALL

data <- radiator::filter_rad(data = "my.vcf", strata = "my.strata.tsv", output = c("genind", "hierfstat"))

There's a built-in interactive mode that requires users to visualize the data before choosing thresholds.
The function is made of modules (see below) that users can access separately or in combination.
Use magrittr %>% to chain filtering functions together and dig deeper into your data (see the vignettes).
But remember, for 95% of users, filter_rad will be enough to start exploring the biology!
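
To illustrate what chaining filter steps means, here is a toy sketch in base R. It does not call radiator: `drop_monomorphic` and `drop_low_call_rate` are hypothetical stand-ins written for this example, the genotype table is made up, and the 0.8 call-rate threshold is arbitrary. The sketch uses R's native pipe `|>` (R >= 4.1) so it runs without extra packages; magrittr's `%>%` reads the same way here.

```r
# Toy long-format genotypes: one row per marker x individual.
# GT is the count of the alternate allele (0, 1, 2), NA when missing.
geno <- data.frame(
  MARKERS = rep(c("m1", "m2", "m3"), each = 4),
  INDIVIDUALS = rep(c("ind_001", "ind_002", "ind_003", "ind_004"), times = 3),
  GT = c(0, 1, 2, 1,     # m1: polymorphic, fully genotyped
         0, 0, 0, 0,     # m2: monomorphic
         1, NA, NA, 2),  # m3: polymorphic, 50% call rate
  stringsAsFactors = FALSE
)

# Hypothetical stand-in: keep markers where both alleles are observed.
drop_monomorphic <- function(x) {
  alt <- tapply(x$GT, x$MARKERS, sum, na.rm = TRUE)   # alternate allele count
  called <- tapply(!is.na(x$GT), x$MARKERS, sum)      # genotyped individuals
  keep <- names(alt)[alt > 0 & alt < 2 * called]
  x[x$MARKERS %in% keep, , drop = FALSE]
}

# Hypothetical stand-in: keep markers genotyped in enough individuals.
drop_low_call_rate <- function(x, min_call_rate = 0.8) {
  rate <- tapply(!is.na(x$GT), x$MARKERS, mean)
  keep <- names(rate)[rate >= min_call_rate]
  x[x$MARKERS %in% keep, , drop = FALSE]
}

# Each step takes a data frame and returns a filtered data frame,
# so the steps can be chained in any order.
filtered <- geno |>
  drop_monomorphic() |>
  drop_low_call_rate(min_call_rate = 0.8)

unique(filtered$MARKERS)  # only "m1" remains
```

The design point is that every step has the same input and output shape, which is what makes the filters composable.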

Overview

Import:
List of the 11 supported input genomic file formats and their variations:
VCF: SNPs and haplotypes (Danecek et al., 2011)
DArT files (5): genotypes in 1 row, allele counts and coverage in 2 rows, SilicoDArT genotypes and counts
PLINK: bed/tped/tfam (Purcell et al., 2007)
genind (Jombart et al., 2010; Jombart and Ahmed, 2011)
genlight (Jombart et al., 2010; Jombart and Ahmed, 2011)
strataG gtypes (Archer et al., 2016)
Genepop (Raymond and Rousset, 1995; Rousset, 2008)
STACKS haplotype file (Catchen et al., 2011, 2013)
hierfstat (Goudet, 2005)
SeqArray (Zheng et al., 2017)
SNPRelate (Zheng et al., 2012)
Dataframes of genotypes in wide or long/tidy format
Reading and tidying is found inside: `genomic_converter`, `tidy_` and `read_` functions

Output:
26 genomic data formats can be exported out of radiator. The module responsible for this is `genomic_converter`. Separate modules handle the different formats and are also available:
`write_vcf`: VCF (Danecek et al., 2011)
`write_plink`: PLINK tped/tfam (Purcell et al., 2007)
`write_genind`: adegenet genind and genlight (Jombart et al., 2010; Jombart and Ahmed, 2011)
`write_genlight`: genlight (Jombart et al., 2010; Jombart and Ahmed, 2011)
`write_gsi_sim`: gsi_sim (Anderson et al., 2008)
`write_gtypes`: strataG gtypes (Archer et al., 2016)
`write_colony`: COLONY (Jones and Wang, 2010; Wang, 2012)
`write_genepop`: Genepop (Raymond and Rousset, 1995; Rousset, 2008)
STACKS haplotype file (Catchen et al., 2011, 2013)
`write_betadiv`: betadiv (Lamy, 2015)
`vcf2dadi`: δaδi (Gutenkunst et al., 2009)
`write_structure`: structure (Pritchard et al., 2000)
`write_faststructure`: faststructure (Raj & Pritchard, 2014)
`write_arlequin`: Arlequin (Excoffier et al., 2005)
`write_hierfstat`: hierfstat (Goudet, 2005)
`write_snprelate`: SNPRelate (Zheng et al., 2012)
`write_seqarray`: SeqArray (Zheng et al., 2017)
`write_bayescan`: BayeScan (Foll and Gaggiotti, 2008)
`write_pcadapt`: pcadapt (Luu et al., 2017)
`write_hzar`: HZAR (Derryberry et al., 2013)
`write_fineradstructure`: fineRADstructure (Malinsky et al., 2018)
`write_related`: related (Pew et al., 2015)
`write_stockr`: stockR package (Foster et al., submitted)
`write_maverick`: MavericK (Verity & Nichols, 2016)
`write_ldna`: LDna (Kemppainen et al., 2015)
Dataframes of genotypes in wide or long/tidy format

Conversion function:
`genomic_converter` imports and exports the genomic formats mentioned above. The function is also integrated with useful filters, blacklists and whitelists.

Outliers detection:
`detect_duplicate_genomes`: detect and remove duplicate individuals from your dataset.
`detect_mixed_genomes`: detect and remove potentially mixed individuals.
`stackr::summary_haplotype` and `filter_snp_number`: discard outlier markers with de novo assembly artifacts (e.g. markers with an extreme number of SNPs per haplotype or with an irregular number of alleles).

Filters:
Alleles, genotypes, markers, individuals and populations, along with their associated metrics and statistics, can be filtered and/or selected in several ways inside the main filtering function `filter_rad` and/or the underlying modules:
`filter_rad`: designed for RADseq data, it's the one function to rule them all. Best used with an unfiltered or very lightly filtered VCF (or other listed input) file. The function can handle very large VCF files (e.g. no problem with > 2M SNPs, > 30GB files), all within R!!
`filter_dart_reproducibility`: blacklist markers under a certain threshold of the DArT reproducibility metric.
`filter_monomorphic`: blacklist markers with only 1 morph.
`filter_common_markers`: keep only markers common between strata.
`filter_individuals`: blacklist individuals based on missingness, heterozygosity and/or total coverage.
`filter_mac`: blacklist markers based on minor/alternate allele count.
`filter_coverage`: blacklist markers based on mean read depth (coverage).
`filter_genotype_likelihood`: discard markers based on genotype likelihood.
`filter_genotyping`: blacklist markers based on genotyping/call rate.
`filter_snp_position_read`: blacklist markers based on the SNP position on the read/locus.
`filter_snp_number`: blacklist loci with too many SNPs.
`filter_ld`: blacklist markers based on short and/or long distance linkage disequilibrium.
`filter_hwe`: blacklist markers based on Hardy-Weinberg Equilibrium expectations (HWE).
`filter_het`: blacklist markers based on the observed heterozygosity (Het obs).
`filter_fis`: blacklist markers based on the inbreeding coefficient (Fis).
`filter_whitelist`: keep only markers present in a whitelist.

ggplot2-based plotting:
Visualize the distribution of important metrics and statistics and create publication-ready figures.

Parallel:
Code designed and optimized for fast computations using the Genomic Data Structure (GDS) file format and data science packages in the tidyverse. Works with all OS: Linux, Mac and now PC!

Life cycle

DArT users:

filter_dart is now deprecated. Please use filter_rad.
tidy_dart and tidy_silico_dart are now deprecated. Please use read_dart for all 4 DArT file types recognized by radiator.

Missing data: visualization and imputations

Visualizing missing data and its imputation requires special attention that falls outside the scope of radiator. Inside my package called grur, users can visualize patterns of missingness associated with different variables (lanes, chips, sequencers, populations, sample sites, reads/samples, homozygosity, etc). Several map-independent imputations of missing genotypes are available: Random Forests (on-the-fly imputations or predictive modeling), Extreme Gradient Tree Boosting, and Strawman imputations (~ max/mean/mode: the most frequently observed, non-missing genotype is used). Imputations can be conducted over all samples or by populations/strata/grouping. radiator::genomic_converter is integrated with the imputation function of grur.

Prerequisite - Suggestions - Troubleshooting

Parallel computing: follow the steps in this notebook vignette to install the packages with OpenMP-enabled compiler and conduct imputations in parallel.
Installation problems.
Windows users: Install Rtools.
The R GUI is unstable with functions that run in parallel (more info), so I recommend using RStudio for a better experience.
Using my R Notebooks: use the option to run chunks of code in the console, not inline.

Vignettes, R Notebooks and examples

Vignettes (in development, check periodically for updates):

Vignettes with real data, for example in the form of R Notebooks, take too much space to be included in the package without CRAN complaining. Consequently, vignettes are gradually being excluded from the package and distributed separately; follow the links below.
installation problems notebook vignette
parallel computing during imputations notebook vignette
vcf2dadi Rmd or html

R Notebooks:

Missing data visualization and analysis (html vignette) and (Rmd)

Citation:

To get the citation, inside R:

citation("radiator")

New features

The change log, versions, new features and bug history live in the NEWS.md file.

Roadmap of future developments:

Updated filters: more efficient, interactive and visualization included: in progress.
Workflow tutorial that links functions and points to specific vignettes to further explore some problems: in progress
Use Shiny and ggvis (when subplots and/or facets become available for ggvis).
Until publication, radiator will change rapidly; stay updated with releases and contribute with bug reports.
Suggestions?

Contributions:

This package has been developed in the open, and it wouldn’t be nearly as good without your contributions. There are a number of ways you can help me make this package even better:

If you don’t understand something, please let me know.
Your feedback on what is confusing or hard to understand is valuable.
If you spot a typo, feel free to edit the underlying page and send a pull request.

New to pull requests on GitHub? The process is very easy:

Click "Edit this page" on the sidebar.
Make the changes using GitHub’s in-page editor and save.
Submit a pull request and include a brief description of your changes.
“Fixing typos” is perfectly adequate.
