RADseq Data Exploration, Manipulation and Visualization using R

Project status: Active. The project has reached a stable, usable state and is being actively developed.


radiator: an R package for RADseq Data Exploration, Manipulation and Visualization

This is the development page of radiator. If you want to help, see the Contributions section.

Most genomic analyses look for patterns and trends with various statistics. Bias, noise and outliers can have bounded influence on estimators and interfere with polymorphism discovery. Avoid bad data exploration and control the impact of filters on your downstream genetic analysis. Use radiator to import, explore, manipulate, visualize, filter, impute and export your GBS/RADseq data.

radiator is designed and optimized for fast computations using the Genomic Data Structure (GDS) file format and the data science packages in the tidyverse. radiator handles VCF files with millions of SNPs and files of several GB.

Installation

To try out the dev version of radiator, copy/paste the code below:

if (!require("devtools")) install.packages("devtools") # to install
devtools::install_github("thierrygosselin/radiator")
library(radiator)
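
To confirm the installation went through, a quick check along these lines can help. It is only a sketch that uses base R helpers (`requireNamespace` and `utils::packageVersion`) and does not call any radiator function:

```r
# Check whether radiator can be loaded, without attaching it to the session.
radiator_installed <- requireNamespace("radiator", quietly = TRUE)

if (radiator_installed) {
  # Report the installed version, useful when filing a bug report.
  message("radiator version: ", as.character(utils::packageVersion("radiator")))
} else {
  message("radiator is not installed yet, re-run the installation code above.")
}
```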

Learning radiator

See if radiator has the right tools for you:

1. Prepare a strata file

  • It's a tab-separated file, e.g. radiator.strata.tsv.
  • A minimum of 2 columns is required: INDIVIDUALS and STRATA.
  • The STRATA column identifies the stratification of the individuals, i.e. the hierarchical groupings: populations, sampling sites or any grouping you want.
  • It's like the stacks population map file, but with a header.
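
If you don't have a strata file yet, the layout described above can be produced with base R. The sketch below is only an illustration: the sample IDs and population names are hypothetical, and the file is written to a temporary directory.

```r
# Hypothetical sample IDs and groupings, for illustration only.
strata <- data.frame(
  INDIVIDUALS = c("IND-001", "IND-002", "IND-003", "IND-004"),
  STRATA = c("POP_A", "POP_A", "POP_B", "POP_B"),
  stringsAsFactors = FALSE
)

# Write a tab-separated file with a header row, no quotes and no row names.
strata_file <- file.path(tempdir(), "radiator.strata.tsv")
write.table(strata, file = strata_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back with base R to confirm the two required columns are present.
check <- read.delim(strata_file, stringsAsFactors = FALSE)
stopifnot(identical(names(check), c("INDIVIDUALS", "STRATA")))

# Number of individuals per stratum.
table(check$STRATA)
```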

To make sure it's going to work properly, try reading it in R with:

strata <- radiator::read_strata("my.strata.tsv")
names(strata)
# more details with `??radiator::read_strata`

2. Filter your RADseq data: ONE FUNCTION TO RULE THEM ALL

data <- radiator::filter_rad(data = "my.vcf", strata = "my.strata.tsv", output = c("genind", "hierfstat"))
  • There's a built-in interactive mode that requires users to visualize the data before choosing thresholds.
  • The function is made of modules (see below) that users can access separately or in combination.
  • Use magrittr %>% to chain filtering functions together and dig deeper into your data (see the vignettes).
  • But remember, for 95% of users, filter_rad will be enough to start exploring the biology!

Overview

Characteristics: Description
Import: List of the 11 supported input genomic file formats and their variations:
VCF: SNPs and haplotypes (Danecek et al., 2011)
DArT files (5): genotypes in 1 row, allele counts and coverage in 2 rows, SilicoDArT genotypes and counts
PLINK: bed/tped/tfam (Purcell et al., 2007)
genind (Jombart et al., 2010; Jombart and Ahmed, 2011)
genlight (Jombart et al., 2010; Jombart and Ahmed, 2011)
strataG gtypes (Archer et al., 2016)
Genepop (Raymond and Rousset, 1995; Rousset, 2008)
STACKS haplotype file (Catchen et al., 2011, 2013)
hierfstat (Goudet, 2005)
SeqArray (Zheng et al., 2017)
SNPRelate (Zheng et al., 2012)
Dataframes of genotypes in wide or long/tidy format
Reading and tidying are handled by genomic_converter and the tidy_ and read_ functions
Output: 26 genomic data formats can be exported out of radiator. The module responsible for this is genomic_converter. Separate modules handle the different formats and are also available:
write_vcf: VCF (Danecek et al., 2011)
write_plink: PLINK tped/tfam (Purcell et al., 2007)
write_genind: adegenet genind and genlight (Jombart et al., 2010; Jombart and Ahmed, 2011)
write_genlight: genlight (Jombart et al., 2010; Jombart and Ahmed, 2011)
write_gsi_sim: gsi_sim (Anderson et al. 2008)
write_gtypes: strataG gtypes (Archer et al. 2016)
write_colony: COLONY (Jones and Wang, 2010; Wang, 2012)
write_genepop: Genepop (Raymond and Rousset, 1995; Rousset, 2008)
STACKS haplotype file (Catchen et al., 2011, 2013)
write_betadiv: betadiv (Lamy, 2015)
vcf2dadi: δaδi (Gutenkunst et al., 2009)
write_structure: structure (Pritchard et al., 2000)
write_faststructure: faststructure (Raj & Pritchard, 2014)
write_arlequin: Arlequin (Excoffier et al. 2005)
write_hierfstat: hierfstat (Goudet, 2005)
write_snprelate: SNPRelate (Zheng et al. 2012)
write_seqarray: SeqArray (Zheng et al. 2017)
write_bayescan: BayeScan (Foll and Gaggiotti, 2008)
write_pcadapt: pcadapt (Luu et al. 2017)
write_hzar: HZAR (Derryberry et al. 2013)
write_fineradstructure: fineRADstructure (Malinsky et al., 2018)
write_related: related (Pew et al., 2015)
write_stockr: stockR package (Foster et al., submitted)
write_maverick: MavericK (Verity & Nichols, 2016)
write_ldna: LDna (Kemppainen et al. 2015)
Dataframes of genotypes in wide or long/tidy format
Conversion function: genomic_converter imports and exports the genomic formats mentioned above. The function is also integrated with useful filters, blacklists and whitelists.
Outlier detection:
detect_duplicate_genomes: detect and remove duplicate individuals from your dataset
detect_mixed_genomes: detect and remove potentially mixed individuals
stackr::summary_haplotype and filter_snp_number: discard outlier markers with de novo assembly artifacts (e.g. markers with an extreme number of SNPs per haplotype or with an irregular number of alleles)
Filters: alleles, genotypes, markers, individuals and populations, along with their associated metrics and statistics, can be filtered and/or selected in several ways inside the main filtering function filter_rad and/or the underlying modules:

filter_rad: designed for RADseq data, it's the one function to rule them all. Best used with an unfiltered or very lightly filtered VCF (or other listed input) file. The function can handle very large VCF files (e.g. no problem with > 2M SNPs, > 30GB files), all within R!
filter_dart_reproducibility: blacklist markers under a certain threshold of the DArT reproducibility metric.
filter_monomorphic: blacklist markers with only 1 morph.
filter_common_markers: keep only markers common between strata.
filter_individuals: blacklist individuals based on missingness, heterozygosity and/or total coverage.
filter_mac: blacklist markers based on minor/alternate allele count.
filter_coverage: blacklist markers based on mean read depth (coverage).
filter_genotype_likelihood: blacklist markers based on genotype likelihood.
filter_genotyping: blacklist markers based on genotyping/call rate.
filter_snp_position_read: blacklist markers based on the SNP position on the read/locus.
filter_snp_number: blacklist loci with too many SNPs.
filter_ld: blacklist markers based on short and/or long distance linkage disequilibrium.
filter_hwe: blacklist markers based on Hardy-Weinberg Equilibrium expectations (HWE).
filter_het: blacklist markers based on the observed heterozygosity (Het obs).
filter_fis: blacklist markers based on the inbreeding coefficient (Fis).
filter_whitelist: keep only markers present in a whitelist.
ggplot2-based plotting: Visualize the distribution of important metrics and statistics and create publication-ready figures
Parallel: Code designed and optimized for fast computations using the Genomic Data Structure (GDS) file format and the data science packages in the tidyverse. Works with all OS: Linux, Mac and now PC!
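
To make the blacklist/whitelist vocabulary of the filters above more concrete, here is a small base R sketch on a toy dataset. It is not radiator's implementation and calls no radiator function. The column names, the genotype coding (number of alternate alleles) and the thresholds are assumptions chosen for illustration. It mimics the idea behind a monomorphic filter, a minor allele count filter and a genotyping/call rate filter.

```r
# Toy tidy dataset: one row per marker x individual.
# GT is the number of alternate alleles (0, 1, 2), NA when the genotype is missing.
geno <- data.frame(
  MARKERS = rep(c("M1", "M2", "M3"), each = 4),
  INDIVIDUALS = rep(c("I1", "I2", "I3", "I4"), times = 3),
  GT = c(0, 1, 2, 1,    # M1: polymorphic, fully genotyped
         0, 0, 0, 0,    # M2: monomorphic
         1, NA, NA, 0), # M3: half of the genotypes are missing
  stringsAsFactors = FALSE
)

# Per-marker statistics.
by_marker <- split(geno$GT, geno$MARKERS)
stats <- data.frame(
  MARKERS = names(by_marker),
  CALL_RATE = vapply(by_marker, function(x) mean(!is.na(x)), numeric(1)),
  ALT_COUNT = vapply(by_marker, function(x) sum(x, na.rm = TRUE), numeric(1)),
  N_CALLED = vapply(by_marker, function(x) sum(!is.na(x)), numeric(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Minor allele count: the smaller of the alternate and reference allele counts
# (diploid data, so 2 alleles per genotyped individual).
stats$MAC <- pmin(stats$ALT_COUNT, 2 * stats$N_CALLED - stats$ALT_COUNT)

# Hypothetical thresholds: drop monomorphic markers (MAC of 0)
# and markers genotyped in fewer than 75% of individuals.
blacklist <- stats$MARKERS[stats$MAC == 0 | stats$CALL_RATE < 0.75]
whitelist <- setdiff(stats$MARKERS, blacklist)

# Keep only whitelisted markers in the tidy data.
geno_filtered <- geno[geno$MARKERS %in% whitelist, ]
```

With real data, thresholds like these should be chosen after looking at the distributions, which is what the interactive mode of filter_rad is for.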

More in radiator workflow below

Life cycle

DArT users:

  • filter_dart: is now deprecated. Please use filter_rad.
  • tidy_dart and tidy_silico_dart: are now deprecated. Please use read_dart for all the 4 DArT files recognized by radiator.

Missing data: visualization and imputations

Visualizing missing data and its imputation requires special attention that falls outside the scope of radiator. Inside my package called grur, users can visualize patterns of missingness associated with different variables (lanes, chips, sequencers, populations, sample sites, reads/samples, homozygosity, etc.). Several map-independent imputations of missing genotypes are available: Random Forests (on-the-fly imputations or predictive modeling), Extreme Gradient Tree Boosting, and Strawman imputations (~ max/mean/mode: the most frequently observed non-missing genotype is used). Imputations can be conducted over all samples or by populations/strata/grouping. radiator::genomic_converter is integrated with the imputation function of grur.
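
As a rough illustration of the simplest strategy mentioned above (the mode, i.e. the most frequently observed non-missing genotype), here is a base R sketch on toy data. It does not use grur or radiator and is not how grur implements it. The data, column names and helper functions (`mode_genotype`, `impute_mode`) are hypothetical. It shows how imputing over all samples and imputing by strata can give different answers.

```r
# Toy genotypes for a single marker; NA marks a missing genotype.
geno <- data.frame(
  INDIVIDUALS = paste0("I", 1:8),
  STRATA = rep(c("POP_A", "POP_B"), each = 4),
  GT = c("0/0", "0/0", NA, "0/1",
         "1/1", NA, "1/1", "1/1"),
  stringsAsFactors = FALSE
)

# Most frequently observed non-missing genotype (first one wins on ties).
mode_genotype <- function(x) {
  counts <- table(x) # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Replace missing genotypes with the mode; leave x untouched if everything is missing.
impute_mode <- function(x) {
  if (all(is.na(x))) return(x)
  x[is.na(x)] <- mode_genotype(x)
  x
}

# Imputation over all samples: the overall mode is "1/1".
geno$GT_OVERALL <- impute_mode(geno$GT)

# Imputation by strata: each population uses its own mode.
geno$GT_BY_STRATA <- ave(geno$GT, geno$STRATA, FUN = impute_mode)

geno[is.na(geno$GT), ]
```

Individual I3 (POP_A) is imputed as 1/1 when all samples are pooled, but as 0/0 when the imputation is done within its own population.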

Prerequisite - Suggestions - Troubleshooting

Vignettes, R Notebooks and examples

Vignettes (in development, check periodically for updates):

  • Vignettes built from real data, in the form of R Notebooks, take too much space to be included in the package without CRAN complaining. Consequently, vignettes are gradually being moved out of the package and distributed separately; follow the links below.
  • installation problems notebook vignette
  • parallel computing during imputations notebook vignette
  • vcf2dadi Rmd or html

R Notebooks:

Citation:

To get the citation, inside R:

citation("radiator")

New features

The change log, versions, new features and bug history live in the NEWS.md file.

Roadmap of future developments:

  • Updated filters that are more efficient and interactive, with visualization included: in progress.
  • Workflow tutorial that links functions and points to specific vignettes to further explore some problems: in progress.
  • Use Shiny and ggvis (when subplots and/or facets become available for ggvis).
  • Until publication, radiator will change rapidly. Stay updated with releases and contribute with bug reports.
  • Suggestions?

Contributions:

This package has been developed in the open, and it wouldn’t be nearly as good without your contributions. There are a number of ways you can help me make this package even better:

  • If you don’t understand something, please let me know.
  • Your feedback on what is confusing or hard to understand is valuable.
  • If you spot a typo, feel free to edit the underlying page and send a pull request.

New to pull requests on GitHub? The process is very easy:

  • Click "edit this page" on the sidebar.
  • Make the changes using GitHub’s in-page editor and save.
  • Submit a pull request and include a brief description of your changes.
  • “Fixing typos” is perfectly adequate.