/Metabarcodings_Signal_from_Noise

Code and data for the analysis in Distinguishing Signal from Noise: Understanding Patterns of Non-Detections to Inform Accurate Quantitative Metabarcoding

Primary language: R

Distinguishing Signal from Noise: Understanding Patterns of Non-Detections to Inform Accurate Quantitative Metabarcoding

Zachary Gold1,2, Andrew Olaf Shelton2, Helen R. Casendino3, Joe Duprey3, Ramón Gallego2, Amy Van Cise2, Mary Fisher4, Alexander J. Jensen2, Erin D’Agnese3, Elizabeth Andruszkiewicz Allan3, Ana Ramón-Laca2, Maya Garber-Yonts3, Michaela Labare5, Kim M. Parsons2, Ryan P. Kelly3

1 Cooperative Institute for Climate, Ocean, & Ecosystem Studies, UW, Seattle, WA
2 Northwest Fisheries Science Center, NMFS/NOAA, Seattle, WA
3 School of Marine and Environmental Affairs, UW, Seattle, WA
4 School of Aquatic Fisheries Science, UW, Seattle, WA
5 Scripps Institution of Oceanography, UCSD, La Jolla

Abstract

Correcting for amplification biases in genetic metabarcoding data can yield quantitative estimates of template DNA concentrations. However, a major source of uncertainty in metabarcoding data is the presence of non-detections, where a technical PCR replicate fails to detect a species observed in other replicates. Such non-detections are an important special case of variability among technical replicates in metabarcoding data, particularly in environmental samples. While many sampling and amplification processes underlie observed variation in metabarcoding data, understanding the causes of non-detections is an important step in distinguishing signal from noise in metabarcoding studies. Here, we use both simulated and empirical data to 1) develop a qualitative understanding of how non-detections arise in metabarcoding data, 2) outline steps to recognize uninformative data in practice, and 3) identify the conditions under which amplicon sequence data can reliably detect underlying biological signals. We show in both simulations and empirical data that, for a given species, the rate of non-detections among technical replicates is a function of both the template DNA concentration and species-specific amplification efficiency. Consequently, we conclude metabarcoding datasets are strongly affected by (1) deterministic amplification biases during PCR and (2) stochastic sampling of amplicons during sequencing — both of which we can model — but also by (3) stochastic sampling of rare molecules prior to PCR, which remains a frontier for quantitative metabarcoding. Our results highlight the importance of estimating species-specific amplification efficiencies and critically evaluating patterns of non-detection in metabarcoding datasets to better distinguish environmental signal from the noise inherent in molecular detections of rare targets.
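The three processes named in the abstract can be illustrated with a small simulation. The following is a minimal sketch and not the model used in the manuscript: it assumes Poisson sampling of template molecules into each PCR replicate, deterministic exponential amplification with a species-specific efficiency, and multinomial sampling of amplicons into reads. The function `simulate_replicates`, its arguments, and all parameter values are hypothetical and chosen only for illustration.

```r
# Illustrative simulation only; not the model used in the manuscript.
# Assumed processes:
#   1) Poisson sampling of template molecules into each technical replicate
#   2) deterministic amplification: copies * (1 + efficiency)^n_cycles
#   3) multinomial sampling of amplicons into sequencing reads
simulate_replicates <- function(lambda, efficiency, n_rep = 1000,
                                n_cycles = 35, read_depth = 5e4) {
  stopifnot(length(lambda) == length(efficiency))
  n_sp <- length(lambda)
  reads <- matrix(0L, nrow = n_rep, ncol = n_sp)
  for (i in seq_len(n_rep)) {
    templates <- rpois(n_sp, lambda)                     # step 1
    amplicons <- templates * (1 + efficiency)^n_cycles   # step 2
    if (sum(amplicons) > 0) {
      # step 3: rmultinom normalizes the amplicon counts to proportions
      reads[i, ] <- as.integer(rmultinom(1, read_depth, amplicons))
    }
  }
  reads
}

set.seed(1)
# Hypothetical community: one abundant species and three rare ones.
# lambda is the expected number of template molecules per reaction.
lambda     <- c(common = 1000, rare_hi = 2, rare_lo = 2, very_rare = 0.2)
efficiency <- c(common = 0.90, rare_hi = 0.90, rare_lo = 0.60, very_rare = 0.90)

reads <- simulate_replicates(lambda, efficiency)
colnames(reads) <- names(lambda)

# Proportion of technical replicates with zero reads for each species
nondetection_rate <- colMeans(reads == 0)
print(round(nondetection_rate, 3))
```

Under these assumed values, the abundant species should essentially never drop out, while the two rare species with the same expected template count differ in non-detection rate only because of their amplification efficiencies. Lowering the expected template count also raises the non-detection rate, even when efficiency is high.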

Description

This repository hosts the code and data for the Signal from Noise manuscript, which is currently in submission to PLOS Biology and will be made available as a pre-print. The repository includes:

  1. Code

    1. calcofi_signal_noise_20220820.Rmd: Performs most of the analyses of the empirical data sets and generates Figures 2 and 3 in the paper.
    2. mc31_organization_20210105.Rmd: Organizes the mock community data.
    3. mc31_organization_coastal_even_redo_20220408.Rmd: Organizes additional mock community data.
    4. taxonomy_matcher_12S_20210106.Rmd: Creates the final taxonomic paths for the mock community data.
    5. taxonomy_matcher_12S_mock_even_redo_20220408.Rmd: Creates the final taxonomic paths for the additional mock community data.
  2. Data

    1. All_amp_efficiencies-2022-06-03.csv: Calculated amplification efficiencies from the mock community data.
    2. input_dna_conc_communities_20210103.csv: Mock community metadata, including starting input concentrations of DNA.
    3. microscopy_tech_nReads.RDS: Microscopy data from Gold et al. 2022. See the manuscript for a full description of the data and how they were generated.
    4. mifish_mock_community_data.RDS: Mock community data.
    5. mifish_tech_nReads.RDS: Metabarcoding data from Gold et al. 2022. See the manuscript for a full description of the data and how they were generated.

    6. mock_sequences
      1. CRUX_DB
        1. Global and local reference databases from Gold et al. 2021
        2. Output fasta files from taxonomy_matcher*.rmd scripts
        3. Blast output from salmon sequences
      2. hash.key_updated_c19.RDS: Final updated taxonomy table after resolving conflicts.
      3. hash.key_updated.RDS: Anacapa-derived taxonomy table.
      4. mock_even_redo
        1. c19_fishcard_ASV_raw_taxonomy_60.txt: Anacapa output for the mock community using the global reference database
        2. metadata_kenai1_20220408.csv: Metadata file
      5. updated_0106
        1. 12S_fishcard_taxonomy_tables: Anacapa output for the mock community using the local California Current Large Marine Ecosystem reference database
        2. c19_fishcard_taxonomy_tables: Anacapa output for the mock community using the global reference database
        3. p16S_shark_taxonomy_tables: Anacapa output for the mock community using a 16S fish reference database (data not used in this manuscript)
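The data files listed above are either .csv or .RDS files, both of which can be read with base R. The snippet below is a minimal sketch of how they might be loaded for inspection. The helper `read_repo_file` is hypothetical and not part of the repository, and the `data` directory is an assumed location that may differ from the actual layout of your clone.

```r
# Hypothetical helper (not part of the repository):
# read a data file based on its extension.
read_repo_file <- function(path) {
  if (!file.exists(path)) stop("File not found: ", path)
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    rds = readRDS(path),
    csv = read.csv(path, stringsAsFactors = FALSE),
    stop("Unsupported file type: ", ext)
  )
}

# Assumed location of the data files; adjust to your local clone.
data_dir <- "data"
files <- c("All_amp_efficiencies-2022-06-03.csv",
           "mifish_mock_community_data.RDS")

for (f in files) {
  p <- file.path(data_dir, f)
  if (file.exists(p)) {
    obj <- read_repo_file(p)
    str(obj, max.level = 1)  # show the top-level structure only
  } else {
    message("Skipping (not found at assumed path): ", p)
  }
}
```

Because the internal structure of each file is described in the manuscript rather than here, the sketch only prints the top-level structure and makes no assumptions about column names.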

This GitHub repository will be updated with pre-print, NCBI SRA, and Dryad information as each becomes available during the review process.