benjjneb/dada2

losing reads during denoising

Hendriek opened this issue · 8 comments

Running dada (with default settings) causes me to lose 16% of my reads. Looking at the description in the paper, I would expect that with the default settings of OMEGA_A = 10^-40 == OMEGA_C, all sequences would be assigned to a cluster (ASV), since with these settings a sequence is either made into a cluster of its own when the probability is < 10^-40, or added to an existing cluster when it is >= 10^-40. However, I find that 16% of my reads are not assigned to any cluster. If I set OMEGA_C to zero, all sequences are added to a cluster and I lose no reads, but I would expect this to happen with the default settings too. Could you please explain what I am missing in my understanding of how the algorithm works?

The effect you are seeing is largely driven by singleton ASVs, which are not allowed to form their own clusters, and which may also be left uncorrected when their p-value with respect to the ASV they would otherwise be assigned to is less than the threshold of 10^-40.

Losing 16% of your reads at this stage is higher than typical. One thing to consider is that if you are in an extremely diverse community (e.g. soil), pooling or pseudo-pooling (recommended) of your samples might help in identifying the rare ASVs in each sample.
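
For example, a minimal sketch of pseudo-pooled inference (the object names derepFs and errF here are placeholders for your own list of dereplicated samples and learned error model):

library(dada2)
# Pseudo-pooling: samples are first processed independently, then reprocessed
# with priors informed by the ASVs detected across all samples, which helps
# recover rare ASVs in diverse communities.
ddFs <- dada(derepFs, err = errF, pool = "pseudo")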

This is indeed a soil sample, so that might explain the relatively high loss. Thank you for the tip to use pseudo-pooling, I will try that!

I am reopening the issue, as losing singletons does not seem to completely explain the loss. Below are the abundances of the original sequences that were not assigned to an ASV. Only about a quarter of the reads are singletons, although none of the sequences has a really high abundance. Might the cut-off not be singletons, but everything with abundance below 50?

table(abundance.not.assigned)
abundance:          1    2    3    4    5    6    7    8    9   10   11   12   13   18   19   20   21   22   23   26   27   29   33   48
no. of uniques:  1766  958  362  141   65   33   30    4   10   11   15    1    1    2    2    2    2    1    2    1    2    1    1    1

Below is the code I used to make abundance.not.assigned:

map2 <- ddF@.Data
map3 <- map2[[7]]                           # list element holding the map from unique sequences to ASVs (NA = not assigned)
ori <- derepF@.Data[[1]]                    # list element holding the abundance of each dereplicated unique sequence
abundance.not.assigned <- ori[is.na(map3)]  # abundances of the unique sequences not assigned to any ASV
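
The same quantities can also be obtained with the documented $ accessors of the derep-class and dada-class objects (a minimal equivalent sketch, assuming the same derepF and ddF objects):

ori <- derepF$uniques                      # abundance of each dereplicated unique sequence
map <- ddF$map                             # ASV index each unique sequence was assigned to (NA = not assigned)
abundance.not.assigned <- ori[is.na(map)]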

Might the cut-off not be singletons, but everything with abundance below 50?

The cutoff is based on a p-value threshold, and is most relevant for singletons, but it will also affect some other relatively infrequent unique sequences. The distribution of uncorrected sequences looks largely in line with that (although with more non-singletons than in a couple of datasets I have looked at).

I'd repeat my previous suggestion if this is a concern:

One thing to consider is that if you are in an extremely diverse community (e.g. soil), pooling or pseudo-pooling (recommended) of your samples might help in identifying the rare ASVs in each sample.

And you can also always force the correction of all reads by setting OMEGA_C = 0 (see ?setDadaOpt for more information on this).
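
For example (a minimal sketch; derepF and errF stand in for your own dereplicated reads and learned error model):

library(dada2)
# OMEGA_C = 0 forces every unique sequence inferred to contain errors to be
# corrected to its most likely ASV, so no reads are discarded at this step.
ddF <- dada(derepF, err = errF, OMEGA_C = 0)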

Dear Benjamin,
I understand the usefulness of pseudo-pooling in this case. However, my issue is more that I would also like to understand how the algorithm works.
I would expect that with the default settings of OMEGA_A = 10^-40 == OMEGA_C, all sequences (with the exception of singletons) would be assigned to a cluster (ASV), since with these settings a sequence is either made into a cluster of its own when the probability is < 10^-40, or added to an existing cluster when it is >= 10^-40. So my question is basically: what exactly do OMEGA_A and OMEGA_C mean?

However, my issue is more that I would also like to understand how the algorithm works.

Gotcha. What is happening here is caused by a difference between the p-value used to create new partitions (i.e. identify new ASVs) and that used to decide whether or not to correct reads that were inferred to contain errors.

When creating new partitions (ASVs), the p-value for a unique sequence being inconsistent with the error model is calculated conditional on the fact that the sequence was observed. Effectively, this means that the first read of that sequence is ignored, and that abundance - 1 is used to calculate the p-value. In contrast, the p-value used to decide whether or not to correct the read is not conditional, and uses the full abundance of that unique sequence. This is because it is a "post-hoc" decision made once the true ASVs have been inferred, rather than part of the initial inference of the ASVs, in which the potential universe of ASVs is theoretically any sequence.

This means that sequences with an abundance N that give a p-value above 10^-40 when using abundance N - 1, but a p-value below 10^-40 when using abundance N, are not corrected. In practice, these are unique sequences right on the boundary of the algorithm's ability to determine whether they are true ASVs or errors (at a given threshold), and so the decision is made to discard them rather than correct them.
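
As an illustrative sketch (not the dada2 internals; lambda, n and a below are made-up placeholder values), the two decisions compare different Poisson tail probabilities against the two thresholds:

lambda <- 1e-9        # assumed per-read probability that the ASV is misread as exactly this sequence
n      <- 2e4         # assumed number of reads in that ASV's partition
mu     <- lambda * n  # expected number of reads of this sequence produced by errors
a      <- 3           # observed abundance of the unique sequence

# New-partition (new-ASV) decision: conditional on the sequence having been
# observed, so effectively abundance a - 1 is compared against OMEGA_A.
p_new_asv <- ppois(a - 2, mu, lower.tail = FALSE)   # P(X >= a - 1)

# Correction decision: the full abundance a is compared against OMEGA_C.
p_correct <- ppois(a - 1, mu, lower.tail = FALSE)   # P(X >= a)

# A sequence with p_new_asv >= OMEGA_A (not split off as a new ASV) but
# p_correct < OMEGA_C (too unlikely to be explained by errors alone) is
# neither kept as an ASV nor corrected, and its reads are dropped.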

Hope that helps! Happy to get into more technical detail as well, now that I understand that is where your interest lies.

Thank you very much for your clear answer; I now understand what is happening. At the moment I have no further questions. Thanks again,
Hendriek