davismcc/archive-scater

Add warning if cpm supplied as well as counts to newSCESet()

Closed this issue · 3 comments

In hierarchy of expression values, cpm will be computed from counts, and recomputed in subsequent normalization steps.

LTLA commented

We should add a section in ?newSCESet detailing this behaviour. Namely, the hierarchy of values is:

tpm = counts > cpm > fpkm > exprs

where values lower in the hierarchy will be computed from values higher in the hierarchy. Values higher in the hierarchy will never be computed from those lower in the hierarchy. If two values are supplied (excepting tpm and counts), a warning will be raised and only the higher-level value will be retained. Otherwise the values might not be consistent and would change upon recalculation.

Recalculation is done by normalize.SCESet and uses the highest-level value that is available. So, for example, exprs will be calculated from counts if they are available, but will be computed from cpm or fpkm if not. We should also use the total_counts in the QC metrics to compute the CPMs, otherwise they might change unpredictably upon subsetting if we use colSums.

LTLA commented

While we're on the topic of modifying newSCESet, here's some further suggestions:

  • Set logExprsOffset=NULL, and default it to 1 inside the function. However, if it is NULL and countData is not supplied, raise a warning. This is because a sensible offset for normalized values depends on the size of those values, e.g., adding 1 to CPMs of 0.02 would be grossly inappropriate as you would squeeze the log-fold differences between values towards zero. In such cases, users should select their own offset, probably around 1e6/lib.size (i.e., the CPM corresponding to a count of unity).
  • Set lowerDetectionLimit=NULL and default it to zero inside the function. If it is NULL and only exprsData is supplied, raise a warning, because we don't know what a sensible detection limit is for arbitrary expression values.
  • Should logged just always be TRUE? We could define exprs as log-transformed expression values, which would simplify interpretation later on.
  • Do we really need useForExprs? If users want something else, they can use one of the other getters; they can set exprs_value to the appropriate field; or they can define exprsData manually. From a programming perspective, it's dangerous to be expecting some log-transformed value and then getting CPMs instead.

Thanks for implementing these changes, Aaron. newSCESet is in much better shape thanks to these thoughts and implementation.