Add warning if cpm supplied as well as counts to newSCESet()
Closed this issue · 3 comments
In hierarchy of expression values, cpm will be computed from counts, and recomputed in subsequent normalization steps.
We should add a section in ?newSCESet
detailing this behaviour. Namely, the hierarchy of values is:
tpm = counts > cpm > fpkm > exprs
where values lower in the hierarchy will be computed from values higher in the hierarchy. Values higher in the hierarchy will never be computed from those lower in the hierarchy. If two values are supplied (excepting tpm
and counts
), a warning will be raised and only the higher-level value will be retained. Otherwise the values might not be consistent and would change upon recalculation.
Recalculation is done by normalize.SCESet
and uses the highest-level value that is available. So, for example, exprs
will be calculated from counts
if they are available, but will be computed from cpm
or fpkm
if not. We should also use the total_counts
in the QC metrics to compute the CPMs, otherwise they might change unpredictably upon subsetting if we use colSums
.
While we're on the topic of modifying newSCESet
, here's some further suggestions:
- Set
logExprsOffset=NULL
, and default it to 1 inside the function. However, if it isNULL
andcountData
is not supplied, raise a warning. This is because a sensible offset for normalized values depends on the size of those values, e.g., adding 1 to CPMs of 0.02 would be grossly inappropriate as you would squeeze the log-fold differences between values towards zero. In such cases, users should select their own offset, probably around1e6/lib.size
(i.e., the CPM corresponding to a count of unity). - Set
lowerDetectionLimit=NULL
and default it to zero inside the function. If it isNULL
and onlyexprsData
is supplied, raise a warning, because we don't know what a sensible detection limit is for arbitrary expression values. - Should
logged
just always beTRUE
? We could defineexprs
as log-transformed expression values, which would simplify interpretation later on. - Do we really need
useForExprs
? If users want something else, they can use one of the other getters; they can setexprs_value
to the appropriate field; or they can defineexprsData
manually. From a programming perspective, it's dangerous to be expecting some log-transformed value and then getting CPMs instead.
Thanks for implementing these changes, Aaron. newSCESet
is in much better shape thanks to these thoughts and implementation.