niaid/dsb

Log transform DSB-normalised corrected ADT counts


Hi @MattPM,

we used dsb to normalize the ADT counts of a CITE-seq dataset (multiple timepoints per patient; DSB normalization was done separately for each timepoint) and it worked pretty well! :) To integrate the different timepoints based on both modalities (ADT + RNA) by WNN, we first need to integrate the DSB-normalised ADT counts and the RNA counts separately using Scanorama. Scanorama requires normalized + log2-transformed data. Hence, we would like to log2(x+1) transform the DSB-normalised counts.

For some antibodies we see negative values. Since log2 transformation of zero/negative values is not possible, we shifted the whole dataset using log2(x + (min(x) * -1 + 1)), i.e. log2(x - min(x) + 1). Most of the negative values in all datasets to be integrated are between 0 and -5. For some datasets, however, a handful of antibodies contain outliers with strongly negative values (< -50). Since the minimum of each dataset is very different, the shifted, log2-transformed DSB-corrected datasets are no longer comparable.
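
For reference, the per-dataset shift we applied looks roughly like this (a minimal R sketch; `adt_dsb` is a hypothetical proteins x cells matrix of DSB-normalised values for one timepoint):

```r
# Hypothetical proteins x cells matrix of DSB-normalised values for one timepoint
adt_dsb <- matrix(rnorm(200, mean = 5, sd = 3), nrow = 10,
                  dimnames = list(paste0("prot", 1:10), paste0("cell", 1:20)))

# Shift so the smallest value becomes 1, then log2-transform:
# log2(x - min(x) + 1) is equivalent to log2(x + (min(x) * -1 + 1))
adt_shifted_log <- log2(adt_dsb - min(adt_dsb) + 1)

# The problem: min(adt_dsb) differs per dataset, so the shift (and hence the
# resulting scale) differs between timepoints that should be comparable.
```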

So how do you handle log2 transformation of DSB-corrected datasets to ensure that ADT values remain comparable between datasets (different timepoints per patient)?

[Histograms of DSB-normalised values: Patient I, timepoint I and timepoint II]

Thanks for your suggestions! :)

Hi @jkniffka Are the two timepoints stained with the same pool of antibodies, and are they from the same or separate batches?

Regardless, instead of integration with Scanorama, it is likely easier to combine the background from both runs and normalize all patients / timepoints together. In our paper we normalized 20 subjects together in a single normalization. If you take a look at Supplemental Fig 9 of the updated preprint, https://www.biorxiv.org/content/10.1101/2020.02.24.963603v3, we show a comparison of normalizing 2 batches (n=10 donors each) separately vs together, with 2 different definitions of background drops, and the values are highly concordant. If you used the same antibody pool and the same staining conditions, the batches should have pretty good overlap.

Even if they don't, for the sake of interpretability, I would recommend something like ComBat or regressing out the batch effect (see the limma removeBatchEffect function, https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/removeBatchEffect) directly on the counts with the batch covariate specified for each cell, rather than using methods designed to stitch together datasets like Scanorama. That way you also don't have to worry about the log2 transformation. Those integration methods are great for integrating different datasets, for example PBMC data from 2 labs on different platforms; if it is just a batch effect you want to account for, they are likely overkill. I would also try using the dsb values directly with WNN; I would not recommend renormalizing the dsb-normalized values.
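
A minimal sketch of these two options, assuming per-timepoint raw ADT count matrices (`raw_cells_tp1`, `raw_background_tp1`, etc. and `isotype_controls` are hypothetical object names; `DSBNormalizeProtein` and `limma::removeBatchEffect` are the functions referenced above):

```r
library(dsb)
library(limma)

# Option 1: normalize all timepoints together in a single dsb run.
# raw_cells_tp* / raw_background_tp* are hypothetical proteins x cells (or
# proteins x empty droplets) raw ADT count matrices for each timepoint.
cells_combined      <- cbind(raw_cells_tp1, raw_cells_tp2)
background_combined <- cbind(raw_background_tp1, raw_background_tp2)

adt_dsb <- DSBNormalizeProtein(
  cell_protein_matrix = cells_combined,
  empty_drop_matrix   = background_combined,
  denoise.counts      = TRUE,
  use.isotype.control = TRUE,
  isotype.control.name.vec = isotype_controls  # vector of isotype control names
)

# Option 2: if a residual timepoint/batch effect remains, regress it out of the
# dsb values directly (one batch label per cell, in column order of adt_dsb).
batch <- c(rep("timepoint1", ncol(raw_cells_tp1)),
           rep("timepoint2", ncol(raw_cells_tp2)))
adt_dsb_corrected <- removeBatchEffect(adt_dsb, batch = batch)

# The resulting dsb values can then be used directly for WNN (e.g. in Seurat)
# without any further log transformation.
```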

For those outliers, how many cells out of the total cells in your experiment are outliers way out at -50? (I'm assuming that is just one protein for the cells that have that value.)
-Matt

For reference, in these papers we also normalized many donors, including multiple samples per donor over time, all together as opposed to separately for each subject:
https://www.cell.com/cell/fulltext/S0092-8674(21)00168-9
https://www.nature.com/articles/s41591-020-0769-8

Thank you for the information. We will try to implement your recommendations!
And regarding the outliers: there are actually only a handful of cells that have such strongly negative values. Most antibodies have a low percentage of negative values, mostly in the single-digit negative range.

I have added a section on this in the documentation.