niaid/dsb

differential expression analysis: log2FC on negative DSB-normalized values

Closed this issue · 9 comments

Hi @MattPM,

Thank you for your work on this package - it has been a large improvement over other normalization methods. I am interested in integrating the DSB-normalized values with Seurat V4 for downstream, differential expression analysis. The differential expression is done using the FindMarkers( ) function (https://satijalab.org/seurat/reference/findmarkers
), which calculates the avg_log2FC between our two conditions. The problem is that the equation for avg_log2FC cannot handle negative values, and the DSB-normalized values we got out of the pipeline include negative values.

Do you have any suggestions for feeding the DSB-normalized values through Seurat's FindMarkers ( ) workflow?

Thank you!

Hi @cyc2145,

Sorry for the slight delay in responding and glad to hear dsb worked well for you. You can use the normalized values from dsb directly with Seurat 4 using Seurat::FindMarkers There are quite a few methods for calculating differential expression with the FindMarkers function. The most simple is a Wilcox test which is just a ratio so the sign of the values doesn't matter (see code below). Some methods like Seurat::FindMarkers(object, method = 'poisson') are for modeling RNA UMI counts directly with a poisson distribution; there the negative values wont work for example--but you wouldn't want to do that with any normalized values anyway. Overall, most methods in FindMarkers will work directly with dsb values.

Here is an example workflow below with simple processing and clustering of the example data then calculating the log fold change between 2 clusters with FindMarkers Let Me know if that works for you.

library(Seurat)
library(dsb)

dsb = dsb::DSBNormalizeProtein(cell_protein_matrix = cells_citeseq_mtx,
                               empty_drop_matrix = empty_drop_citeseq_mtx, 
                               denoise.counts = TRUE, 
                               use.isotype.control = TRUE, 
                               isotype.control.name.vec = rownames(cells_citeseq_mtx)[67:70])

# create object 
s = CreateSeuratObject(counts = cells_citeseq_mtx, assay = "ADT")
s@assays$ADT@data = dsb

# cluster 
prot = setdiff(rownames(s@assays$ADT@data), rownames(s@assays$ADT@data)[67:70])
dmat = dist(t(as.matrix(s@assays$ADT@data)))
s[["adt_snn"]] = FindNeighbors(dmat, nn.eps = 1)$snn
s = FindClusters(s, graph.name = "adt_snn", n.start = 5,n.iter = 5, 
                 algorithm = 3, resolution = 2, random.seed = 1990)

# QC clusters with average heatmap 
adt_data = cbind(s@meta.data, as.data.frame(t(as.matrix(s@assays$ADT@data)))) %>% 
  dplyr::group_by(seurat_clusters) %>% 
  dplyr::summarize_at(.vars = prot, .funs = median) %>%
  tibble::remove_rownames() %>% 
  tibble::column_to_rownames("seurat_clusters") 

pheatmap::pheatmap(t(adt_data),
                   color = viridis::viridis(10, option = "B"), 
                   fontsize_row = 3, border_color = NA)


# DE between 2 clusters with Seurat::FindMarkers using dsb normalized values   
Idents(s) = 'seurat_clusters'
pmark = FindMarkers(s,ident.1 = '0', ident.2 = '15', assay = "ADT",slot = "data")


# visualize distributions 
pmark = pmark %>% 
  tibble::rownames_to_column('prot') %>% 
  dplyr::mutate(nlp = -log10(p_val_adj))

library(ggplot2)
ggplot(pmark, aes(x = avg_log2FC, y = nlp, label = prot)) + 
  geom_point() + 
  ggrepel::geom_text_repel(size = 2)

image

image

Hi Matt,

Thank you for your thorough response. This has worked for me. Do you have any suggestions to filter out antibodies with low ADT counts? In our dataset, we have a lot of antibodies where the median normalized value hovers ~0 and we would like these to be filtered out. Is this as simple as setting a threshold?

Thanks again!

@cyc2145 yes most larger antibody panels will have some proteins that did not stain well so those can be handled in a number of ways. for differential expression, most single cell R packages will have a option for testing only a subset of features (i.e. genes or proteins). With Seurat and FindMarkers you specify a subset of genes or proteins to test with the features argument https://www.rdocumentation.org/packages/Seurat/versions/4.0.2/topics/FindMarkers

As above that would be:

feature_subset = c('CD28_PROT', CD123_PROT')
pmark = FindMarkers(s,ident.1 = '0', ident.2 = '15', assay = "ADT",slot = "data", features = feature_subset)

There is an analogous argument in clustering algorithms as well for Bioconductor and Seurat so if you have a handful of proteins without strong staining you can cluster the cells without those proteins. I'd check out the documentation for those functions its usually an argument like features

Hi, Matt:

A somewhat related question, I ran the WNN using dsb normalized values as in the tutorial, then visualize in heatmap. I noticed in adt_plot heatmap, some of the cluster markers are isotype controls. How is this possible? Should I filter for isotype controls during prots=rownames(seurat.obj@assays$CITE@data)[1:200] ?

Thanks for your time

It is not unusual to see high isotype control levels on some small clusters, for example are these small clusters that could be doublets or cell debris? Often in single cell analysis, despite using a number of quality control steps on the data up front, usually more doublets or low quality cells are revealed by downstream analysis like clustering etc. Some other things you can look into: How many cells are in these clusters? Are there high isotype control levels on all the clusters or just a few? Is the average level influenced by outliers, i.e. what is the range of the isotype controls on these clusters.

Furthermore what is the actual average level on the dsb scale of these isotype controls in the clusters that appear positive? If it is lower than say 3 or 4 then I wouldn't be so concerned since one of the benefits of this normalization method is it allows you to set the same threshold on each protein that is interpretable, so we've used 3.5 in our datasets; anything less than 3.5 means the protein's expression was <3.5 s.d. from the background noise average (+/- adjustment for the cell's technical component); we don't consider expression <3.5 "positive" whereas the positive cells often have a given protein level ranging from ~7 to ~25 (depending on the dataset).

If you have a dataset with a large panel with 200 proteins, it might be worth trying the WNN method using dsb normalized values, but using the default implementation of the algorighm which is to compress the protein data into PCs first. On smaller panels, I've found on a couple datasets compressing the proteins into PCs was not as clean as just using the dsb normalized protein values in the joint model. However, with 200 proteins it could be useful to use a latent representation (principal components) of the protein data for joint clustering--there are code examples for both versions in the main readme.

Thanks @MattPM for your detailed explanation. I did use WNN with dsb normalized values. I got 24 clusters. There are 6 isotype controls. I list cluster number and dsb-normalized value for each isotype control. As you can see the first isotype control has high value. Any insights on why is this? Thanks for your time.

cluster Mouse-IgG2a-kapa-isotype Mouse-IgG1-kapa-isotype Rat-IgG1-kapa-isotype Rat-IgG2a-kapa-Isotype Rat-IgG2b-kapa-Isotype Armenian-Hamster-IgG-Isotype
0 19.2706383 -0.204212707 -0.076987342 0.220289008 -0.016518254 0.040738561
1 30.64981213 -0.297555679 -0.036957571 -0.027957282 -0.13763905 0.098183678
2 20.80745972 -0.267433694 -0.03426126 0.113676663 -0.027071821 0.198823751
3 22.2768805 -0.269099451 0.006872468 0.084844487 -0.091600355 0.213709625
4 20.15281605 -0.223031898 0.00116706 0.242784783 -0.070368197 0.095431377
5 21.5553478 -0.264490986 0.044469213 0.066698204 -0.016834572 0.302089905
6 22.0094134 -0.250543312 0.064722291 0.114242354 0.01124725 0.210936344
7 20.18378488 -0.249005152 -0.060795512 0.139648133 -0.093760222 0.104891275
8 29.16079777 -0.293654343 -0.016442684 -0.052190087 -0.11876174 0.136453012
9 21.06341203 -0.241592334 0.015427778 0.176029584 -0.073348113 0.148008491
10 21.78553354 -0.267169332 0.034452925 0.100874427 0.013159554 0.239126817
11 28.02085752 -0.287254269 0.004093953 -0.054948394 -0.091026777 0.147607799
12 31.75300624 -0.298906987 -0.029576888 -0.071940513 -0.181698369 0.042183718
13 19.87421362 -0.212415843 -0.131392118 0.192073051 -0.103501708 0.119638524
14 20.89966478 -0.245965876 -0.021987074 0.18436226 -0.086572228 0.126136646
15 32.40187905 -0.294755848 -0.027240344 -0.067776111 -0.177592777 0.07406379
16 29.4500741 -0.265287214 -0.034879435 0.001418771 -0.181127854 0.170680966
17 27.94585843 -0.268015409 0.031918943 -0.046192279 -0.054010969 0.118595751
18 31.23966666 -0.275490327 -0.005215045 -0.025623814 -0.232706428 0.080773442
19 29.01119962 -0.30587782 0.016562467 0.021301219 -0.157620684 0.081145163
20 22.05639628 -0.232266519 0.108774798 0.129531889 0.050799755 0.137904576
21 28.15778728 -0.261985877 -0.021448202 0.008043543 -0.143944263 0.29475713
22 34.33514197 -0.289651331 -0.116244934 -0.133321388 -0.093528996 0.298268807
23 28.39816295 -0.302720096 0.008605994 0.007758559 -0.15031475 0.090615059

That looks very strange with the single isotype being extremely universally high and all other isotypes being low, that doesnt have to do with the clustering and I don't think it has to do with normalization. It is really hard to know what is going on there without looking at the data but you will need to look at some QC stuff on your raw data. It almost seems like that protein is not an isotype control.

  • Can you print the function call you used on dsb, as well as what the vector of isotype control names was that you supplied to dsb?

  • What is the maximum value of the isotype controls in the cells apply(cells_citeseq_mtx, 1, FUN = max)

  • Did the other proteins seem coherent with expected cell types?

  • Also I'd check out the raw distributions to look into that example on your data with
    apply(cells_citeseq_mtx, 1, FUN = mean) and
    apply(cells_citeseq_mtx, 1, FUN = sd)
    where cells_citeseq_mtx is your raw data (the cells with proteins as rows). Look at that isotype control raw UMI values, does it look like the others? You can also try
    qplot(apply(cells_citeseq_mtx, 1, mean), apply(cells_citeseq_mtx, 1, sd))

to see where that strange isotype stands relative to other proteins. You can do this on the empty drops as well,

Let us know what you find.

Hi, @MattPM:

I re-run the steps to generate dsb_norm_prot and the isotype value seems normal now.

I noticed in your tutorial, you removed isotype control in Clustering cells based on dsb normalized protein using Seurat section

cluster and run umap (based directly on dsb normalized values without istype controls ).

Will you suggest to remove isotype control for WNN that directly use dsb-normalized value as well?

I also found for 200 proteins, WNN with direct dsb-normalized value vs WNN with PCA compressed dsb values, the UMAP looks quite different. What would you suggest to evaluate which clustering is better?

Thanks

image

Sounds like the issue above was resolved. Yes I would recommend not using the isotopes with clustering. For evaluating what clustering is "better" I think that is going to depend on what biology is revealed, this is a little outside the scope of this normalization package and more a general biological analysis question. It depends on your question, what you are trying to find, what your goals are with the single cell analysis etc. I would not recommend putting too much weight into what the umap looks like though. I would put more weight into differential expression, whether different clusters emerge that have some biological meaning for your analysis, and overlap with known biology to make sure results are overall what you would expect. For your data with many proteins it is possible there is a benefit to the compression with PCA in the default WNN algorithm, we've found either way the results have been improved by using dsb for normalization.