hashtagassignment
In 10x and other single cell sequencing technologies, antibody barcodes
are routinely added to cells before sequencing in order to multiplex
multiple experiments in a single sequencing run. Existing hash tag
assignment strategies (e.g. HTODemux
) implicitly assume total hash tag
read counts to be identical between cells, but this assumption is false.
We observed a strong correlation between the total hash tag read counts
and the number of unique UMIs. The strategy presented here therefore
doesn’t use hashtag thresholds that are shared across cells, but rather
determines the most likely hashtag on a per cell basis. It removes
doublets and low confident cells by thresholding on i) the log2 ratio of
the maximal hash tag’s read count with the second best hash tag ii) the
‘evenness’ (normalized entropy) of hash tag counts and iii) the total
number of reads assigned to hash tags. The second metric will be low if
one hash tag clearly is the winner and higher if reads are more evenly
distributed across hash tags.
This repo is an extremely premature and developmental stadium. Its name and interface is likely to change in the future, if we continue development.
Installation
if (!require('hashtagassignment')) {
# install.packages("devtools")
# devtools::install("~/libs/hashtagassignment")
devtools::install_github("slagtermaarten/hashtagassignment")
}
library(hashtagassignment)
Example usage
library(dplyr)
data_dir <- '/DATA/users/m.slagter/MirjamHoekstra/raw_exp_5310'
hashtag_counts <- extract_hashtags_from_cellranger(data_dir = data_dir)
## Define thresholds
## Cells need the winning hash tag to be at least 2^2 (=4) times larger than the
## second best hash hag
fd_thresh = 2
## Cells can have a hashtag evenness of at most .5
evenness_thresh = .5
## Minimal amount of hash tag reads cells need to have
read_thresh = 0
stats <- compute_hashtag_stats(hashtag_counts,
fd_thresh = fd_thresh,
evenness_thresh = evenness_thresh,
read_thresh = read_thresh)
## Filter for cells that do meet all of the criteria
if (exists('seurat_object')) {
# seurat_object <- load_seurat('exp5310')
# rownames(seurat_object@meta.data) <-
# names(seurat_object@active.ident) <-
# gsub('5310_', '', rownames(seurat_object@meta.data))
filtered_seurat_object <- filter_seurat_stats(seurat_object, stats)
}
Select cells to serve as examples for different levels of evenness ([0, .1, .2, …]). Notice how the distributions get more fuzzy as evenness increases.
max_evenness <- floor(max(stats$hashtag_evenness) * 10) / 10
hashtag_evenness_examples <- purrr::map_dfr(seq(0, max_evenness, by = .1),
function(e) {
idx <- order((stats$hashtag_evenness - e)^2)[1:2]
stats %>%
{ .[idx, ] } %>%
dplyr::select(sample_id,
matches('HTO|human_Hashtag'), hashtag_evenness, fd_cc)
}) %>%
unique %>%
dplyr::rename_with(.fn = function(x) gsub('(HTO\\d+)_.*', '\\1', x)) %>%
dplyr::rename_with(.fn = function(x) gsub('human_Hashtag', 'HTO', x)) %>%
dplyr::mutate(hashtag_evenness = round(hashtag_evenness, 2)) %>%
dplyr::mutate(fd_cc = round(fd_cc, 2)) %>%
{ . }
hashtag_evenness_examples$selected <-
with(hashtag_evenness_examples,
is.na(fd_cc) | fd_cc >= fd_thresh,
hashtag_evenness <= evenness_thresh)
knitr::kable(hashtag_evenness_examples)
sample_id | HTO1 | HTO2 | HTO3 | HTO4 | HTO5 | HTO6 | HTO7 | hashtag_evenness | fd_cc | selected |
---|---|---|---|---|---|---|---|---|---|---|
AGCGTCGGTACCGTCG | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | NA | TRUE |
TTGCGTCGTACTCGAT | 187 | 0 | 0 | 0 | 0 | 0 | 0 | 0.00 | NA | TRUE |
CATGCGGTCCGTCCTA | 10 | 3 | 1 | 0 | 4 | 565 | 2 | 0.10 | 5.69 | TRUE |
GGATCTACATTCACAG | 14 | 2 | 0 | 3 | 0 | 511 | 1 | 0.10 | 5.09 | TRUE |
AGGGAGTGTTCATCGA | 20 | 4 | 0 | 1 | 2 | 298 | 2 | 0.20 | 3.83 | TRUE |
GTGGTTAGTGAGTAAT | 10 | 12 | 407 | 1 | 2 | 5 | 4 | 0.20 | 4.97 | TRUE |
GATCGTAGTCGCTTGG | 109 | 1 | 1 | 3 | 1 | 14 | 0 | 0.30 | 2.87 | TRUE |
CACAGATTCAACTGGT | 47 | 4 | 391 | 2 | 9 | 6 | 1 | 0.30 | 3.03 | TRUE |
GATTGGTAGAACTCCT | 56 | 4 | 1 | 169 | 0 | 1 | 5 | 0.40 | 1.58 | FALSE |
ATGTCTTCACGTATAC | 7 | 3 | 1 | 0 | 0 | 0 | 33 | 0.40 | 2.09 | TRUE |
ATCGTAGAGAGCAGCT | 13 | 4 | 1 | 85 | 3 | 6 | 3 | 0.50 | 2.62 | TRUE |
TTTACTGCAACTGTGT | 6 | 3 | 0 | 3 | 1 | 0 | 31 | 0.50 | 2.19 | TRUE |
GTGCTGGCAGATGCGA | 6 | 0 | 1 | 0 | 2 | 2 | 0 | 0.60 | 1.22 | FALSE |
TTGATGGTCCATCGTC | 104 | 3 | 26 | 4 | 131 | 9 | 0 | 0.60 | 0.33 | FALSE |
ATCTTCAGTGGTACAG | 12 | 9 | 0 | 1 | 1 | 4 | 1 | 0.70 | 0.38 | FALSE |
AGGCCACAGTACAACA | 5 | 1 | 0 | 1 | 2 | 1 | 0 | 0.70 | 1.00 | FALSE |
CGAGTTACAGAACTCT | 21 | 17 | 2 | 19 | 1 | 11 | 1 | 0.80 | 0.14 | FALSE |
TCAAGCAAGAAACCAT | 7 | 2 | 0 | 1 | 1 | 3 | 2 | 0.79 | 1.00 | FALSE |
Create panel of plots to assess what kind and how many cells will be filtered out
library(gridExtra)
library(ggplot2)
library(patchwork)
p1 <- stats$total_hashtag_reads %>% log10 %>%
qplot(bins = 100) + xlab('log10(# hashtag reads)')
p2 <- ggplot(mapping = aes(x = hashtag_evenness, y = fd_cc),
data = stats) +
geom_hex(bins = 150) +
xlab('Hashtag evennness') +
geom_hline(yintercept = fd_thresh) +
geom_vline(xintercept = evenness_thresh) +
ylab('log2 fold difference\n of winning\n with closest contender') +
ggplot2::theme(legend.position = c(.95, .95), legend.justification = c(1, 1))
p4 <- ggplot(mapping = aes(x = hashtag_evenness), data = stats) +
geom_histogram(bins = 100) +
geom_vline(xintercept = evenness_thresh) +
xlab('Hashtag evennness') +
ylab('Number of cells')
p5 <- stats %>%
dplyr::group_by(fd_crit, evenness_crit, read_crit) %>%
dplyr::count() %>%
dplyr::mutate(frac = round(n / nrow(stats), 3)) %>%
gridExtra::tableGrob(theme = gridExtra::ttheme_default(base_size = 6))
(p1 + p2) / (p4 + p5)
Comparison with HTODemux
seurat_object <- load_seurat('exp5310', appendix = '')
hashtag_counts <- extract_hashtags_from_cellranger(data_dir = data_dir)
seurat_object <- RenameCells(seurat_object,
old.names = rownames(seurat_object@meta.data),
new.names = gsub('5310_', '', rownames(seurat_object@meta.data)))
shared_cells <- intersect(colnames(seurat_object), colnames(hashtag_counts))
seurat_object <- seurat_object[, shared_cells]
hashtag_counts <- hashtag_counts[, shared_cells]
seurat_object[['HTO']] <- CreateAssayObject(counts = hashtag_counts)
seurat_object <- NormalizeData(seurat_object,
assay = 'HTO', normalization.method = 'CLR')
seurat_object <- HTODemux(seurat_object, assay = 'HTO', positive.quantile = 0.99)
table(seurat_object$HTO_classification.global)
##
## Doublet Negative Singlet
## 340 236 2380
HTO_class <- seurat_object@meta.data %>%
dplyr::select(HTO_classification, HTO_classification.global, nCount_RNA) %>%
rownames_to_column('sample_id')
merged_stats <- stats %>%
inner_join(HTO_class, by = 'sample_id')
Tally of cells deemed fit by HTODemux
and/or hashtagassignment
. For
this dataset, hashtagassignment
‘loses’ 102 cells and ‘gains’ 365 as
compared to HTODemux
. Thus, 263/3204 more cells remain for further
analysis with hashtagassignment
.
merged_stats %>%
dplyr::select(all_crit, HTO_classification.global) %>%
dplyr::mutate(HTODemux = HTO_classification.global == 'Singlet') %>%
dplyr::rename(hashtagassignment = all_crit) %>%
dplyr::group_by(hashtagassignment, HTODemux) %>%
dplyr::summarize(N = n()) %>%
knitr::kable()
hashtagassignment | HTODemux | N |
---|---|---|
FALSE | FALSE | 211 |
FALSE | TRUE | 102 |
TRUE | FALSE | 365 |
TRUE | TRUE | 2278 |
Three examples of cells that are labeled as ‘Singlets’ by HTODemux
but
deemed unassignable to a sample by hashtagassignment
.
merged_stats %>%
dplyr::filter(all_crit == FALSE & HTO_classification.global == 'Singlet') %>%
dplyr::sample_n(3) %>%
knitr::kable()
sample_id | fd_cc | hashtag_evenness | dominant_hashtag | total_hashtag_reads | human_Hashtag1 | human_Hashtag2 | human_Hashtag3 | human_Hashtag4 | human_Hashtag5 | human_Hashtag6 | human_Hashtag7 | fd_crit | evenness_crit | read_crit | all_crit | HTO_classification | HTO_classification.global | nCount_RNA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GTGTAACGTCTTACTT | 2.277 | 0.534 | 5 | 88 | 12 | 2 | 1 | 2 | 62 | 6 | 3 | TRUE | FALSE | TRUE | FALSE | human-Hashtag5 | Singlet | 7136 |
CGGAATTTCCAAGCCG | 0.830 | 0.611 | 1 | 28 | 15 | 1 | 2 | 8 | 0 | 2 | 0 | FALSE | FALSE | TRUE | FALSE | human-Hashtag4 | Singlet | 5079 |
AAACGAATCTTGCAGA | 0.138 | 0.646 | 1 | 24 | 10 | 3 | 0 | 1 | 9 | 1 | 0 | FALSE | FALSE | TRUE | FALSE | human-Hashtag5 | Singlet | 2335 |
Three examples of cells that are not labeled as ‘Singlets’ by HTODemux
but deemed assignable to a sample by hashtagassignment
.
merged_stats %>%
dplyr::filter(all_crit == TRUE & HTO_classification.global != 'Singlet') %>%
dplyr::sample_n(3) %>%
knitr::kable()
sample_id | fd_cc | hashtag_evenness | dominant_hashtag | total_hashtag_reads | human_Hashtag1 | human_Hashtag2 | human_Hashtag3 | human_Hashtag4 | human_Hashtag5 | human_Hashtag6 | human_Hashtag7 | fd_crit | evenness_crit | read_crit | all_crit | HTO_classification | HTO_classification.global | nCount_RNA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TCCATCGAGTTGGACG | 5.53 | 0.190 | 5 | 497 | 9 | 9 | 9 | 2 | 462 | 4 | 2 | TRUE | TRUE | TRUE | TRUE | human-Hashtag3_human-Hashtag5 | Doublet | 82089 |
TAACACGAGTTGTAAG | 5.09 | 0.187 | 1 | 72 | 67 | 1 | 1 | 1 | 1 | 0 | 1 | TRUE | TRUE | TRUE | TRUE | Negative | Negative | 957 |
GAGCCTGTCCGTCAAA | 2.29 | 0.370 | 3 | 553 | 88 | 1 | 433 | 6 | 18 | 3 | 4 | TRUE | TRUE | TRUE | TRUE | human-Hashtag3_human-Hashtag5 | Doublet | 78372 |
Choosing between two evils
HTODemux
is appropriate when the major and only factor driving hash
tag reads is intrinsic to the hash tags: some barcode antibodies might
be more easily read-off than others (for instance due to GC-content
differences between their barcodes). The current hashtagassignment
library rather currently assumes no such intrinsic differences but
rather assumes hash tag read counts to be primarily driven by an
extrinsic, cell-specific factors that determine the total amount of hash
tag reads, which in turn is related to the total amount of detected UMIs
per cell. This factor could be explained by the amount of reagents
available to the single cell in the droplet or the cell’s overall state
of ‘happiness’.
Looking at hash tag count distributions in our dataset, there indeed
seem to be readability differences between hash tags: some hash tags are
higher in read counts than others. This argues for HTODemux
’s
approach.
as.data.frame(t(hashtag_counts)) %>%
reshape2::melt() %>%
dplyr::filter(value > 10) %>%
ggplot(aes(x = variable, y = value, fill = variable)) +
geom_violin() +
ggpubr::stat_compare_means() +
ylab('Hash tag read count') +
xlab('') +
guides(fill = 'none')
There is, however, a relationship between the amount of detected UMIs
and total hash tag reads, arguing that cell intrinsic factors also play
a role, albeit weaker than I initially thought. HTODemux
currently
does not consider this source of bias and could benefit from doing so.
An easy, perhaps naive, way of implementing this would be to first
compute cell intrinsic factors, correct hash tag counts for these
factors, and then run HTODemux
as usual.
ggplot(merged_stats, aes(x = nCount_RNA, y = total_hashtag_reads)) +
geom_point() +
geom_smooth(method = 'lm') +
ylab('Total hash tag read count') +
xlab('Total RNA UMI count')