- Improving dimensionality reduction, clustering, visualization and motif analysis for single-cell ATAC-seq
- multimodal analysis
Public datasets of scATAC-seq:
- Splenocyte (mouse) | original paper (k=12, n=3166, #peaks=77453, FACS)
- Forebrain (mouse) (dense matrix, k=8, n=2088, #peaks=11285, cellular indexing)
- Mouse Atlas (mouse) (sparse matrix, k=30, n=~80,000)
- Leukemia (k=6, n=319, #peaks=7602, Fluidigm C1)
- InSilico (k=6, n=828, #peaks=13668, Fluidigm C1)
- Breast tumor (k=2, n=384, #peaks=27884, FACS)
- GM12878vsHEK (k=2, n=526, #peaks=12938, cellular indexing)
- GM12878vsHL (k=2, n=597, #peaks=10431, cellular indexing)
Existing pipelines for scATAC-seq analysis
- SCALE - source | paper
- cisTopic - source | paper
- WarpLDA: faster than LDA with Collapsed Gibbs Sampling
The running time of LDA is mainly determined by two factors: (a) the per-token sampling complexity (a token is one occurrence of a word), and (b) the size of the randomly accessed memory per document (the time cost is roughly proportional to the average latency of each access). The Collapsed Gibbs Sampling (CGS) algorithm, which cisTopic adopted previously, is too expensive for large datasets because it has O(K) complexity per token (K = the number of topics): it enumerates all K topic assignments. Several existing algorithms (e.g. those based on Metropolis-Hastings (MH)) have reduced (a) from O(K) to O(1) but failed to improve (b), leaving it at O(KV) (V = the size of the vocabulary). Reducing (b) is hard because it is difficult to decouple the accesses to Cd (the topic-document count matrix) and Cw (the topic-word count matrix), since both counts must be updated immediately after each token is sampled. WarpLDA combines MH with a new Monte-Carlo Expectation Maximization (MCEM) algorithm in which both counts stay fixed until all tokens have been sampled. This allows a reordering strategy that decouples the accesses to Cd and Cw, reducing (b) to O(K) while keeping (a) at O(1).
- snapATAC - source | paper
- scABC - source | paper
- Cicero (scATAC version of Monocle, including trajectory inference) - source | paper | Monocle3 (scRNA-seq)
- STREAM (scATAC-seq trajectory inference: raw count matrix -> trajectory) - source | paper | website
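As a rough illustration of the per-token cost described in the WarpLDA note above, here is a minimal Python sketch of one collapsed Gibbs sampling sweep. It is not cisTopic's or WarpLDA's code: the function names (`init_state`, `cgs_sweep`), the extra per-topic total `Ck`, and the hyperparameters `alpha` and `beta` are assumptions made for illustration. `Cd` is stored as documents x topics and `Cw` as topics x words.

```python
import numpy as np


def init_state(docs, K, V, rng):
    """Randomly assign a topic to every token and build the count matrices.

    docs: list of documents, each a list of word ids in [0, V).
    Returns (z, Cd, Cw, Ck); all names are illustrative.
    """
    D = len(docs)
    Cd = np.zeros((D, K), dtype=np.int64)  # document-topic counts
    Cw = np.zeros((K, V), dtype=np.int64)  # topic-word counts
    Ck = np.zeros(K, dtype=np.int64)       # total tokens per topic
    z = []
    for d, words in enumerate(docs):
        zd = rng.integers(K, size=len(words))
        for w, k in zip(words, zd):
            Cd[d, k] += 1
            Cw[k, w] += 1
            Ck[k] += 1
        z.append(zd)
    return z, Cd, Cw, Ck


def cgs_sweep(docs, z, Cd, Cw, Ck, alpha, beta, rng):
    """One collapsed Gibbs sweep over all tokens; counts are updated in place."""
    K, V = Cw.shape
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k_old = z[d][i]
            # Remove the token's current assignment from every count.
            Cd[d, k_old] -= 1
            Cw[k_old, w] -= 1
            Ck[k_old] -= 1
            # O(K) step: score all K topics for this single token.
            p = (Cd[d] + alpha) * (Cw[:, w] + beta) / (Ck + V * beta)
            k_new = rng.choice(K, p=p / p.sum())
            # Both Cd and Cw change right away, so their accesses are coupled.
            z[d][i] = k_new
            Cd[d, k_new] += 1
            Cw[k_new, w] += 1
            Ck[k_new] += 1
```

Each token costs O(K) because every topic is scored, and both `Cd` and `Cw` are read and written for every token. In the cisTopic setting, documents correspond to cells and words to accessible regions.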
scRNA-seq tutorial and benchmarking
- https://scrnaseq-course.cog.sanger.ac.uk/website/biological-analysis.html
- https://github.com/yuchaojiang/ISMB2020_SingleCellTutorial
Labs working on sc analysis
Topic modelling software
- http://www.cs.columbia.edu/~blei/topicmodeling_software.html
- https://github.com/joewandy/hlda
Other useful comments: http://andrewjohnhill.com/blog/2019/05/06/dimensionality-reduction-for-scatac-data/