scATAC-seq public datasets

Improving dimensionality reduction, clustering, visualization and motif analysis for single-cell ATAC-seq
multimodal analysis

Public datasets of scATAC-seq:

10x
Buenrostro_2018 (GM12878/HEK293T?)
Cusanovich_2018 (GM12878/HL-60?)
Cusanovich - Mouse sci-ATAC-seq Atlas
Dataset used in SCALE
- Splenocyte (mouse) | original paper (k=12, n=3166, #peaks=77453, FACS)
- Forebrain (mouse) (dense matrix, k=8, n=2088, #peaks=11285, cellular indexing)
- Mouse Atlas (mouse) (sparse matrix, k=30, n=~80,000)
- Leukemia (k=6, n=319, #peaks=7602, Fluidigm C1)
- InSilico (k=6, n=828, #peaks=13668, Fluidigm C1)
- Breast tumor (k=2, n=384, #peaks=27884, FACS)
- GM12878vsHEK (k=2, n=526, #peaks=12938, cellular indexing)
- GM12878vsHL (k=2, n=597, #peaks=10431, cellular indexing)

Benchmarking - source | paper

Existing pipelines for scATAC-seq analysis

SCALE - source | paper
cisTopic - source | paper
- WarpLDA: faster than LDA with Collapsed Gibbs Sampling

The running time of LDA is mainly determined by two factors: (a) the sampling complexity per token (a token is one occurrence of a word), and (b) the size of the randomly accessed memory per document (the time cost is roughly proportional to the average latency of each access). The Collapsed Gibbs Sampling (CGS) algorithm, which cisTopic adopted previously, is too expensive for large datasets because it enumerates all K topic assignments for every token, giving O(K) complexity per token (K = the number of topics). Several existing algorithms (e.g. those based on Metropolis-Hastings (MH)) have reduced (a) from O(K) to O(1) but failed to improve (b), leaving it at O(KV) (V = the size of the vocabulary).

Reducing (b) is hard because the accesses to Cd (the count matrix of topic-document assignments) and Cw (the count matrix of topic-word assignments) are difficult to decouple: both counts need to be updated immediately after each token is sampled. WarpLDA combines MH with a new Monte-Carlo Expectation Maximization (MCEM) algorithm in which both counts stay fixed until all tokens have been sampled. On top of this, the authors designed a reordering strategy that decouples the accesses to Cd and Cw, reducing (b) to O(K) while keeping (a) at O(1).
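To make the O(K) per-token cost of CGS concrete, here is a minimal sketch of one collapsed Gibbs sweep in plain Python. It is not the cisTopic or WarpLDA implementation. The function names (`init_counts`, `cgs_sweep`), the toy corpus, and the hyperparameters `alpha` and `beta` are all assumptions made for illustration. Note how every token scores all K topics and how Cd and Cw are both updated right after each draw, which is the coupling described above.

```python
import random

def init_counts(docs, K, V, rng):
    """Assign a random topic to every token and build the count tables."""
    z = [[rng.randrange(K) for _ in words] for words in docs]
    Cd = [[0] * K for _ in docs]       # topic counts per document
    Cw = [[0] * K for _ in range(V)]   # topic counts per word
    Ck = [0] * K                       # total tokens per topic
    for d, words in enumerate(docs):
        for w, k in zip(words, z[d]):
            Cd[d][k] += 1
            Cw[w][k] += 1
            Ck[k] += 1
    return z, Cd, Cw, Ck

def cgs_sweep(docs, z, Cd, Cw, Ck, alpha, beta, rng):
    """One collapsed Gibbs sweep over all tokens (counts updated in place)."""
    K, V = len(Ck), len(Cw)
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            # Remove the token's current assignment from all counts.
            Cd[d][k] -= 1
            Cw[w][k] -= 1
            Ck[k] -= 1
            # O(K) step: compute an unnormalized weight for every topic.
            weights = [
                (Cd[d][t] + alpha) * (Cw[w][t] + beta) / (Ck[t] + V * beta)
                for t in range(K)
            ]
            k = rng.choices(range(K), weights=weights)[0]
            # Both Cd and Cw must be updated immediately after the draw.
            z[d][i] = k
            Cd[d][k] += 1
            Cw[w][k] += 1
            Ck[k] += 1

# Toy corpus (hypothetical): 3 documents over a vocabulary of 5 word ids.
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 4, 1]]
K, V = 2, 5
rng = random.Random(0)
z, Cd, Cw, Ck = init_counts(docs, K, V, rng)
for _ in range(20):
    cgs_sweep(docs, z, Cd, Cw, Ck, alpha=0.1, beta=0.01, rng=rng)
```

In the scATAC-seq setting used by cisTopic, a "document" corresponds to a cell and a "word" to an accessible region, so the inner loop runs once per accessible region per cell.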

snapATAC - source | paper
scABC - source | paper
Cicero (scATAC version of Monocle, including trajectory inference) - source | paper | Monocle3 (scRNA-seq)
STREAM (scATAC-seq trajectory inference: raw count matrix -> trajectory) - source | paper | website

Batch effects control

scRNA-seq tutorial and benchmarking

Labs working on sc analysis

Topic modelling software: http://www.cs.columbia.edu/~blei/topicmodeling_software.html https://github.com/joewandy/hlda

Other useful comments: http://andrewjohnhill.com/blog/2019/05/06/dimensionality-reduction-for-scatac-data/

hfldai/scATAC-analysis-resources
