/scHiC_notes

Notes on single-cell Hi-C technologies, tools, and data

MIT License


Single-cell 3D genomics notes. Please, contribute and get in touch! See MDmisc notes for other programming and genomics-related notes.


Tools

  • SnapHiC - scHi-C analysis pipeline. Identifies chromatin loops at 10kb resolution. Imputes contact probability with the random walk with restart algorithm (scHiCluster method, considering the effective fragment size, GC content, mappability, details in Methods), distance-normalizes, applies the paired t-test using global and local background to identify loop candidates, groups the loop candidates using Rodriguez and Laio's algorithm, and identifies summits within each cluster. Considers global and local background to filter out false positives. Tested on 742 mouse embryonic stem cells, sn-methyl-3C-seq data from 2,869 human prefrontal cortical cells. Compared with HiCCUPS, discovers 4-70 times more cell-type-specific loops, achieves better F1, peak enrichment in APA analysis, CTCF convergent orientation, also detects known long-range interactions. Links putative target genes and non-coding sequence variants associated with neuropsychiatric disorders. Ground truth for benchmarking: HiCCUPS loops plus long-range interactions from PLAC-seq and HiChIP experiments from mESCs (MAPS pipeline). Compared with Hi-C-FastHiC, FitHiC2, HiC-ACT, on downsampled data.

    Paper Yu, Miao, Armen Abnousi, Yanxiao Zhang, Guoqiang Li, Lindsay Lee, Ziyin Chen, Rongxin Fang, et al. “SnapHiC: A Computational Pipeline to Identify Chromatin Loops from Single-Cell Hi-C Data.” Nature Methods, August 26, 2021. https://doi.org/10.1038/s41592-021-01231-2.

    Li, Xiaoqi, Lindsay Lee, Armen Abnousi, Miao Yu, Weifang Liu, Le Huang, Yun Li, and Ming Hu. “SnapHiC2: A Computationally Efficient Loop Caller for Single Cell Hi-C Data.” Computational and Structural Biotechnology Journal 20 (2022): 2778–83. https://doi.org/10.1016/j.csbj.2022.05.046. - SnapHiC2, fast reimplementation using a sliding window approach for random walk with restart. Enables data processing at 5kb resolution.
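
The random walk with restart imputation mentioned for SnapHiC and scHiCluster can be sketched as below. This is a minimal illustration and not the code of either tool: the function name, the added self-loops, the restart probability of 0.5, and the stopping rule are all assumptions chosen for the example.

```python
import numpy as np

def random_walk_with_restart(contacts, restart_prob=0.5, tol=1e-6, max_iter=100):
    """Impute a symmetric contact matrix by random walk with restart (illustrative)."""
    a = np.asarray(contacts, dtype=float)
    n = a.shape[0]
    # Assumption: add self-loops so that bins without contacts still have a valid row.
    a = a + np.eye(n)
    # Row-normalize counts into a transition probability matrix.
    p = a / a.sum(axis=1, keepdims=True)
    identity = np.eye(n)
    q = identity.copy()
    for _ in range(max_iter):
        # Walk one step with probability (1 - restart_prob), otherwise restart.
        q_next = (1.0 - restart_prob) * (q @ p) + restart_prob * identity
        delta = np.linalg.norm(q_next - q)
        q = q_next
        if delta < tol:
            break
    return q

# Toy 4-bin matrix: bins 0 and 2 are not in direct contact, bin 3 is empty.
toy = np.array([[0, 2, 0, 0],
                [2, 0, 1, 0],
                [0, 1, 0, 0],
                [0, 0, 0, 0]], dtype=float)
imputed = random_walk_with_restart(toy)
```

After imputation, bins 0 and 2 receive a non-zero value through their shared neighbor, while the isolated bin 3 stays disconnected from the rest.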

Normalization

  • scHiCNorm - scHi-C normalization using regression against known biases (cutting site density, mappability, GC content) using six distributions. Filter out cells with fewer than 50,000 uniquely mapped reads, merge cells, 1Mb resolution. Ramani 2017 data, 74 matrices. Correlations are assumed to be driven by biases, and decrease in between-dataset correlation and increase in variability is judged as good.

Clustering

  • BandNorm and 3DVI methods for normalizing and denoising single-cell Hi-C data. BandNorm - an R package, distance-centric band normalization approach, improves cell clustering. 3DVI - deep generative modeling framework using Poisson and Negative Binomial distributions to model scHi-C counts, accounting for library size and batch effect for each band matrix, learns low-dimensional representation of scHi-C data, denoises and enables 3D compartment identification (uses scvi-tools). Compared against library size scaling methods (global CellScale and local BandScale), and scHiCluster, scHiC Topics, Higashi (Table 1 - overview of 8 methods total). Evaluated clustering ARI/silhouette on Ramani2017, Kim2020, Lee2019, and Li2019 scHi-C datasets, 1Mb resolution (Supplementary Table 1). Differential TAD boundaries detection evaluated on TADcompare, diffHiC, CHESS using concordance with bulk data. Tweet

  • Fast-Higashi - scHi-C analysis for precise single cell clustering (rare cell type identification), trajectory inference, differential contact analysis (meta-interactions). Models scHi-C data using tensor decomposition (PARAFAC2 joint factorization algorithm, decomposes chromosome-specific 3-way tensors into four factors, Figure 1, Methods). Partial random walk with restart to impute the data. Applied to three 500kb scHi-C datasets (Tan et al. 2021, Liu et al. 2021, Lee et al. 2019, and more). Compared with 3DVI, scHiCluster, Higashi (modularity score, ARI, adjusted mutual information, F1 scores), improves detection of rare cell types, trajectories, cell type-specific connections, aggregated A/B compartment analysis. Fast. Can initialize Higashi for better performance. Python/PyTorch code on Zenodo.

    Paper Zhang, Ruochi, Tianming Zhou, and Jian Ma. “Ultrafast and Interpretable Single-Cell 3D Genome Analysis with Fast-Higashi.” Cell Systems 13, no. 10 (October 2022): 798-807.e6. https://doi.org/10.1016/j.cels.2022.09.004.
  • Higashi - hypergraph representation learning for scHi-C embedding (learning node embedding of the hypergraph) and imputation (predicting missing hyperedges within the hypergraph). Whole scHi-C dataset as a hypergraph, with cell nodes and genomic bin nodes. Uses Hyper-SAGNN architecture. Imputation by borrowing information from k-nearest neighbors in the embedding space. Detects TAD-like structures. Applied to 4D Nucleome, Ramani, Nagano scHi-C data. Outperforms HiCRep/MDS, scHiCluster, and LDA in imputation and cell clustering. Can incorporate other omics modalities, and shows improved performance on single-nucleus methyl-3C (sn-m3C-seq) scHi-C and methylation data in human prefrontal cortex cells. Robust to downsampling. A/B compartments detection improved after imputation. Improved detection of TAD-like structures using insulation scores, genes associated with variable boundaries. Methods and supplementary detail network structure, input as triplets of attributes of one cell node and two genomic bin nodes, loss function, training.

  • Hyper-SAGNN - self-attention based graph neural network applicable to homogeneous and heterogeneous hypergraphs. Applied to scHi-C Ramani and Nagano data. Compared with DeepWalk, LINE, and HEBE. Not compared with hyper2vec and node2vec. Outperforms HiCRep+MDS and scHiCluster in measuring scHi-C similarity. Demo of other applications.

  • Zhang, Ruochi, Yuesong Zou, and Jian Ma. "Hyper-SAGNN: a self-attention based graph neural network for hypergraphs." arXiv preprint (November 6, 2019).

  • scHiCTools - a set of tools for high-level analysis (clustering) of scHi-C data. Projects single cells in a lower-dimensional Euclidean space. Three methods for smoothing scHi-C data (linear convolution, random walk, network enhancing), three projection methods (fastHiCRep, Selfish, newly developed InnerProduct), three embedding methods if assuming cells come from a continuous manifold (MDS, t-SNE, PHATE), or three clustering methods if assuming cells are from different clusters (k-means, spectral clustering, HiCluster). Brief Methods of each approach. Tested on Nagano 2017 cell-cycle dataset. InnerProduct captures cell similarity well, any embedding works well, linear convolution and random walk improve projections at high dropout rates. QC plots. ACROC - area under the curve of a circular ROC calculation. Input - text matrices. Python 3.

  • scHiCluster - single-cell Hi-C clustering algorithm based on imputation using linear convolution (neighborhood smoothing within a window of size 1 over 1Mb scHi-C matrices) and random walk with restarts. scHi-C challenges: variability, sparsity, coverage heterogeneity. Two-step imputation to resolve sparsity, top-ranked interactions after imputation to resolve heterogeneity. Tested on simulated (from bulk Hi-C controlling for sparsity, and pseudobulk) and experimental (Ramani, four human cell lines; Flyamer, mouse zygotes and oocytes; Nagano) scHi-C data. Against PCA, HiCRep+MDS, the eigenvector method, the decay profile method. Adjusted Rand Index to test clustering quality. TAD-like structures can be detected in imputed data (TopDom). At least 5k contacts per cell is sufficient. Python package. Input - sparse matrices, 1Mb resolution, or juicer-pre format for custom resolution.
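
The linear convolution step described for scHiCluster (neighborhood smoothing within a window of size 1) can be illustrated with the sketch below. It is not taken from the scHiCluster package: the function name is hypothetical, and averaging over a zero-padded neighborhood (rather than summing, or handling edges differently) is an assumption made for the example.

```python
import numpy as np

def neighborhood_smooth(contacts, window=1):
    """Replace each entry by the mean of its (2*window+1)^2 neighborhood."""
    a = np.asarray(contacts, dtype=float)
    n_rows, n_cols = a.shape
    # Assumption: pad with zeros so that edge bins have a full neighborhood.
    padded = np.pad(a, window, mode="constant")
    size = 2 * window + 1
    out = np.zeros_like(a)
    # Accumulate every shifted copy of the matrix that overlaps the window.
    for di in range(size):
        for dj in range(size):
            out += padded[di:di + n_rows, dj:dj + n_cols]
    return out / size ** 2

# A single contact of 9 reads in the center of a 3x3 matrix spreads evenly.
sparse = np.zeros((3, 3))
sparse[1, 1] = 9
smoothed = neighborhood_smooth(sparse, window=1)
```

In this toy case every bin pair lies within one bin of the center, so each entry of the smoothed matrix equals 1.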

TAD calling

  • DeDoc2 - scHi-C hierarchical TAD caller. Two variants, deDoc2.w and deDoc2.s, to predict higher and lower level TAD-like domains (TLDs). Minimizes structural entropy of the whole chromosome or of a sliding window. Benchmarked on downsampled, simulated, and experimental scHi-C data, against Higashi, scHiCluster, deTOKI, SpectralTAD, deDoc, GRiNCH, Insulation Score. Robust to noise, no need for data imputation.
    Paper Li, Angsheng, Guangjie Zeng, Haoyu Wang, Xiao Li, and Zhihua Zhang. “DeDoc2 Identifies and Characterizes the Hierarchy and Dynamics of Chromatin TAD-Like Domains in the Single Cells.” Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 10, no. 20 (July 2023): e2300366. https://doi.org/10.1002/advs.202300366.

3D modeling

  • DPDchrom - reconstruction of the 3D chromatin conformation from single-cell Hi-C data. Relies on dissipative particle dynamics (DPD). Incorporates expectation whether the conformation should be coil-like or globular (at the resolution of 10kb and lower). Explicitly accounts for solvent. Compared with the Stevens method, classical molecular dynamics (CMD) method. Benchmarked on artificial polymer models, DPDchrom performs better at low contact density (up to 95% accuracy). On experimental data - up to 65% accuracy. Proposes the Modified Jaccard Index (Methods) to compare 3D structures irrespective of spatial orientation and scale. Many practical aspects and parameters affecting reconstruction accuracy, data sparsity exponentially affects accuracy. S2 Table - list of single nucleus Hi-C datasets, S1 Appendix - Details of simulation methods and analysis, ORBITA protocol for snHi-C. Tweet by Pavel Kos

Simulation

  • scHi-CSim - a single-cell Hi-C simulator (Python), estimates statistical properties from experimental data and generates simulated data closely resembling experimental (cell type information, biological functions, enhancer-promoter interactions, loops, their statistical significance). Used for clustering benchmarking.
    Paper Fan, Shichen, Dachang Dang, Yusen Ye, Shao-Wu Zhang, Lin Gao, and Shihua Zhang. “scHi-CSim: A Flexible Simulator That Generates High-Fidelity Single-Cell Hi-C Data for Benchmarking.” Edited by Luonan Chen. Journal of Molecular Cell Biology 15, no. 1 (June 1, 2023): mjad003. https://doi.org/10.1093/jmcb/mjad003.

TAD detection

  • scKTLD - TAD-like domain identification on single-cell Hi-C data using graph analysis. Hi-C contact matrix as the adjacency matrix, embeds the graph into a low-dimensional space, then applies kernel-based changepoint detection, optimized with Pruned Exact Linear Time (PELT). Four types of TAD detection methods, review of single-cell-specific. Experimental bulk (GM12878, K562, downsampled), single-cell Hi-C data, simulated data. ChIP-seq data (CTCF, Rad21, Smc3, H3K4me3) to justify biological relevance. Methods, math. Two hyperparameters, the dimension of the embeddings (128 deemed optimal), the penalty constant in changepoint detection. Normalization (KR or ICE) decreases performance. Compared with 7 TAD callers, including single-cell-specific deTOKI, scHiCluster, and Higashi. Comparison of TAD sets - adjusted mutual information, measure of concordance, TAD-adjR2. Enrichment in CTCF signal (within 500kb up/down flanking), compactness of TADs (the distribution of IFs within TADs). Boundaries in single-cell Hi-C data are heterogeneous irrespective of cell type, but tend to overlap with boundaries in bulk Hi-C data.
    Paper Liu, Erhu, Hongqiang Lyu, Yuan Liu, Laiyi Fu, Xiaoliang Cheng, and Xiaoran Yin. “Identifying TAD-like Domains on Single-Cell Hi-C Data by Graph Embedding and Changepoint Detection,” https://doi.org/10.1093/bioinformatics/btae138
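
The role of the penalty constant in changepoint detection can be illustrated with a simplified sketch. Note the substitutions: scKTLD is described as using a kernel-based cost on graph embeddings optimized with PELT, whereas the code below uses a squared-error cost on a one-dimensional signal and the plain dynamic program (optimal partitioning) that PELT accelerates by pruning. All function names are hypothetical.

```python
import numpy as np

def segment_cost(prefix, prefix_sq, start, end):
    """Sum of squared deviations from the mean for signal[start:end]."""
    total = prefix[end] - prefix[start]
    total_sq = prefix_sq[end] - prefix_sq[start]
    return total_sq - total * total / (end - start)

def penalized_changepoints(signal, penalty):
    """Exact penalized segmentation, O(n^2), without the PELT pruning step."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    prefix = np.concatenate(([0.0], np.cumsum(x)))
    prefix_sq = np.concatenate(([0.0], np.cumsum(x * x)))
    best = np.full(n + 1, np.inf)  # best[t]: minimal cost of segmenting x[:t]
    best[0] = -penalty             # so the first segment is not penalized
    last = np.zeros(n + 1, dtype=int)
    for end in range(1, n + 1):
        for start in range(end):
            cost = best[start] + segment_cost(prefix, prefix_sq, start, end) + penalty
            if cost < best[end]:
                best[end] = cost
                last[end] = start
    # Backtrack from the end to recover segment boundaries.
    boundaries = []
    pos = n
    while pos > 0:
        pos = last[pos]
        if pos > 0:
            boundaries.append(int(pos))
    return boundaries[::-1]

# Toy signal with three flat domains of five bins each.
signal = [0] * 5 + [10] * 5 + [0] * 5
boundaries = penalized_changepoints(signal, penalty=1.0)
```

A larger penalty yields fewer boundaries, which is why it acts as the main tuning knob for domain size in this kind of approach.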

Papers

  • Galitsyna, Aleksandra A, and Mikhail S Gelfand. “Single-Cell Hi-C Data Analysis: Safety in Numbers.” Briefings in Bioinformatics, August 18, 2021
    • Single-cell Hi-C review, technology overview, analysis steps, challenges, tools. Mapping (split-read alignment, iterative mapping, read clipping, ORBITA), filtering spurious contacts, cells. Analysis, from 3D structure reconstruction, imputation, embedding, to clustering, pseudobulk analysis and A/B compartments/TADs calling, deconvolution.

  • Zhou, Tianming, Ruochi Zhang, and Jian Ma. “The 3D Genome Structure of Single Cells.” Annual Review of Biomedical Data Science, (July 20, 2021) - Review of scHi-C technologies, computational methods. Table 1 - technologies (proximity ligation-based (e.g., sci-Hi-C, Dip-C), ligation-free (e.g., scSPRITE, ChIA-Drop), imaging-based (e.g., Oligopaint, OligoFISSEQ, hiFISH, HIPMap)), number of cells, depth. Data processing (demultiplexing, alignment, binning, filtering, storage, tool - scHiCExplorer), dimensionality reduction (HiCRep + MDS, scHiCluster, hypergraph-based Higashi + Hyper-SAGNN), imputation (scHiCluster, Higashi). Challenges in 3D structure modeling, compartment annotation, domain/loop identification. Multi-way interaction analysis methods (MIA-Sig, MATCHA).

  • Li, Xiao, Ziyang An, and Zhihua Zhang. “Comparison of Computational Methods for 3D Genome Analysis at Single-Cell Hi-C Level.” Methods, August 2019 - Assessment of Hi-C methods applied to single-cell Hi-C data. Overview of computational analysis of Hi-C data (normalization, A/B compartment, TAD, loop calling, differential analysis), scRNA-seq data properties. Tested on systematically downsampled data and on experimental scHi-C data. HiCNorm performs best for normalization, Insulation Score and FastHiC for TAD and loop calling, respectively. A/B compartments are poorly defined in scHi-C data, TADs can be identified at single-cell level, aggregation improves TAD detection. Adjusted mutual information and weight similarity for TAD similarity assessment. Other methods, like TAD boundary prediction from epigenomic features.
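
Several benchmarks above rely on systematically downsampled Hi-C data. One common way to do this, shown here as an assumption and not as the procedure of any specific paper, is binomial thinning: each read is kept independently with a fixed probability. The function name and the use of a fixed seed are illustrative.

```python
import numpy as np

def downsample_contacts(contacts, fraction, seed=0):
    """Keep each contact independently with probability `fraction` (binomial thinning)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(contacts, dtype=np.int64)
    # Thin only the upper triangle (with diagonal) so the result stays symmetric.
    upper = np.triu(a)
    kept = rng.binomial(upper, fraction)
    return kept + np.triu(kept, k=1).T

# Toy symmetric matrix of integer contact counts.
bulk_like = np.array([[4, 2, 0],
                      [2, 6, 3],
                      [0, 3, 5]])
half_depth = downsample_contacts(bulk_like, fraction=0.5)
```

The input is assumed to be a symmetric matrix of raw integer counts. Normalized or imputed matrices are not suitable for this kind of thinning.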

Clustering, embedding

  • Kim, Hyeon-Jin, Galip Gürkan Yardımcı, Giancarlo Bonora, Vijay Ramani, Jie Liu, Ruolan Qiu, Choli Lee, et al. “Capturing Cell Type-Specific Chromatin Compartment Patterns by Applying Topic Modeling to Single-Cell Hi-C Data.” PLOS Computational Biology, (September 18, 2020) - Topic modeling (Latent Dirichlet allocation, LDA) on sciHi-C data. 4D Nucleome datasets, 500kb resolution, newly generated data from GM12878, H1ESC, HFF, IMR90, HAP1 cells, >19,000 cells. Preprocessing and converting 500kb scHi-C matrices to locus-pairs (LPs), then tSNE. Cell-topics, LP-topics representation. Topics can capture A/B compartments. LDA using the cisTopic package, procedure for selecting the number of topics. Comparison with scHiCluster, similar performance.

  • Liu, Jie, Dejun Lin, Galip Gürkan Yardimci, and William Stafford Noble. “Unsupervised Embedding of Single-Cell Hi-C Data.” Bioinformatics, (July 1, 2018) - Embedding of scHi-C data. HiCRep with MDS performs best. Contact Probability Function as a means to compare Hi-C matrices. Methods for evaluating reproducibility also can be used to compare matrices, details of HiCRep, GenomeDISCO, HiC-Spector methods. Description of scHi-C datasets and their arrangement by cell cycle stage. 5K total reads per scHi-C matrix is sufficient for proper embedding.
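
The contact probability function mentioned above summarizes a Hi-C matrix by how contacts are distributed over genomic distance. A minimal sketch follows. The exact definition used in the paper may differ; here it is assumed to be the fraction of all upper-triangle contacts that fall at each bin distance, and the function name is hypothetical.

```python
import numpy as np

def contact_probability_by_distance(contacts):
    """Fraction of contacts at each genomic distance (in bins), illustrative definition."""
    a = np.asarray(contacts, dtype=float)
    n = a.shape[0]
    # Sum of the d-th diagonal = number of contacts between bins d apart.
    per_distance = np.array([np.trace(a, offset=d) for d in range(n)])
    total = per_distance.sum()
    if total == 0:
        return per_distance
    return per_distance / total

# Toy symmetric 3-bin matrix: most contacts are at short distance.
cell = np.array([[4, 2, 1],
                 [2, 4, 2],
                 [1, 2, 4]])
cpf = contact_probability_by_distance(cell)
```

Two cells can then be compared by a distance between their profiles, which is one way such a function can serve to order cells, for example by cell cycle stage.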

Technologies, data

scHi-C multi-omics

  • HiRES technology, Hi-C and RNA-seq employed simultaneously. Single-cell Hi-C and RNA-seq profiling from the same cells. Single-cell 3D structures depend on cell cycle but also diverge in cell type-specific manner. Interactions between B compartments increase during development. 3D changes occur before transcriptional changes. Brain cells and developing mouse embryos, between day 7 (E7.0) and E11.5. 20kb resolution, agrees with Dip-C. SimpleDiff pipeline for differential chromatin interaction analysis (Wilcoxon on distance-specific Z-score-transformed contacts between groups of cells), excitatory vs. inhibitory adult mouse brain neuron analysis. GSE223917 - processed data, description. Processing Scripts, Python, R, command line.
    Paper Liu, Zhiyuan, Yujie Chen, Qimin Xia, Menghan Liu, Heming Xu, Yi Chi, Yujing Deng, and Dong Xing. “Linking Genome Structures to Functions by Simultaneous Single-Cell Hi-C and RNA-Seq.” Science 380, no. 6649 (June 9, 2023): 1070–76. https://doi.org/10.1126/science.adg3797.
  • Lee, Dong-Sung, Chongyuan Luo, Jingtian Zhou, Sahaana Chandran, Angeline Rivkin, Anna Bartlett, Joseph R. Nery, et al. “Simultaneous Profiling of 3D Genome Structure and DNA Methylation in Single Human Cells.” Nature Methods, September 9, 2019 - sn-m3C-seq - single-nucleus methyl-3C sequencing, extension of snmC-seq2 method, DpnII digestion, fluorescence-activated nuclei sorting, followed by bisulfite conversion. Cell types can be distinguished by hierarchical clustering (mouse cell types, 4238 human prefrontal cortex cells separated into 14 populations - Astro, Endo, L2/3, L4, L5, L6, MG, MP, Ndnf, ODC, OPC, Pvalb, Sst, Vip, originating from two donors with ages of 21 and 29 years and in a total of five sequencing libraries). TAURUS-MH pipeline, outperforms BWA-METH. sn-m3C-seq methylation correlates well with bulk and single-cell methylation measures. More Hi-C contacts than published datasets. Comparing brain cell subpopulations, chromatin interactions overlap, methylation differs, hypomethylation is associated with increased interactions, differential domain boundaries are associated with differential methylation. mESC data (raw FASTQ, >600 samples, >60Gb), human brain data (raw FASTQ, >4K samples, >700Gb), .cool 10Mb resolution files. Protocol. Interactive methylation data, Hi-C data. Code scripts, TAURUS-MH pipeline, Twitter

  • Li, Guoqiang, Yaping Liu, Yanxiao Zhang, Naoki Kubo, Miao Yu, Rongxin Fang, Manolis Kellis, and Bing Ren. “Joint Profiling of DNA Methylation and Chromatin Architecture in Single Cells.” Nature Methods, August 5, 2019 - Methyl-HiC - in situ Hi-C and WGBS. mESC cells cultured in serum and leukemia inhibitory factor (LIF) condition (serum mESCs: serum 1 and serum 2) and mESCs cultured in LIF with GSK3 and MEK inhibitors (2i) condition. Comparable Hi-C matrices, TADs. 20% fewer CpGs overall, more CpGs in open chromatin. Proximal CpGs correlate irrespectively of loop anchors, weaker for inter-chromosomal interactions. Application to single-cell, mouse ESCs under different conditions. Relevant clustering, cluster-specific genes. Methods for wet-lab and computational processing. Bulk (replicates) and single-cell Methyl-HiC data. Scripts, Bhmem pipeline to map bisulfite-converted reads, Juicer pipeline for processing, VC normalization, HiCRep at 1Mb matrix similarity.
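
The distance-specific Z-score transform mentioned for the SimpleDiff pipeline can be sketched as follows. This is an illustration and not the HiRES processing code: the function name is hypothetical, and leaving constant diagonals at zero is an assumption. The transformed values of a given bin pair could then be compared between two groups of cells with a rank-based test such as `scipy.stats.ranksums`.

```python
import numpy as np

def distance_zscore(contacts):
    """Z-score each diagonal (all bin pairs at the same genomic distance) separately."""
    a = np.asarray(contacts, dtype=float)
    n = a.shape[0]
    z = np.zeros_like(a)
    for d in range(n):
        idx = np.arange(n - d)
        band = a[idx, idx + d]
        std = band.std()
        # Assumption: diagonals without variation carry no signal and stay at zero.
        if std > 0:
            scores = (band - band.mean()) / std
            z[idx, idx + d] = scores
            z[idx + d, idx] = scores
    return z

# Toy symmetric 4-bin contact matrix.
one_cell = np.array([[5, 3, 1, 0],
                     [3, 6, 2, 1],
                     [1, 2, 7, 4],
                     [0, 1, 4, 8]])
z_scores = distance_zscore(one_cell)
```

Normalizing within each distance removes the strong decay of contact frequency with genomic separation, so that values from short-range and long-range pairs become comparable.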

Imaging

  • MERFISH - Super-resolution imaging technology, reconstruction of 3D structure in single cells at 30kb resolution, 1.2Mb region of Chr21 in IMR90 cells. Distance maps obtained by microscopy show small distance for loci within, and larger between, TADs. TAD-like structures exist in single cells. 2.5Mb region of Chr21 in HCT116 cells, cohesin depletion does not abolish TADs, only alters their preferential positioning. Multi-point (triplet) interactions are prevalent. TAD boundaries are highly heterogeneous in single cells. Diffraction-limited and STORM (stochastic optical reconstruction microscopy) imaging. GitHub

  • Single-cell level massively multiplexed FISH (MERFISH, sequential genome imaging) to measure 3D genome structure in context of gene expression and nuclear structures. Approx. 650 loci, 50kb resolution, on chr21 10.4-46.7Mb from the hg38 genome assembly, IMR90 cells, population average from approx. 12K chr21 copies, multiple rounds of hybridization. Investigation of TADs, A/B compartments, 87% agreement with bulk Hi-C. Association with cell type markers, transcription. Genome-scale imaging using barcodes, 1041 30kb loci covering autosomes and chrX of IMR90, over 5K cells, 5 replicates. Processed multiplexed FISH data and more, TXT format, GitHub

  • Parser of multiplexed single-cell imaging data from Bintu et al. 2018 and Su et al. 2020 - Takes 3D coordinates of the regions as input and writes the distance and contact matrices for these datasets.
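
Converting 3D coordinates of imaged regions into distance and contact matrices, as the parser above does, can be sketched in a few lines. This is not the parser's code: the function name and the contact threshold are hypothetical, and the threshold is expressed in whatever units the coordinates use.

```python
import numpy as np

def distance_and_contact_matrices(coords, contact_threshold):
    """Pairwise Euclidean distances and a binary contact map from 3D coordinates."""
    xyz = np.asarray(coords, dtype=float)
    # Broadcasting gives all pairwise coordinate differences at once.
    diff = xyz[:, None, :] - xyz[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=-1))
    # Assumption: two regions are "in contact" if closer than the threshold.
    contacts = (distances < contact_threshold).astype(int)
    np.fill_diagonal(contacts, 0)
    return distances, contacts

# Three regions of one chromosome trace (arbitrary units).
trace = [[0.0, 0.0, 0.0],
         [3.0, 4.0, 0.0],
         [0.0, 0.0, 1.0]]
distances, contacts = distance_and_contact_matrices(trace, contact_threshold=2.0)
```

Regions that were not detected in a given cell are typically stored as missing values. With NaN coordinates the corresponding distances are NaN and the pairs are reported as not in contact, which may or may not be the desired handling.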