/peakachu

Genome-wide contact analysis using sklearn

Primary LanguagePythonMIT LicenseMIT

NOTE: Peakachu (version>=1.1.2) now supports both .hic and .cool formats.

Introduction

Accurately predicting chromatin loops from genome-wide interaction matrices such as Hi-C data is critical to deepening our understanding of proper gene regulation. Current approaches are mainly focused on searching for statistically enriched dots on a genome-wide map. However, given the availability of orthogonal data types such as ChIA-PET, HiChIP, Capture Hi-C, and high-throughput imaging, a supervised learning approach could facilitate the discovery of a comprehensive set of chromatin interactions. Here, we present Peakachu, a Random Forest classification framework that predicts chromatin loops from genome-wide contact maps. We compare Peakachu with current enrichment-based approaches, and find that Peakachu identifies a unique set of short-range interactions. We show that our models perform well in different platforms, across different sequencing depths, and across different species. We apply this framework to predict chromatin loops in 56 Hi-C datasets, and release the results at the 3D Genome Browser.

Citation

Salameh, T.J., Wang, X., Song, F. et al. A supervised learning framework for chromatin loop detection in genome-wide contact maps. Nat Commun 11, 3428 (2020). https://doi.org/10.1038/s41467-020-17239-9

Installation

Peakachu requires Python3 and several scientific packages to run. It is best to first set up the environment using mamba and then install Peakachu from PyPI:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
mamba create -n peakachu cooler numba scikit-learn=1.1.2 joblib=1.1.0
mamba activate peakachu
pip install -U peakachu

Peakachu should now be installed as a command-line tool within the new environment. Options for all peakachu commands and sub-commands can be accessed with the -h option.

peakachu -h
usage: peakachu [-h] {train,score_chromosome,score_genome,depth,pool} ...

Unveil Hi-C Anchors and Peaks.

positional arguments:
  {train,score_chromosome,score_genome,depth,pool}
    train               Train RandomForest model per chromosome
    score_chromosome    Calculate interaction probability per pixel for a chromosome
    score_genome        Calculate interaction probability per pixel for the whole genome
    depth               Calculate the total number of intra-chromosomal chromatin contacts and select the
                        most appropriate pre-trained model for you.
    pool                Print centroid loci from score_genome/score_chromosome output

options:
  -h, --help            show this help message and exit

Example: predicting loops in GM12878 Hi-C

The following example will download an example cooler file containing the GM12878 Hi-C data at the 10kb resolution, train a series of models using H3K27ac HiChIP interactions, and then predict loops using the trained models.

Data preparation

Peakachu requires the contact map to be a .cool file or a .hic file and any training input to be a text file in bedpe format. Example training data can be found at the training-sets subfolder. Cooler files may be found at the 4DN data portal.

wget http://3dgenome.fsm.northwestern.edu/peakachu/test_file/Rao2014-GM12878-MboI-allreps-filtered.10kb.cool

Train a model and predict loops

It is always a good idea to call the help function immediately before entering a command:

peakachu train -h
usage: peakachu train [-h] [-r RESOLUTION] [-p PATH] [--clr-weight-name CLR_WEIGHT_NAME] [-b BEDPE]
                  [-w WIDTH] [--nproc NPROC] [-O OUTPUT]

options:
  -h, --help            show this help message and exit
  -r RESOLUTION, --resolution RESOLUTION
                        Resolution in bp (default 10000)
  -p PATH, --path PATH  Path to a .cool URI string
  --clr-weight-name CLR_WEIGHT_NAME
                        The name of the weight column in your Cooler URI for normalizing the contact
                        signals. Specify it to "raw" if you want to use the raw signals.
  -b BEDPE, --bedpe BEDPE
                        Path to the bedpe file containing positive training set.
  -w WIDTH, --width WIDTH
                        Number of bins added to center of window. default width=5 corresponds to 11x11
                        windows
  --nproc NPROC         Number of worker processes that will be allocated for training. (default 4)
  -O OUTPUT, --output OUTPUT
                        Folder path to store trained models.
peakachu train -r 10000 -p Rao2014-GM12878-MboI-allreps-filtered.10kb.cool --clr-weight-name weight -O models -b gm12878.mumbach.h3k27ac-hichip.hg19.bedpe

This will train 23 random forest models, each labeled by a chromosome. The model for every chromosome was trained using interactions from all the other 22 chromosomes in the provided bedpe file. The purpose of this is to avoid Peakachu to predict loops from the same map it used for training, without overfitting. To use these models, you may either use the score_chromosome function to predict loops in only one chromosome, or the score_genome function to perform a genome-wide prediction.

peakachu score_chromosome -h
usage: peakachu score_chromosome [-h] [-r RESOLUTION] [-p PATH] [--clr-weight-name CLR_WEIGHT_NAME]
                             [-C CHROM] [-m MODEL] [-l LOWER] [-u UPPER] [--minimum-prob MINIMUM_PROB]
                             [-O OUTPUT]

options:
  -h, --help            show this help message and exit
  -r RESOLUTION, --resolution RESOLUTION
                        Resolution in bp (default 10000)
  -p PATH, --path PATH  Path to a .cool URI string
  --clr-weight-name CLR_WEIGHT_NAME
                        The name of the weight column in your Cooler URI for normalizing the contact
                        signals. Specify it to "raw" if you want to use the raw signals.
  -C CHROM, --chrom CHROM
                        Chromosome label. Only contact data within the specified chromosome will be
                        considered.
  -m MODEL, --model MODEL
                        Path to pickled model file.
  -l LOWER, --lower LOWER
                        Lower bound of distance between loci in bins (default 6).
  -u UPPER, --upper UPPER
                        Upper bound of distance between loci in bins (default 300).
  --minimum-prob MINIMUM_PROB
                        Only output pixels with probability score greater than this value (default 0.5)
  -O OUTPUT, --output OUTPUT
                        Output file name.
peakachu score_chromosome -r 10000 -p Rao2014-GM12878-MboI-allreps-filtered.10kb.cool --clr-weight-name weight -O GM12878-chr2-scores.bedpe -C chr2 -m models/chr2.pkl 
peakachu pool -r 10000 -i GM12878-chr2-scores.bedpe -o GM12878-chr2-loops.bedpe -t 0.9

The pool function serves to select the most significant non-redundant results from per-pixel probabilities calculated by the score functions. It is recommended to try different probability thresholds to achieve the best sensitivity-specificity tradeoff. The output is a standard bedpe file with the 7th and the final column containing the predicted probability from the random forest model and the interaction frequency extracted from the contact matrix, respectively, to support further filtering. The results can be visualized in juicebox or higlass by loading as 2D annotations. Here is an example screenshot of predicted GM12878 loops in juicer: Predicted loops from model trained on H3K27ac HiChIP interactions

Using Peakachu as a standard loop caller

Models for predicting loops in Hi-C have been trained using CTCF ChIA-PET interactions, H3K27ac HiChIP interactions, and a high-confidence loop set (loops that can be detected by at least two orthogonal methods from CTCF ChIA-PET, Pol2 ChIA-PET, Hi-C, CTCF HiChIP, H3K27ac HiChIP, SMC1A HiChIP, H3K4me3 PLAC-Seq, and TrAC-Loop) as positive training samples, at a variety of read depths. Simply download the appropriate model file and directly run the score_genome/score_chromosome function if you want to detect chromatin loops on your own Hi-C or Micro-C maps.

If you are using Peakachu>=2.0, please select a model from the following table:

Total intra reads high-confidence (5kb) high-confidence (10kb) high-confidence (25kb)
2 billion total 5kb total 10kb total 25kb
1.8 billion 90% 5kb 90% 10kb 90% 25kb
1.6 billion 80% 5kb 80% 10kb 80% 25kb
1.4 billion 70% 5kb 70% 10kb 70% 25kb
1.2 billion 60% 5kb 60% 10kb 60% 25kb
1 billion 50% 5kb 50% 10kb 50% 25kb
900 million 45% 5kb 45% 10kb 45% 25kb
850 million 42.5% 5kb 42.5% 10kb 42.5% 25kb
800 million 40% 5kb 40% 10kb 40% 25kb
750 million 37.5% 5kb 37.5% 10kb 37.5% 25kb
700 million 35% 5kb 35% 10kb 35% 25kb
650 million 32.5% 5kb 32.5% 10kb 32.5% 25kb
600 million 30% 5kb 30% 10kb 30% 25kb
550 million 27.5% 5kb 27.5% 10kb 27.5% 25kb
500 million 25% 5kb 25% 10kb 25% 25kb
450 million 22.5% 5kb 22.5% 10kb 22.5% 25kb
400 million 20% 5kb 20% 10kb 20% 25kb
350 million 17.5% 5kb 17.5% 10kb 17.5% 25kb
300 million 15% 5kb 15% 10kb 15% 25kb
250 million 12.5% 5kb 12.5% 10kb 12.5% 25kb
200 million 10% 5kb 10% 10kb 10% 25kb
150 million 7.5% 5kb 7.5% 10kb 7.5% 25kb
100 million 5% 5kb 5% 10kb 5% 25kb
50 million 2.5% 5kb 2.5% 10kb 2.5% 25kb
30 million 1.5% 5kb 1.5% 10kb 1.5% 25kb
10 million 0.5% 5kb 0.5% 10kb 0.5% 25kb
5 million 0.25% 5kb 0.25% 10kb 0.25% 25kb

Instead, if you are using an older Peakachu version (<2.0), please select a model from this table:

Total intra reads CTCF Models (10kb) H3K27ac Model (10kb)
2 billion CTCF total H3K27ac total
1.8 billion CTCF 90% H3K27ac 90%
1.6 billion CTCF 80% H3K27ac 80%
1.4 billion CTCF 70% H3K27ac 70%
1.2 billion CTCF 60% H3K27ac 60%
1 billion CTCF 50% H3K27ac 50%
800 million CTCF 40% H3K27ac 40%
600 million CTCF 30% H3K27ac 30%
400 million CTCF 20% H3K27ac 20%
200 million CTCF 10% H3K27ac 10%
30 million CTCF 1.5% H3K27ac 1.5%

To make it clear, let's download another Hi-C dataset:

wget -O SKNAS-MboI-allReps-filtered.mcool -L https://www.dropbox.com/s/f80bgn11d7wfgq8/SKNAS-MboI-allReps-filtered.mcool?dl=0

Peakachu provides a handy function peakachu depth to extract the total number of intra-chromosomal pairs in your data and help you select the most appropriate pre-trained model:

peakachu depth -p SKNAS-MboI-allReps-filtered.mcool::resolutions/1000000

The output of above command will be:

num of intra reads in your data: 141955751
num of intra reads in a human with matched sequencing coverage: 139325229
suggested model: 150 million

Therefore, we recommend using the 7.5% models (trained with ~150 million intra reads) to predict loops on this data.

peakachu score_genome -r 10000 --clr-weight-name weight -p SKNAS-MboI-allReps-filtered.mcool::resolutions/10000 -O SKNAS-peakachu-10kb-scores.bedpe -m high-confidence.150million.10kb.w6.pkl
peakachu pool -r 10000 -i SKNAS-peakachu-10kb-scores.bedpe -o SKNAS-peakachu-10kb-loops.0.95.bedpe -t 0.95

Not just Hi-C

In addition to Hi-C, Peakachu has also been trained on other 3D genomic platforms with good results, including Micrco-C (Krietenstein et al. 2020), DNA SPRITE (Quinodoz et al. 2018), ChIA-PET (Fullwood et al. 2009), HiChIP (Mumbach et al. 2016), TrAC-loop (Lai et al. 2018), and HiCAR (Wei et al. 2022), etc.

If you want to predict loops on HiCAR contact matrices, please select a model from the following table. The models were trained with a high-confidence loop set and a series of downsampled versions of a HiCAR dataset in H1ESC cells. As these models were trained using the raw contact values (rather than the ICE-normalized contact values as we did for Hi-C), please set the parameter "--clr-weight-name" to "raw" when you run "peakachu score_genome" or "peakachu score_chromosome".

2kb models 5kb models 10kb models
300 million 300 million 300 million
275 million 275 million 275 million
250 million 250 million 250 million
225 million 225 million 225 million
200 million 200 million 200 million
175 million 175 million 175 million
150 million 150 million 150 million
125 million 125 million 125 million
100 million 100 million 100 million
90 million 90 million 90 million
80 million 80 million 80 million
70 million 70 million 70 million
60 million 60 million 60 million
50 million 50 million 50 million
40 million 40 million 40 million
30 million 30 million 30 million
20 million 20 million 20 million
10 million 10 million 10 million

If you want to predict loops on TrAC-loop contact matrices, please select a model from the following table. The models were trained with a high-confidence loop set and a series of downsampled versions of a TrAC-loop dataset in H1ESC cells. Again, please set the parameter "--clr-weight-name" to "raw" when you run "peakachu score_genome" or "peakachu score_chromosome".

2kb models 5kb models 10kb models
150 million 150 million 150 million
125 million 125 million 125 million
100 million 100 million 100 million
90 million 90 million 90 million
80 million 80 million 80 million
70 million 70 million 70 million
60 million 60 million 60 million
50 million 50 million 50 million
40 million 40 million 40 million
30 million 30 million 30 million
20 million 20 million 20 million
10 million 10 million 10 million

If you want to predict chromatin loops on CTCF ChIA-PET contact matrices, please select a model from the following table. The models were trained with a high-confidence loop set and downsampled versions of a CTCF ChIA-PET dataset in H1ESC cells. Please set the parameter "--clr-weight-name" to "raw" when you run "peakachu score_genome" or "peakachu score_chromosome".

2kb models 5kb models 10kb models
275 million 275 million 275 million
250 million 250 million 250 million
225 million 225 million 225 million
200 million 200 million 200 million
175 million 175 million 175 million
150 million 150 million 150 million
125 million 125 million 125 million
100 million 100 million 100 million
90 million 90 million 90 million
80 million 80 million 80 million
70 million 70 million 70 million
60 million 60 million 60 million
50 million 50 million 50 million
40 million 40 million 40 million
30 million 30 million 30 million
20 million 20 million 20 million
10 million 10 million 10 million

If you want to predict loops on Pol2 (RNA Polymerase II) ChIA-PET contact matrices, please select a model from the following table. The models were trained with a high-confidence loop set and downsampled versions of a Pol2 ChIA-PET dataset in WTC11 cells. Please set the parameter "--clr-weight-name" to "raw" when you run "peakachu score_genome" or "peakachu score_chromosome".

2kb models 5kb models 10kb models
200 million 200 million 200 million
175 million 175 million 175 million
150 million 150 million 150 million
125 million 125 million 125 million
100 million 100 million 100 million
90 million 90 million 90 million
80 million 80 million 80 million
70 million 70 million 70 million
60 million 60 million 60 million
50 million 50 million 50 million
40 million 40 million 40 million
30 million 30 million 30 million
20 million 20 million 20 million
10 million 10 million 10 million

If you want to predict loops on H3K27ac HiChIP/PLAC-Seq contact matrices, please select a model from the following table. The models were trained with a high-confidence loop set and downsampled versions of a H3K27ac HiChIP dataset in GM12878 cells. Please set the parameter "--clr-weight-name" to "raw" when you run "peakachu score_genome" or "peakachu score_chromosome".

2kb models 5kb models 10kb models
275 million 275 million 275 million
250 million 250 million 250 million
225 million 225 million 225 million
200 million 200 million 200 million
175 million 175 million 175 million
150 million 150 million 150 million
125 million 125 million 125 million
100 million 100 million 100 million
90 million 90 million 90 million
80 million 80 million 80 million
70 million 70 million 70 million
60 million 60 million 60 million
50 million 50 million 50 million
40 million 40 million 40 million
30 million 30 million 30 million
20 million 20 million 20 million
10 million 10 million 10 million

If you want to predict loops on H3K4me3 HiChIP/PLAC-Seq contact matrices, please select a model from the following table. The models were trained with a high-confidence loop set and downsampled versions of a H3K4me3 PLAC-Seq dataset in GM12878 cells. Please set the parameter "--clr-weight-name" to "raw" when you run "peakachu score_genome" or "peakachu score_chromosome".

2kb models 5kb models 10kb models
175 million 175 million 175 million
150 million 150 million 150 million
125 million 125 million 125 million
100 million 100 million 100 million
90 million 90 million 90 million
80 million 80 million 80 million
70 million 70 million 70 million
60 million 60 million 60 million
50 million 50 million 50 million
40 million 40 million 40 million
30 million 30 million 30 million
20 million 20 million 20 million
10 million 10 million 10 million

If you want to predict loops on CTCF HiChIP/PLAC-Seq contact matrices, please select a model from the following table. The models were trained with a high-confidence loop set and downsampled versions of a CTCF HiChIP dataset in GM12878 cells. Please set the parameter "--clr-weight-name" to "raw" when you run "peakachu score_genome" or "peakachu score_chromosome".

2kb models 5kb models 10kb models
200 million 200 million 200 million
175 million 175 million 175 million
150 million 150 million 150 million
125 million 125 million 125 million
100 million 100 million 100 million
90 million 90 million 90 million
80 million 80 million 80 million
70 million 70 million 70 million
60 million 60 million 60 million
50 million 50 million 50 million
40 million 40 million 40 million
30 million 30 million 30 million
20 million 20 million 20 million
10 million 10 million 10 million

If you want to predict loops on SMC1A HiChIP/PLAC-Seq contact matrices, please select a model from the following table. The models were trained with a high-confidence loop set and downsampled versions of a SMC1A HiChIP dataset in GM12878 cells. Please set the parameter "--clr-weight-name" to "raw" when you run "peakachu score_genome" or "peakachu score_chromosome".

2kb models 5kb models 10kb models
225 million 225 million 225 million
200 million 200 million 200 million
175 million 175 million 175 million
150 million 150 million 150 million
125 million 125 million 125 million
100 million 100 million 100 million
90 million 90 million 90 million
80 million 80 million 80 million
70 million 70 million 70 million
60 million 60 million 60 million
50 million 50 million 50 million
40 million 40 million 40 million
30 million 30 million 30 million
20 million 20 million 20 million
10 million 10 million 10 million

Release Notes

Version 2.2 (05/21/2023)

  1. made changes to make sure the behavior of the local clustering algorithm the same at different resolutions.
  2. fixed a bug when the input contact matrix is extremely sparse

Version 2.1 (11/28/2022)

  1. Fixed a bug regarding model training using the raw contact values

Version 2.0 (09/06/2022)

  1. Re-trained the models using the latest scikit-learn v1.1.2
  2. Used the distance-normalized signals instead of original contact signals
  3. Added a 2D Gaussian filter followed by min-max scaling to pre-process each training image
  4. Optimized the computation efficiency using numba and matrix operations.