CCIP is a machine learning method for predicting CTCF-mediated chromatin loops with transitivity.
CTCF-mediated chromatin loops underlie the formation of topologically associating domains (TADs) and serve as the structural basis for transcriptional regulation. However, the formation mechanism of these loops remains unclear, and mapping them genome-wide is costly and difficult.
Motivated by recent progress on the formation mechanism of CTCF-mediated loops, we studied whether transitivity-related information about interacting CTCF anchors can be used to predict CTCF loops computationally. In this context, transitivity arises when two CTCF anchors each interact with the same third anchor through the loop extrusion mechanism, which brings the two anchors close to each other spatially and forms an indirect loop.
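The notion of transitivity described above can be illustrated with a short Python sketch. It is not CCIP's feature computation; it only enumerates anchor pairs that share a third anchor but have no direct loop between them. The anchor names and the `transitive_pairs` helper are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from itertools import combinations


def transitive_pairs(loops):
    """Return anchor pairs that share a neighbor but are not directly looped.

    `loops` is an iterable of (anchor_a, anchor_b) tuples (hypothetical input format).
    """
    neighbors = defaultdict(set)
    for a, b in loops:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Pairs that already have a direct loop are not transitive candidates.
    direct = {frozenset(pair) for pair in loops}

    candidates = set()
    for linked in neighbors.values():
        # Any two anchors linked to the same third anchor form a candidate pair.
        for a, b in combinations(sorted(linked), 2):
            if frozenset((a, b)) not in direct:
                candidates.add((a, b))
    return sorted(candidates)


# Anchors A, B and D all loop with C, so each pair among them is a candidate.
example_loops = [("A", "C"), ("B", "C"), ("C", "D")]
print(transitive_pairs(example_loops))
```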
We propose CCIP (CTCF-mediated Chromatin Interaction Prediction), an accurate and efficient two-stage random-forest-based machine learning method for predicting CTCF-mediated chromatin loops. The two-stage learning approach lets us train a prediction model that uses transitivity-related information alongside functional genomic data and genomic data.
CCIP can be installed on a Linux-like system and requires the following dependencies. We recommend using the Anaconda Python distribution to install the packages below.
- Python (tested 3.6.10)
- numpy (tested 1.18.1)
- pandas (tested 0.24.2)
- matplotlib (tested 3.1.1)
- networkx (tested 2.4)
- scikit-learn (tested 0.22.1)
- joblib (tested 0.14.1)
- bedtools (tested 2.29.2)
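Before running the scripts, it can help to confirm that the dependencies listed above are visible to the current Python environment. The following is a minimal sketch using only the standard library. It checks presence only, not versions; the helper names are hypothetical, and it assumes scikit-learn is importable as `sklearn`.

```python
import importlib.util
import shutil

# Import names for the Python packages listed above
# (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ["numpy", "pandas", "matplotlib", "networkx", "sklearn", "joblib"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def has_bedtools():
    """Return True if a `bedtools` executable is on the PATH."""
    return shutil.which("bedtools") is not None


print("Missing Python modules:", missing_modules(REQUIRED_MODULES))
print("bedtools found:", has_bedtools())
```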
Download CCIP with:
git clone https://github.com/gaolabXDU/CCIP
Required data for the GM12878, HeLa-S3, K562 and MCF-7 cell lines are available in the CCIP/data directory. The CTCF motif and CTCF age data are shared across these cell lines, while the CTCF ChIP-seq and RAD21 ChIP-seq data are specific to each cell line. You can use the shell script CCIP/code/test.sh to test the software. There are three main scripts: generate_pairs.py generates samples, generate_features.py extracts features, and rf_graph.py trains the prediction models.
The script generate_pairs.py is used to generate positive and negative samples for CCIP.
Output: pair_all_balance.csv
Usage: generate_pairs.py [options]
Options:
-h|--help: show this help message and exit
-o|--output_path: Path for output
-c|--ctcf_file: CTCF ChIP-seq data
-m|--ctcf_motif_file: CTCF motif occurrence data
-p|--chia_pet_file: CTCF ChIA-PET data
The script generate_features.py is used to extract features for the samples generated in the previous step.
Output: samples.csv
Usage: generate_features.py [options]
Options:
-h|--help: show this help message and exit
-o|--output_path: Path for output
-c|--ctcf_file: CTCF ChIP-seq data
-m|--ctcf_motif_file: CTCF motif occurrence data
-p|--chia_pet_file: CTCF ChIA-PET data
-r|--rad21_file: RAD21 ChIP-seq data
-a|--age_file: CTCF age data
The script rf_graph.py is used to train the model and perform ten-fold cross-validation.
Output: rf_base.model, rf_graph.model, cross_val_predict_prob.npy
Usage: rf_graph.py [options]
Options:
-h|--help: show this help message and exit
-i|--input_file: Samples for training the model
-o|--output_path: Output path for storing the training results
The script rf_graph_chrom_cv.py is the cross-chromosome validation version of rf_graph.py.
Output: rf_base.model, rf_graph.model, cross_val_predict_prob.npy
Usage: rf_graph_chrom_cv.py [options]
Options:
-h|--help: show this help message and exit
-i|--input_file: Samples for training the model
-o|--output_path: Output path for storing the training results
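The general idea behind cross-chromosome validation is that samples from the same chromosome never appear in both the training and the test split. The exact scheme used by rf_graph_chrom_cv.py is not described here, so the following is only a sketch of one common variant (hold out one chromosome at a time). The `leave_one_chromosome_out` helper and the example labels are hypothetical.

```python
def leave_one_chromosome_out(chroms):
    """Yield (held_out, train_idx, test_idx) for each distinct chromosome.

    `chroms` holds one chromosome label per sample, in sample order.
    """
    for held_out in sorted(set(chroms)):
        # Samples on the held-out chromosome form the test split...
        test_idx = [i for i, c in enumerate(chroms) if c == held_out]
        # ...and all remaining samples form the training split.
        train_idx = [i for i, c in enumerate(chroms) if c != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical chromosome labels for five samples.
sample_chroms = ["chr1", "chr1", "chr2", "chr3", "chr2"]
for held_out, train_idx, test_idx in leave_one_chromosome_out(sample_chroms):
    print(held_out, "train:", train_idx, "test:", test_idx)
```

With scikit-learn available, `sklearn.model_selection.LeaveOneGroupOut` provides the same kind of split when the chromosome labels are passed as `groups`.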
The script rf_graph_test.py is used to predict samples from one cell type using a model trained on another cell type.
Output: %s_%s_ccip_prob.npy (predicted probability of each sample)
Usage: rf_graph_test.py [options]
Options:
-h|--help: show this help message and exit
-o|--output_path: Output path for storing the testing results
-m|--model_path: Model file for predicting
-s|--sample_file: Sample file for predicting
-M|--model_cell: Cell type the model was trained on
-S|--sample_cell: Cell type of the samples to predict
For example, in CCIP/code/, we can run these scripts for the GM12878 cell line:
python generate_pairs.py -c ../data/GM12878/CTCF_peak.bed \
    -m ../data/GM12878/fimo.csv \
    -p ../data/GM12878/gm12878_ctcf.interactions.intra.bedpe \
    -o ../data/GM12878/output
python generate_features.py -c ../data/GM12878/CTCF_peak.bed \
    -m ../data/GM12878/fimo.csv \
    -p ../data/GM12878/gm12878_ctcf.interactions.intra.bedpe \
    -o ../data/GM12878/output \
    -r ../data/GM12878/rad21.narrowPeak \
    -a ../data/GM12878/CTCF_age.bed
python rf_graph.py -i ../data/GM12878/output/sample.csv -o ../data/GM12878/output/model
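The three commands above can also be assembled from Python, which makes it easier to repeat the run for another cell line. This is a minimal sketch rather than part of CCIP: `build_commands` and `run_pipeline` are hypothetical helpers, the flags and file names are copied from the example above, and the features file name is a parameter so it can be matched to whatever generate_features.py actually writes.

```python
import subprocess


def build_commands(data_dir, chia_pet_file, samples_file):
    """Build the argument lists for the three main CCIP scripts."""
    out = data_dir + "/output"
    # Options shared by generate_pairs.py and generate_features.py.
    shared = [
        "-c", data_dir + "/CTCF_peak.bed",
        "-m", data_dir + "/fimo.csv",
        "-p", data_dir + "/" + chia_pet_file,
        "-o", out,
    ]
    return [
        ["python", "generate_pairs.py"] + shared,
        ["python", "generate_features.py"] + shared + [
            "-r", data_dir + "/rad21.narrowPeak",
            "-a", data_dir + "/CTCF_age.bed",
        ],
        ["python", "rf_graph.py", "-i", out + "/" + samples_file, "-o", out + "/model"],
    ]


def run_pipeline(commands):
    """Run each step in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = build_commands(
    "../data/GM12878",
    "gm12878_ctcf.interactions.intra.bedpe",
    "sample.csv",
)
for cmd in commands:
    print(" ".join(cmd))
# Call run_pipeline(commands) from CCIP/code/ to execute the steps.
```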