HiC-Reg is a novel approach to predict contact counts from one-dimensional regulatory signals such as epigenetic marks and regulatory protein binding. HiC-Reg provides a powerful framework to generate high-resolution profiles of contact counts that can be used to study individual locus level interactions as well as higher-order organizational units of the genome.
Zhang, S., Chasman, D., Knaack, S. et al. In silico prediction of high-resolution Hi-C interaction matrices. Nat Commun 10, 5449 (2019). https://doi.org/10.1038/s41467-019-13423-8
Program in Scripts/aggregateSignalInRegion/
aggregateSignal REGION hg19.fa.fai countfile outputfile
aggregateSignal hg19_5kbp_chr17.txt hg19.fa.fai wgEncodeBroadHistoneGm12878CtcfStdRawDataRep1_chr17.counts wgEncodeBroadHistoneGm12878CtcfStdRawDataRep1_chr17.txt
- REGION: file of 5kb bin with format like hg19_5kbp_chr17.txt
- countfile: four columns count file generated from bam files
- hg19_5000bp_chr17.txt
chr17 CONVERT gene 0 4999 . + . chr17_0_4999
chr17 CONVERT gene 5000 9999 . + . chr17_5000_9999
chr17 CONVERT gene 10000 14999 . + . chr17_10000_14999
- count file: wgEncodeBroadHistoneGm12878CtcfStdRawDataRep1_chr17.counts
chr17 342 389 1
chr17 1005 1056 1
chr17 1060 1111 1
chr17 1346 1397 1
Column1: chromosome Column2: start position Column3: end position Column4: reads (tab deliminated)
Program in Scripts/genPairFeatures/
./genDatasetsRH HiCsparseMatrix maxdistance foldcv splittype[regionwise|pairwise] featureinputfile correlation[yes|no] outputdir prerandomize_pairs[yes|no] featype[Window|Pconcat]
- HiCsparseMatrix: sparse hic matrix
- maxdistance: max genomic distance for pair of regions (e.g. 1000000)
- foldcv: number of CV folds (e.g. 5)
- splittype:
- regionwise--split the total regions in the sparse hic matrix into N folds.
- pairwise--split the total pairs in the sparse hic matrix into N folds.
- featureinputfile: input file with path for each feature signal, see Gm12878_norm_featurefiles_test.txt for example.
- correlation: calculate the correlation of features in region1 and features in region2 or not. (e.g. no)
- outputdir: path for output directory.
- prerandomize_pairs: pre-randomize the pairs in the sparse hic matrix or not. (e.g. yes)
- featype:
- Pconcat: generate feature signal for region1 and region2.
- Window: generate feature signal for region1 and region2, and average feature signal for the window between these two regions.
- HiCsparseMatrix: Gm12878_chr17_5kb_SQRTVC_counts_pairs_100.tab
chr17_0_5000 chr17_5000_10000 485.3207854051
chr17_0_5000 chr17_10000_15000 212.4988640932
chr17_0_5000 chr17_15000_20000 130.1059543547
Column1: region1 Column2: region2 Column3: HiC count (tab deliminated)
- featureinputfile: Gm12878_norm_featurefiles_test.txt
H3k4me1 Gm12878_RawData_5000bp_seqdepth_norm_H3k4me1.txt C
H3k4me2 Gm12878_RawData_5000bp_seqdepth_norm_H3k4me2.txt C
Column1: mark name Column2: feature signal file Column3: source data type, C for Chip-seq data, M for motif data (tab deliminated)
./genDatasetsRH Gm12878_chr17_5kb_SQRTVC_counts_pairs_100.tab 1000000 5 regionwise Gm12878_norm_featurefiles_test.txt no out/ yes Window
Program in Scripts/genMULTICELLfeats/
python gen-MULTICELL-features.py --inpath ./ --chr 17 --ncv 2 --outpath ./
- --inpath: path of input feature data
- --chr: chromosome of input feature data
- --ncv: number of CV folds
- --outpath: path of output MULTI-CELL feature data
WINDOW feature data generated by Step 1.1 for five cell lines: Gm12878, K562, Huvec, Hmec, Nhek. Examples in Scripts/genMULTICELLfeats:
- Gm12878_chr17_5kb_test.txt
- K562_chr17_5kb_test.txt
- Huvec_chr17_5kb_test.txt
- Hmec_chr17_5kb_test.txt
- Nhek_chr17_5kb_test.txt
./regForest -t Examples/Data/Gm12878_chr17_WINDOW_train0.txt -o Examples/out/ -k1 -l10 -n20 -b Examples/Data/prior_window.txt -d Examples/Data/Gm12878_chr17_WINDOW_test0.txt
./regForest -t Examples/Data/Gm12878_chr17_WINDOW_train0.txt -o out/ -k1 -l10 -n20 -b Examples/Data/prior_window.txt -d Examples/Data/Gm12878_chr17_WINDOW_test0.txt -s Examples/out/regtree_node
- -t is the training data
- -o is the output directory
- -k is maxfactorsize (1 should be OK)
- -l is leaf size (e.g. 10)
- -n is the number of trees (e.g. 20)
- -b is the prior structure (potential regulators), see Examples/Data/prior_window.txt
- -d is the test data
- -s is the prefix of saved models (e.g. Examples/out/regtree_node)
- training and test data genreated by Step 1: (tab deliminated)
Pair H3k4me1_E H3k4me2_E H3k4me1_P H3k4me2_P H3k4me1_W H3k4me2_W Distance Count
chr17_0_5000-chr17_10000_15000 0.714904 0.435131 3.76434 1.3964 0.83982 0.418721 5000 5.36363
chr17_0_5000-chr17_15000_20000 0.714904 0.435131 1.73117 0.869143 2.30208 0.90756 10000 4.87601
chr17_0_5000-chr17_25000_30000 0.714904 0.435131 0.881374 0.708816 1.73773 0.78015 20000 4.76591
chr17_0_5000-chr17_30000_35000 0.714904 0.435131 1.49854 0.760584 1.56646 0.765883 25000 4.5534
header of training and test data should be the same.
- prior_window.txt is a file to list the feature set used for training models, that is like this: (tab deliminated)
H3k4me1_E Count
H3k4me1_P Count
H3k4me1_W Count
H3k4me2_E Count
H3k4me2_P Count
H3k4me2_W Count
Distance Count
Column1: feature set (header of column 2 to second last column) in training/test data, like H3k4me1_E, H3k4me1_P, H3k4me1_W, Distance.
Column2: Count (i.e. the header of the last column in training/test data)
- testset_error.txt (Predictions on the test data)
- Column1: Pair of two regions
- Column2: True HiC Count for this pair in the log scale
- Column3: Predicted HiC Count for this pair in the log scale
- Column4: squared prediction error
- Column5: genomic distance of the two regions
- trainset_error.txt (Predictions on the training data)
- Column1: Pair of two regions
- Column2: True HiC Count for this pair in the log scale
- Column3: Predicted HiC Count for this pair in the log scale
- Column4: squared prediction error
- Column5: genomic distance of the two regions
- regtree_node_0.txt (Saved regression tree models for tree 0, examples in Examples/out/)
./regForest -t Examples/Data/Gm12878_chr17_WINDOW_train0.txt -o Examples/out/ -k1 -l10 -n20 -b Examples/Data/prior_window.txt -d Examples/Data/K562_chr17_WINDOW_test0.txt
- -t is the training data in cell line 1 (Gm12878)
- -d is the test data in cell line 2 (K562)
- -o is the output directory
- -k is maxfactorsize (1 should be OK)
- -l is leaf size (e.g. 10)
- -n is the number of trees (e.g. 20)
- -b is the prior structure (potential regulators), see Data/prior_merge.txt.
- -s is the prefix of saved models
Aggregate region-level features to prepare input for genPairFeatures
Generate PAIR-CONCAT or WINDOW features as input for HiC-Reg
Generate MULTI-CELL features as input for HiC-Reg
Script for training and predicting in bash script
- CorrelationPlot.m (Run RFdistbinEval.m to make distance-stratified correlation plot and compuete AUC)
- RFdistbinEval.m (Compute distance-stratified correlation, global correlation and MSE)
Tree path analysis and rules counting of how many times a feature or a pair of features are used to predict a test pair
- featureanalysis.sh (bash script to run steps in one)