HiC-Reg: In silico prediction of high-resolution Hi-C interaction matrices

HiC-Reg is a novel approach to predict contact counts from one-dimensional regulatory signals such as epigenetic marks and regulatory protein binding. HiC-Reg provides a powerful framework to generate high-resolution profiles of contact counts that can be used to study individual locus level interactions as well as higher-order organizational units of the genome.

Zhang, S., Chasman, D., Knaack, S. et al. In silico prediction of high-resolution Hi-C interaction matrices. Nat Commun 10, 5449 (2019). https://doi.org/10.1038/s41467-019-13423-8

Step 1: Generate pair features as input for HiC-Reg:

1.0 Aggregate region-level features:

Program in Scripts/aggregateSignalInRegion/

Usage:

aggregateSignal REGION hg19.fa.fai countfile outputfile

Examples:

aggregateSignal hg19_5kbp_chr17.txt hg19.fa.fai wgEncodeBroadHistoneGm12878CtcfStdRawDataRep1_chr17.counts wgEncodeBroadHistoneGm12878CtcfStdRawDataRep1_chr17.txt

Arguments:

REGION: file of 5kb bin with format like hg19_5kbp_chr17.txt
countfile: four columns count file generated from bam files

Input Files:

hg19_5000bp_chr17.txt

chr17	CONVERT	gene	0	4999	.	+	.	chr17_0_4999
chr17	CONVERT	gene	5000	9999	.	+	.	chr17_5000_9999
chr17	CONVERT	gene	10000	14999	.	+	.	chr17_10000_14999

count file: wgEncodeBroadHistoneGm12878CtcfStdRawDataRep1_chr17.counts

chr17	342	389	1
chr17	1005	1056	1
chr17	1060	1111	1
chr17	1346	1397	1

Column1: chromosome Column2: start position Column3: end position Column4: reads (tab deliminated)

1.1 Generate PAIR-CONCAT or WINDOW features:

Program in Scripts/genPairFeatures/

Usage:

./genDatasetsRH HiCsparseMatrix maxdistance foldcv splittype[regionwise|pairwise] featureinputfile correlation[yes|no] outputdir prerandomize_pairs[yes|no] featype[Window|Pconcat]

Arguments:

HiCsparseMatrix: sparse hic matrix
maxdistance: max genomic distance for pair of regions (e.g. 1000000)
foldcv: number of CV folds (e.g. 5)
splittype:
- regionwise--split the total regions in the sparse hic matrix into N folds.
- pairwise--split the total pairs in the sparse hic matrix into N folds.
featureinputfile: input file with path for each feature signal, see Gm12878_norm_featurefiles_test.txt for example.
correlation: calculate the correlation of features in region1 and features in region2 or not. (e.g. no)
outputdir: path for output directory.
prerandomize_pairs: pre-randomize the pairs in the sparse hic matrix or not. (e.g. yes)
featype:
- Pconcat: generate feature signal for region1 and region2.
- Window: generate feature signal for region1 and region2, and average feature signal for the window between these two regions.

Input Files:

HiCsparseMatrix: Gm12878_chr17_5kb_SQRTVC_counts_pairs_100.tab

chr17_0_5000	chr17_5000_10000	485.3207854051
chr17_0_5000	chr17_10000_15000	212.4988640932
chr17_0_5000	chr17_15000_20000	130.1059543547

Column1: region1 Column2: region2 Column3: HiC count (tab deliminated)

featureinputfile: Gm12878_norm_featurefiles_test.txt

H3k4me1	Gm12878_RawData_5000bp_seqdepth_norm_H3k4me1.txt	C
H3k4me2	Gm12878_RawData_5000bp_seqdepth_norm_H3k4me2.txt	C

Column1: mark name Column2: feature signal file Column3: source data type, C for Chip-seq data, M for motif data (tab deliminated)

Example:

./genDatasetsRH Gm12878_chr17_5kb_SQRTVC_counts_pairs_100.tab 1000000 5 regionwise Gm12878_norm_featurefiles_test.txt no out/ yes Window

1.2 Generate MULTI-CELL features:

Program in Scripts/genMULTICELLfeats/

Usage:

python gen-MULTICELL-features.py --inpath ./  --chr 17 --ncv 2 --outpath ./

Arguments:

--inpath: path of input feature data
--chr: chromosome of input feature data
--ncv: number of CV folds
--outpath: path of output MULTI-CELL feature data

Input Files:

WINDOW feature data generated by Step 1.1 for five cell lines: Gm12878, K562, Huvec, Hmec, Nhek. Examples in Scripts/genMULTICELLfeats:

Gm12878_chr17_5kb_test.txt
K562_chr17_5kb_test.txt
Huvec_chr17_5kb_test.txt
Hmec_chr17_5kb_test.txt
Nhek_chr17_5kb_test.txt

Step2: Train HiC-Reg models and make predictions

2.1 Training mode:

Train on training data and Predict on test set:

./regForest -t Examples/Data/Gm12878_chr17_WINDOW_train0.txt -o Examples/out/ -k1 -l10 -n20 -b Examples/Data/prior_window.txt -d Examples/Data/Gm12878_chr17_WINDOW_test0.txt

2.2 Prediction mode:

Predict on test sets with Models trained:

./regForest -t Examples/Data/Gm12878_chr17_WINDOW_train0.txt -o out/ -k1 -l10 -n20 -b Examples/Data/prior_window.txt -d Examples/Data/Gm12878_chr17_WINDOW_test0.txt -s Examples/out/regtree_node

Arguments:

-t is the training data
-o is the output directory
-k is maxfactorsize (1 should be OK)
-l is leaf size (e.g. 10)
-n is the number of trees (e.g. 20)
-b is the prior structure (potential regulators), see Examples/Data/prior_window.txt
-d is the test data
-s is the prefix of saved models (e.g. Examples/out/regtree_node)

Input Files:

training and test data genreated by Step 1: (tab deliminated)

Pair	H3k4me1_E	H3k4me2_E	H3k4me1_P	H3k4me2_P	H3k4me1_W	H3k4me2_W	Distance	Count
chr17_0_5000-chr17_10000_15000	0.714904	0.435131	3.76434	1.3964	0.83982	0.418721	5000	5.36363
chr17_0_5000-chr17_15000_20000	0.714904	0.435131	1.73117	0.869143	2.30208	0.90756	10000	4.87601
chr17_0_5000-chr17_25000_30000	0.714904	0.435131	0.881374	0.708816	1.73773	0.78015	20000	4.76591
chr17_0_5000-chr17_30000_35000	0.714904	0.435131	1.49854	0.760584	1.56646	0.765883	25000	4.5534

header of training and test data should be the same.

prior_window.txt is a file to list the feature set used for training models, that is like this: (tab deliminated)

H3k4me1_E    Count
H3k4me1_P    Count
H3k4me1_W    Count
H3k4me2_E    Count
H3k4me2_P    Count
H3k4me2_W    Count
Distance    Count

Column1: feature set (header of column 2 to second last column) in training/test data, like H3k4me1_E, H3k4me1_P, H3k4me1_W, Distance.

Column2: Count (i.e. the header of the last column in training/test data)

Output Files:

testset_error.txt (Predictions on the test data)

Column1: Pair of two regions
Column2: True HiC Count for this pair in the log scale
Column3: Predicted HiC Count for this pair in the log scale
Column4: squared prediction error
Column5: genomic distance of the two regions

trainset_error.txt (Predictions on the training data)

Column1: Pair of two regions
Column2: True HiC Count for this pair in the log scale
Column3: Predicted HiC Count for this pair in the log scale
Column4: squared prediction error
Column5: genomic distance of the two regions

regtree_node_0.txt (Saved regression tree models for tree 0, examples in Examples/out/)

Application: Make predictions in a new cell line

Train models using training data in Gm12878 and make predictions in K562:

./regForest -t Examples/Data/Gm12878_chr17_WINDOW_train0.txt -o Examples/out/ -k1 -l10 -n20 -b Examples/Data/prior_window.txt -d Examples/Data/K562_chr17_WINDOW_test0.txt

Arguments:

-t is the training data in cell line 1 (Gm12878)
-d is the test data in cell line 2 (K562)
-o is the output directory
-k is maxfactorsize (1 should be OK)
-l is leaf size (e.g. 10)
-n is the number of trees (e.g. 20)
-b is the prior structure (potential regulators), see Data/prior_merge.txt.
-s is the prefix of saved models

Scripts: all scripts for HiC-Reg

aggregateSignalInRegion/

Aggregate region-level features to prepare input for genPairFeatures

genPairFeatures/

Generate PAIR-CONCAT or WINDOW features as input for HiC-Reg

genMULTICELLfeats/

Generate MULTI-CELL features as input for HiC-Reg

LearnModelsPred/

Script for training and predicting in bash script

Evaluation/

CorrelationPlot.m (Run RFdistbinEval.m to make distance-stratified correlation plot and compuete AUC)
RFdistbinEval.m (Compute distance-stratified correlation, global correlation and MSE)

FeatureAnalysis/

Tree path analysis and rules counting of how many times a feature or a pair of features are used to predict a test pair

featureanalysis.sh (bash script to run steps in one)

Roy-lab/HiC-Reg

HiC-Reg: In silico prediction of high-resolution Hi-C interaction matrices

Step 1: Generate pair features as input for HiC-Reg:

1.0 Aggregate region-level features:

Usage:

Examples:

Arguments:

Input Files:

1.1 Generate PAIR-CONCAT or WINDOW features:

Usage:

Arguments:

Input Files:

Example:

1.2 Generate MULTI-CELL features:

Usage:

Arguments:

Input Files:

Step2: Train HiC-Reg models and make predictions

2.1 Training mode:

Train on training data and Predict on test set:

2.2 Prediction mode:

Predict on test sets with Models trained:

Arguments:

Input Files:

Output Files:

Application: Make predictions in a new cell line

Train models using training data in Gm12878 and make predictions in K562:

Arguments:

Scripts: all scripts for HiC-Reg

aggregateSignalInRegion/

genPairFeatures/

genMULTICELLfeats/

LearnModelsPred/

Evaluation/

FeatureAnalysis/