HiCDiffusion

What is HiCDiffusion?

Diffusion-based, from-sequence Hi-C matrices predictor.

If you use HiCDiffusion in your research, we kindly ask you to cite the following publication:

TBD

To test the software using data from C.Origami, download the data from there: https://zenodo.org/record/7226561/files/corigami_data_gm12878_add_on.tar.gz?download=1 And get the hic folder to the main folder of the software. You should also get reference genome and put it into the main folder, e.g., GRCh38_full_analysis_set_plus_decoy_hla.fa (which can be obtained from 1000 Genomes project: https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/)

Requirements:

torch
lightning
wandb
torchvision
pandas
numpy
denoising_diffusion_pytorch (https://github.com/lucidrains/denoising-diffusion-pytorch/tree/main)
biopython
pyranges
scikit-image
scipy
hicstraw

The full way of training can be done using the following command:

python run_experiment.py

This command trains the encoder-decoder architecture, then diffusion model built upon that, and then tests it.

Parameters for the default pipeline (as well as ALL the other training scripts) are:

Short option	Long option	Description
-f	--hic_filename	.mcool file that will be used as the dataset. Without this parameter, you need data from the C.Origami paper (it will try to perform a comparison based on their data).
-t	--test_chr	Test chromosome that the pipeline will use only for the testing in last stage.
-v	--val_chr	Validation chromosome that the pipeline will use for determining best model (based on loss on val set). It is not used in training.

Additionally, in train_hicdiff.py and test_hicdiff.py we have:

Short option	Long option	Description
-m	--model	Path to the model (in case of training HiCDiffusion it is encoder-decoder model, in case of testing, it's final diffusion model)