To get started, clone the repo and create a conda environment based on the env.yml file:
git clone https://github.com/stasys-hub/CCUT.git
cd CCUT
mamba env create -f env.yml
mamba activate ccut
coverage run -m pytest -v
git clone https://github.com/stasys-hub/CCUT.git
cd CCUT
sudo docker build --file Dockerfile_mamba --tag ccut-mamba:9428586418 .
sudo docker run --rm -it -v </home/user/local_dir>:</mnt/data> ccut-mamba:9428586418 /bin/bash
Tip
We recommend using mamba as a drop-in replacement for conda: Miniforge.
Note
If you want to change the environment name, which is set to 'ccut' by default, change it in env.yml: name: ccut.
This will also be the environment name you have to specify when using conda/mamba, e.g.: mamba activate your-env-name.
For the Docker installation you will need to mount a local folder containing your data into the container using the -v flag.
After installation you should see a folder structure similar to this:
└── CCUT
    ├── ccut
    │   ├── data_prep
    │   ├── nn
    │   ├── tests
    │   └── utils
    └── data
data_prep
contains scripts to create sample lists and to downsample cooler and pairs files, if you want to create your own training sets.

nn
is the heart of CCUT and contains all things related to models:
- prebuilt models
- the basemodel class, if you want to plug in your own model
- layers -> contains prebuilt blocks to build your own model
- hooks for the trainer class, to give more control over training
- losses -> predefined custom losses
- and the trainer class

utils
contains modules to transform, visualize and load data, such as the CC_Dataset class.
- you will also find the main_train.py file there, which contains some examples to run training
We will update models here: Model-Archive
Please have a look at the tutorials: ccut/Tutorial-inference.ipynb & ccut/Tutorial-training.ipynb. We provide pretrained models for our efficient UNetRRDB network. If you want to load a pre-trained model, infer the model type and the data it was trained on from the naming scheme, for example: unet-1024-patchsize-porec-4x-tvloss.pth
# import the model
from utils.helpers import get_device
from utils.visualize import plot_mat
import numpy as np
from nn.rrdbunet import UNetRRDB2
# Load a model
unet = UNetRRDB2(in_channels=1, out_channels=1, features=[64, 128, 256, 512, 1024])
unet.load('../checkpoints/rrdbunet_porec-4x-50k-50x50.pth', device=get_device())
To upsample a cooler file, it first has to be converted into a .npz file:
python convert_and_normalize_v3.py <path/to/mcool/file>::/resolutions/<resolution> --prefix <filename-prefix> --output_path <path/to/output/dir> --processes 9 --chromosomes <start_chrom>-<end_chrom> --percentile 99.9 --norm
--percentile n caps all interactions at the value of the nth percentile of the Pore-C data. To define a cutoff at a specific value, use the --cutoff n flag instead.
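Conceptually, the capping and normalization performed by the script boil down to the following numpy operations (a minimal sketch for illustration only, not the convert_and_normalize_v3.py implementation; file and variable names are made up):
import numpy as np

# Illustrative only: what percentile capping + min-max normalization amount to
mat = np.load("<path/to/raw>.npz")["chr19"]   # hypothetical raw contact matrix
cap = np.percentile(mat, 99.9)                # value at the 99.9th percentile
mat = np.clip(mat, 0, cap)                    # cap all interactions at that value
mat = mat / cap                               # min-max normalize to [0, 1]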
# Load a npz to enhance
chr19_lr = np.load("<path/to/your/file>.npz")["chr19"]
# predict and upsample chrom
chr19_pred = unet.reconstruct_matrix(lr=chr19_lr, patch_size=40)
# visualization
plot_mat(np.squeeze(chr19_pred))
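If you want to keep the enhanced matrix from the example above, you can write it back to a .npz file with numpy (the output file name is just an example):
# Save the enhanced matrix for later use
np.savez_compressed("chr19_enhanced.npz", chr19=np.squeeze(chr19_pred))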
Before you use the Trainer class, some things have to be prepared. The most important thing is to have your coolers in place and a list of coordinates which will be used for training. You can generate such lists using the create_sliding_window_coor.v2.py utility, for example like this:
Tip
Use the --help flag to get some info on the parameters.
python create_sliding_window_coor.v2.py --cooler_file /home/muadip/Data/Pairs/SRR11589414_1_v2.mcool::/resolutions/20000 --output_path chr19-22_40x40x20k --resolution 50000 --window_size 40 --chromosome chr19,chr20,chr21,chr22
If you want to train based on coolers, that's all you need, apart from setting the paths to your low- and high-resolution coolers in a data.json file (an example is in ccut/data). If you want to use numpy matrices as input, you have to use the Numpy_Dataset class and prepare the matrices with convert_and_normalize_v3.py, for example like this:
python convert_and_normalize_v3.py SRR11589414_1_4x_v2.mcool::/resolutions/50000 --prefix <filename_prefix> --output_path <your/outputdir/> --processes 9 --chromosomes 1-18 --cutoff 73 --norm
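To sanity-check the converted files before training, you can open them with numpy and inspect the stored chromosome keys and matrix shapes (a small sketch; the file path is just a placeholder):
import numpy as np

# Inspect a converted .npz file (path is a placeholder)
data = np.load("<path/to/converted/file>.npz")
for chrom in data.files:                      # chromosome keys, e.g. "chr1"
    print(chrom, data[chrom].shape, data[chrom].max())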
Note
We recommend working with min-max normalized data, since the models tend to learn more effectively from it. To transform your data back to counts after training, simply multiply by the cutoff value, or by the maximum frequency if no cutoff was applied. If you do not wish to normalize, leave out the --norm flag. We recommend using a cutoff at the 99.9th percentile of the dataset, which is about 71 for Pore-C and 314 for Micro-C from Krietenstein et al.
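For example, to turn a min-max normalized prediction back into approximate contact counts, multiply by the cutoff (or maximum) used during conversion; the value 73 below is taken from the example command above, and the prediction variable is hypothetical:
# Rescale a min-max normalized prediction back to approximate contact counts
cutoff = 73                        # cutoff used in the convert_and_normalize_v3.py call above
counts = pred_normalized * cutoff  # pred_normalized: hypothetical model output in [0, 1]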
DOI: 10.1101/2024.05.29.596528
If you use this tool in your work, we would be really happy if you cite us:
@article{Sys2024,
title = {CCUT: A Versatile and Standardized Framework to Train 3C Deep Restoration Models},
url = {http://dx.doi.org/10.1101/2024.05.29.596528},
DOI = {10.1101/2024.05.29.596528},
publisher = {Cold Spring Harbor Laboratory},
author = {Sys, Stanislav Jur’evic and Ceron-Noriega, Alejandro and Kerber, Anne and Weissbach, Stephan and Schweiger, Susann and Wand, Michael and Everschor-Sitte, Karin and Gerber, Susanne},
year = {2024},
month = jun
}