/LoopSage

An energy-based model for loop extrusion.

Primary LanguagePythonGNU General Public License v3.0GPL-3.0

LoopSage

An energy-based model for loop extrusion.

Important!!! Now the molecular dynamics can run for larger regions, since the code or LoopSage_md.py is fixed and does not allocate so much memory.

Publication

Please cite the method paper in case that you would like to use this model for your work,

  • Korsak, Sevastianos, and Dariusz Plewczynski. "LoopSage: An energy-based Monte Carlo approach for the loop extrusion modelling of chromatin." Methods (2024).

The model

Let's assume that each cohesin $i$ can be represented of two coordinates $(m_{i},n_{i})$ we allow three moves in our simulation:

  • Slide right ( $n_{i} -> n_{i+1}$ to the right).
  • Slide left ( $m_{i} -> m_{i-1}$ to the left).
  • Rebind somewhere else.

The main idea of the algorithm is to ensemble loop extrusion from a Boltzmann probability distribution, with Hamiltonian,

$$E = \dfrac{f}{N_{fold}}\sum_{i=1}^{N_{coh}}\log(n_i-m_i)+\dfrac{\kappa}{N_{cross}}\sum_{i,j}K(m_i,n_i;m_j,n_j)+\dfrac{b}{N_{bind}}\sum_{i=1}^{N_{coh}}\left(L(m_i)+R(n_i)\right)$$

The first term corresponds to the folding of chromatin, and the second term is a penalty for the appearance of crosss. Therefore, we have the function, $K(m_{i},n_{i};m_{j},n_{j})$ which takes the value 1 when $m_{i} < m_{j} < n_{i} < n_{j}$ or $m_{i}=m_{j}$ or $m_{i}=n_{j}$.

These $L(\cdot), R(\cdot)$ functions are two functions that define the binding potential and they are orientation specific - so they are different for left and right position of cohesin (because CTCF motifs are orientation specific), therefore when we have a gap in these functions, it means presence of CTCF. These two functions are derived from data with CTCF binning and by running the script for probabilistic orientation. Moreover, by $N_{(\cdot)}$ we symbolize the normalization constants for each factor,

$$N_{fold}=N_{coh}\cdot \langle n_i-m_i\rangle,\quad N_{cross}=N_{coh},\quad N_{bind}=\sum_{k}\left(L(k)+R(k)\right).$$

An additional term which allows other protein factors that may act as barriers for the motion of LEFs can me optionally added,

$$E_{bw}=\sum_{i=1}^{N_{\text{bw}}}\frac{r_{i}}{N_{bw,i}}\sum_{j=1}^{N_{lef}}\left(W_i(m_j)+W_i(n_j)\right),\qquad N_{bw,i} = \sum_{k}W_i(k)$$

where $N_{bw}$ is the number of \texttt{.BigWig} ChIP-Seq experiments that are imported for each one of the proteins of interest. The energy term $W_{i}(\cdot)$ it corresponds to the average values of the ChIP-Seq experiment $i$ for each loci of interest. Finally, $r_{i}$ it is a weight that corresponds each one of these experiments and by default it is set as $r_{i}=b/2$.

Thus,

$$\Delta E = \Delta E_{fold}+\Delta E_{cross}+\Delta E_{bind} + \Delta E_{bw}.$$

In this manner we accept a move in two cases:

  • If $\Delta E<0$ or,
  • if $\Delta E>0$ with probability $e^{-\Delta E/kT}$.

And of course, the result - the distribution of loops in equilibrium - depends on temperature of Boltzmann distribution $T$.

Packages to install

Install a python 3.10 environment and OpenMM 8.0 and run

pip install -r requirements.txt

For faster computations use CUDA toolkit: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=20.04&target_type=deb_local.

How to use?

The main script is LoopSage.py. The implementation of the code is very easy and it can be described in the following lines,

N_steps, MC_step, burnin, T, T_min = int(1e4), int(2e2), 1000, 5,1
region, chrom = [88271457,88851999], 'chr10'
label=f'Petros_wt1h'
bedpe_file = '/mnt/raid/data/Petros_project/loops/wt1h_pooled_2.bedpe'
coh_track_file = '/mnt/raid/data/Petros_project/bw/RAD21_ChIPseq/mm_BMDM_WT_Rad21_heme_60min.bw'
bw_file1 = '/mnt/raid/data/Petros_project/bw/BACH1_ChIPseq/mm_Bach1_1h_rep1_heme_merged.bw'
bw_file2 = '/mnt/raid/data/Petros_project/bw/RNAPol_ChIPseq/WT-RNAPOLIIS2-1h-heme100-rep1_S5.bw'
bw_files = [bw_file1,bw_file2]

sim = LoopSage(region,chrom,bedpe_file,label=label,N_lef=50,N_beads=1000,bw_files=bw_files,track_file=coh_track_file)
Es, Ms, Ns, Bs, Ks, Fs, ufs = sim.run_energy_minimization(N_steps,MC_step,burnin,T,T_min,poisson_choice=True,mode='Annealing',viz=True,save=True)
sim.run_MD()

Firstly, we need to define the input files from which LoopSage would take the information to construct the potential. We define also the specific region that we would like to model. Therefore, in the code script above we define a bedpe_file from which information about the CTCF loops it is imported. In coh_track_file you can optionally define the track file with some cohesin coverage of ChIP-Seq to determine the distribution of LEFs and allow preferencial binding in regions with higher signal. Then the user can optionally define a list of BigWig files which are needed in case that user would like to model other protein factors and their coefficients $r_i$.

Note that the .bedpe_file must be in the following format,

chr1	903917	906857	chr1	937535	939471	16	3.2197903072213415e-05	0.9431392038374097
chr1	979970	987923	chr1	1000339	1005916	56	0.00010385804708107556	0.9755733944997329
chr1	980444	982098	chr1	1063024	1065328	12	0.15405319074060866	0.999801529750033
chr1	981076	985322	chr1	1030933	1034105	36	9.916593137693526e-05	0.01617512105347667
chr1	982171	985182	chr1	990837	995510	27	2.7536240913152036e-05	0.5549511180231224
chr1	982867	987410	chr1	1061124	1066833	71	1.105408615726611e-05	0.9995462969421808
chr1	983923	985322	chr1	1017610	1019841	11	1.7716275555648395e-06	0.10890923034907056
chr1	984250	986141	chr1	1013038	1015474	14	1.7716282101935205e-06	0.025665007111095667
chr1	990949	994698	chr1	1001076	1003483	34	0.5386388489931403	0.9942742844900859
chr1	991375	993240	chr1	1062647	1064919	15	1.0	0.9997541297643132

where the last two columns represent the probabilites for left and right anchor respectively to be tandem right. If the probability is negative it means that no CTCF motif was detected in this anchor. You can extract these probabilities from the repo: https://github.com/SFGLab/3d-analysis-toolkit, with find_motifs.py file.

Then, we define the main parameters of the simulation N_beads,N_coh,kappa,f,b or we can choose the default ones (take care because it might be the case that they are not the appropriate ones and they need to be changed), the parameters of Monte Carlo N_steps, MC_step, burnin, T, and we initialize the class LoopSage(). The command sim.run_energy_minimization() corresponds to the stochastic Monte Carlo simulation, and it produces a set of cohesin constraints as a result (Ms, Ns). Note that the stochastic simulation has two modes: Annealing and Metropolis. We feed cohesin constraints to the molecular simulation part of and we run MD_LE() or MD_EM() simulation which produces a trajectory of 3d-structures, and the average heatmap. MD_LE() function can produce an actual trajectory and a .dcd video of how simulation changes over time. However, the implementation needs a high amount of memory since we need to define a bond for each time step, and it may not work for large systems. EM_LE() is suggested for large simulations, because it does not require so big amount of memory.

Output Files

In the output files, simulation produces one folder with 4 subfolders. In subfolder plots, you can find plots that are the diagnostics of the algorithm. One of the most basic results you should see is the trajectories of cohesins (LEFs). this diagram should look like that,

coh_trajectories

In this diagram, each LEF is represented by a different colour. In case of Simulated Annealing, LEFs should shape shorter loops in the first simulation steps, since they have higher kinetic energy due to the high temperature, and very stable large loops in the final steps if the final temperature $T_f$ is low enough. Horizontal lines represent the presence of CTCF points. In case of Metropolis, the distribution of LEFs should look invariant in respect to computational time,

coh_trajectories

Good cohesin trajectory diagrams should be like the ones previously shown, which means that we do not want to see many unoccupied (white) regions, but we also do not like to see static loops. If the loops are static then it is better to choose higher temperature, or bigger number of LEFs. If the loops are too small, maybe it is better to choose smaller temperature.

Now, to reassure that our algorithm works well we need to observe some fundamental diagnostics of Monte Carlo algorithms. Some other important diagnostics can be seen in the following picture,

github_diagnostics

In graph A, we can see the plot of the energy as a function of simulation time. In Metropolis after some steps, the simulation should reach equillibrium after the defined burnin period (blue vertical line). In case of Simulated Annealing, the energy should decrease as function of time because the temperature decreases, and thus the energy should not be conserved. Autocorrelations (B), show us if the thermodyncamic ensembles that we obtained are autocorrelated or not. The Monte Carlo step (sampling frequency) should be big enough so as to have small autocorrelations (<0.5). The averaged heatmap (C), shows the simulated heatmap, after averaging inversed all-versus-all distances of the region of interest. Finally, (D) shows the Pearson correlation between projected 1D signal of heatmap of experimental and simulated data.

In the output folder there are another three subfolders:

  • pdbs has the ensembles of 3D structures in .cif format (it can open with vizualization software Chimera: https://www.cgl.ucsf.edu/chimera/),
  • heatmaps with the inversed all-versus-all distance heatmap of each one of these structures.
  • other here are some numpy arrays and some computed statistics. Numpy arrays like Ms and Ns have the degrees of freedoms of LEFs over time, then Fs, Ks, Es they have folding, corssing energy and total energy over time. Ts is the temperature over time. And finally in other.txt you can see the statistics of the simulation and the input parameters. In correlations.txt you can find a file with Pearson, Spearmann and Kendau tau correlations between estimated and experimental data. We provide an optimistic simulations where zeros of the signal are taken into account, and a pessimistic one where the zeros are not taken into account.

An example, illustrated with Chimera software, simulated trajectory of structures after running Simulated Annealing and molecular dynamics.

siman_traj_GitHub