CombMotif is a method based on NeuronMotif and motif mutagenesis. It discovers diverse, high-quality motifs and efficiently identifies motif interactions in mRNA. Using this method, we systematically analyzed the motif syntax learned by two types of deep learning models: an MRL predictor and a half-life predictor. The interpretation results for both models align with known biological phenomena and also include previously unknown motif syntax, providing novel insights for biologists.
git clone https://git.tsinghua.edu.cn/zengxc22/combmotif.git
cd combmotif
This repository is tested on Python 3.7, PyTorch 1.10.0+cu111, and MEME 5.4.1. You can create a virtual environment and install the dependencies with the following commands.
conda create -n combmotif python=3.7
conda activate combmotif
pip install -r requirements.txt
The data used for training and interpretation can be downloaded from here, or generated yourself with the scripts in the generate_dataset directory. Trained model weights can be downloaded from here. The downloaded folder should be placed under the root folder.
We trained two models: a half-life predictor and an MRL (Mean Ribosome Load) predictor. They share the same architecture, a hybrid network consisting of CNN and GRU layers, which was proposed by Agarwal and Kelley [1]. The model weights can be downloaded from the following links and should be placed in the model_weights folder under the root folder.
Model name | Dataset | Weights | Performance | Params | Num_layers |
---|---|---|---|---|---|
Half-life predictor | f0_c0_data0_wholeseq.h5 | hl_predictor | r=0.73 | 160k | 6 |
MRL predictor | GSM3130435_egfp_unmod_1.csv | mrl_predictor | R²=0.94 | 90k | 3 |
MRL predictor noAUG | utr_mrl_non_AUG_alldataset.csv | noAUG | R²=0.40 | 90k | 3 |
Half-life predictor mouse | f0_c0_data1_wholeseq.h5 | mouse | r=0.66 | 160k | 6 |
To train the half-life predictor or the MRL predictor from scratch, please refer to mrl or half-life.
Here, we implement NeuronMotif [2] in PyTorch. With NeuronMotif, we have discovered numerous biologically meaningful motifs and motif combinations in mRNA. We have also explored other interpretation methods, such as TF-MoDISco [3] and Maximum Activation Seqlet. Assuming that you have obtained the trained model weights and installed MEME 5.4.1, you can start interpreting models with NeuronMotif. As a quick start, we interpret neurons 1 to 4 in the 2nd conv layer of the half-life predictor.
cd motif_discovery/single_thread_script
bash main.sh 2 2 ../configs/interpreting/hl_predictor_quick_start.yaml
The interpretation results consist of two main parts. The first part is the visualization of each neuron, located in the motif_discovery/clustering/hl_predictor/conv2-mechanic folder. Each visualization is an HTML file; to view it, the JavaScript tools from the "utils" folder need to be placed in the same directory as the HTML file.
--js
--jseqlogo.js
--conv2_neuron1.html
The second part is the comparison of the discovered motifs against a standard motif database using Tomtom. These results are stored in the motif_discovery/tomtom_match_results/hl_predictor folder. If you want to interpret every neuron of the half-life predictor or of other models, please refer to motif discovery for more details.
It is important to explore the contribution of each neuron to the final model predictions. We consider a neuron positive if the model's prediction is higher when the neuron is activated (its activation value exceeds a certain threshold) than when it is not activated. The motif visualized by a positive neuron is also considered a positive motif.
For a quick start, we only search for the maximum activation of the neurons in conv2 and then compute their contributions by running the following commands.
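The positive/negative rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the repository's implementation: the function name `neuron_contribution`, the use of the mean as the summary statistic, and the toy numbers are all assumptions made for this example.

```python
from statistics import mean

def neuron_contribution(activations, predictions, threshold):
    """Label one neuron by comparing predictions with and without activation.

    activations[i] is the neuron's activation on sequence i and
    predictions[i] is the model output for the same sequence.
    Hypothetical helper: averaging is an assumption of this sketch.
    """
    active = [p for a, p in zip(activations, predictions) if a > threshold]
    inactive = [p for a, p in zip(activations, predictions) if a <= threshold]
    if not active or not inactive:
        # One of the two groups is empty, so no comparison is possible.
        return "undetermined", None
    diff = mean(active) - mean(inactive)
    if diff > 0:
        return "positive", diff
    if diff < 0:
        return "negative", diff
    return "neutral", diff

# Toy values for illustration only (not taken from any dataset).
acts = [0.1, 0.9, 0.8, 0.2, 0.05]
preds = [1.0, 2.0, 2.4, 0.8, 1.2]
label, diff = neuron_contribution(acts, preds, threshold=0.5)
print(label, round(diff, 2))
```

In this toy case the two activated sequences have a higher average prediction than the three non-activated ones, so the neuron would be labeled positive.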
cd motif_contribution/script
bash run_search_maxact.sh 2 2 ../configs/interaction/hl_predictor.yaml
bash run_contribution.sh 2 2 ../configs/interaction/hl_predictor.yaml
You will obtain the final results in motif_contribution/results/hf_predictor_neuron_contribution.csv. You should now have a general understanding of whether each neuron enhances or suppresses half-life. You can then infer whether the contribution of the motif visualized by each neuron is positive or negative based on the previous interpretation results.
We applied motif mutagenesis to study the interactions between motifs. We consider two types of interactions: synergistic or antagonistic epistasis. We only consider motif pairs discovered by NeuronMotif, as these pairs are more likely to have interactions.
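As a rough illustration of the two interaction types, the sketch below compares the effect of mutating both motifs with the sum of the two single-motif effects. The additive null model, the function name `classify_epistasis`, and the toy prediction values are assumptions made for this example; the repository may define and test interactions differently.

```python
def classify_epistasis(wt, mut_a, mut_b, mut_ab, tol=1e-6):
    """Classify a motif pair under an additive null model (an assumption).

    wt      prediction for the unmodified sequence
    mut_a   prediction with only motif A mutated
    mut_b   prediction with only motif B mutated
    mut_ab  prediction with both motifs mutated
    """
    d_a = mut_a - wt
    d_b = mut_b - wt
    d_ab = mut_ab - wt
    expected = d_a + d_b          # effect expected if the motifs act independently
    epsilon = d_ab - expected     # deviation from additivity
    if abs(epsilon) <= tol or abs(expected) <= tol:
        return "none", epsilon
    if epsilon * expected > 0:
        # The double mutant deviates further in the expected direction.
        return "synergistic", epsilon
    return "antagonistic", epsilon

# Toy predictions for illustration only.
kind_strong, eps_strong = classify_epistasis(2.0, 1.8, 1.7, 1.0)
kind_weak, eps_weak = classify_epistasis(2.0, 1.8, 1.7, 1.9)
print(kind_strong, kind_weak)
```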
First, we will identify sequences in the dataset that contain the corresponding motif pairs. For simplicity, we will only consider Conv5.
cd motif_interaction/script
bash run_search_maxact.sh 5 5 ../configs/interaction/hl_predictor.yaml
bash run_fragment_location.sh 5 ../configs/interaction/hl_predictor.yaml
The output file can be found in motif_interaction/fragment_location/hl_predictor. Take conv5_neuron2_fragment_location.yaml for example.
Seq9510_0: 86
Seq9519_0: 702
Seq9588_0: 5373
Seq9751_0: 1139
Seq9776_0: 2582
Seq9832_0: 2111
Seq9913_0: 2007
Seq9978_0: 1018
Seq9510_0: 86 means that the sequence slice between the 86th and 157th nucleotides of the 9,510th sequence in the training set can activate conv5_neuron2. Because the receptive field of conv5 is 72, the length of the sequence slice is 72 nt.
We assume that conv5_neuron2 has learned the combination of TGTANA and GGAC, so the sequence slices in conv5_neuron2_fragment_location.yaml are very likely to contain this motif combination. We then filter the sequence slices that contain the motif pair from conv5_neuron2_fragment_location.yaml and perform motif mutagenesis on these sequences. Finally, we perform a Wilcoxon signed-rank test on their model predictions.
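The coordinate arithmetic can be checked with a short Python sketch. It assumes, following the description above, that the stored value is a 1-based start position and that the slice is inclusive on both ends; the helper names are hypothetical, and the parser only handles the flat `key: value` lines shown in the example file.

```python
RECEPTIVE_FIELD = 72  # receptive field of conv5, as stated in the text

def parse_fragment_locations(text):
    """Parse flat 'SeqID: start' lines into a dict (a small YAML subset)."""
    locations = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, value = line.split(":", 1)
        locations[key.strip()] = int(value)
    return locations

def fragment_span(start, receptive_field=RECEPTIVE_FIELD):
    """Return the inclusive (first, last) positions, 1-based (assumed)."""
    return start, start + receptive_field - 1

def extract_fragment(sequence, start, receptive_field=RECEPTIVE_FIELD):
    """Cut the slice out of a sequence string."""
    first, last = fragment_span(start, receptive_field)
    # Convert 1-based inclusive positions to a 0-based Python slice.
    return sequence[first - 1:last]

example = "Seq9510_0: 86\nSeq9519_0: 702\n"
locs = parse_fragment_locations(example)
span = fragment_span(locs["Seq9510_0"])
print(span)  # (86, 157)
```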
bash run_scr.sh 5 0 64 ../configs/interaction/hl_predictor.yaml
The results can be found in motif_interaction/results/scramble_res/hl_predicto.
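The filtering and mutagenesis steps described above can be sketched as follows. This is a hedged illustration, not the code behind run_scr.sh: it assumes that mutagenesis means shuffling the nucleotides of a motif occurrence, that N in TGTANA stands for any nucleotide, and that only the first occurrence is scrambled. The helper names and the toy fragment are hypothetical.

```python
import random
import re

# 'N' is treated as any nucleotide (IUPAC convention); motifs are from the text.
MOTIF_A = re.compile(r"TGTA[ACGT]A")
MOTIF_B = re.compile(r"GGAC")

def contains_pair(fragment):
    """True if the fragment contains both motifs at least once."""
    return bool(MOTIF_A.search(fragment)) and bool(MOTIF_B.search(fragment))

def scramble_motif(fragment, pattern, rng, max_tries=20):
    """Shuffle the bases of the first match so it no longer matches the motif.

    Length and nucleotide composition of the fragment are preserved.
    """
    match = pattern.search(fragment)
    if match is None:
        return fragment
    bases = list(match.group())
    candidate = match.group()
    for _ in range(max_tries):
        rng.shuffle(bases)
        candidate = "".join(bases)
        if not pattern.fullmatch(candidate):
            break  # the shuffled bases no longer form the motif
    return fragment[:match.start()] + candidate + fragment[match.end():]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
fragment = "CCCTGTAAACCCCCGGACCCC"  # toy fragment, not from the dataset
has_pair = contains_pair(fragment)
start = MOTIF_A.search(fragment).start()
mutated = scramble_motif(fragment, MOTIF_A, rng)
print(has_pair, mutated)
```

The model predictions for the original and scrambled fragments form paired samples, which could then be compared with a paired test such as `scipy.stats.wilcoxon`.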
1. Agarwal, Vikram, and David R. Kelley. "The genetic and biochemical determinants of mRNA degradation rates in mammals." _Genome Biology_ 23.1 (2022): 245.
2. Wei, Zheng, et al. "NeuronMotif: Deciphering cis-regulatory codes by layer-wise demixing of deep neural networks." _Proceedings of the National Academy of Sciences_ 120.15 (2023): e2216698120.
3. Shrikumar, Avanti, et al. "Technical note on transcription factor motif discovery from importance scores (TF-MoDISco) version 0.5.6.5." _arXiv preprint arXiv:1811.00416_ (2018).