CRISPR-M is a novel multi-view deep learning model with a new feature encoding scheme for predicting sgRNA off-target effects at target sites containing indels and mismatches. CRISPR-M combines convolutional neural networks and bidirectional long short-term memory recurrent neural networks in a three-branch network that learns from multiple views. Compared to existing methods, CRISPR-M demonstrates significant performance advantages on real-world datasets. Furthermore, experimental analysis of CRISPR-M under multiple metrics reveals its capability to extract features and validates its superiority in sgRNA off-target effect prediction (see the experiments in /test).
Here is an introduction to the files in /codes. The /codes directory contains the essential components for data processing, encoding, and modeling: functions for encoding data, computing metrics, and preprocessing raw data. It also holds backup files kept for code preservation, which are not used in the current project. Note that the main programs of this project are located in the /test directory.
file | content |
---|---|
encoding.py | * Content: Contains data encoding functions. * Purpose: Likely involved in encoding or converting data for further processing in the project. |
metrics_utils.py | * Content: Contains functions for computing metrics. * Purpose: Used for evaluating and measuring the performance of the implemented models or algorithms. |
data_preprocessing_utils.py | * Content: Contains data preprocessing functions. * Purpose: Involved in preparing and cleaning the raw data for use in the project. |
positional_encoding.py | * Content: Contains the PositionalEncoding class. * Purpose: Likely related to adding positional information to the data, which is crucial in sequence-based tasks like natural language processing. |
transformer_utils.py | * Content: Contains Transformer classes. * Purpose: Central to the implementation of Transformer-based models, indicating that the project might involve sequence-to-sequence tasks or attention mechanisms. |
other files | * Content: Used for backing up code but not actually used. * Purpose: These files seem to be reserved for backup purposes and are not actively utilized in the current project. |
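Active off-target sites are rare (the training log later in this README reports 7185 positives among 560515 training samples), so the experiments track AUPRC alongside AUROC. As a rough illustration of what such a metric function computes, the following pure-Python sketch calculates average precision, one common estimate of AUPRC. It is not the code in metrics_utils.py: the function name `average_precision` is hypothetical, and tied scores receive no special treatment.

```python
def average_precision(labels, scores):
    """Average precision: mean of precision@k over the ranks k of the positives.

    Hypothetical helper for illustration; labels are 0/1, scores are floats.
    """
    # Rank samples from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("average precision is undefined without positive samples")

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


# Toy example: two active sites among four candidates.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)
print(f"average precision = {toy_ap:.4f}")
```

In the toy data the two positives sit at ranks 1 and 3, so the result is (1/1 + 2/3) / 2, roughly 0.8333.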
The table below maps the dataset names used in this GitHub repository to the dataset names used in the paper. We collect two categories of datasets for model learning and validation. One category contains both mismatches and indels, i.e., the CIRCLE and GUIDE_I datasets in Table 1; the other category contains mismatches only, i.e., all remaining datasets.

The CIRCLE dataset, obtained with the CIRCLE-seq technique, identifies 340 active off-target loci samples containing indels and 7031 active off-target loci samples containing mismatches only. It is derived from the experimental data of 10 gRNAs and contains sufficient off-target samples for each gRNA, which makes it suitable for ten-fold cross validation. The GUIDE_I dataset also contains indel samples, but only 60 active off-target loci samples.

Of the remaining datasets, we use PKD, SITE, GUIDE_II, and GUIDE_III for the mismatch-only experiments, and HEK293T and K562 for the experiments on epigenetic features. PKD has sufficient data for active off-target sites but insufficient data for inactive off-target sites. SITE has sufficient data for both active and inactive off-target sites. GUIDE_II and GUIDE_III have sufficient inactive off-target loci samples but only a small number of active off-target loci samples.
dataset name in GitHub /datasets | dataset name in paper |
---|---|
CIRCLE(mismatch&insertion&deletion) | CIRCLE |
dataset_I-2 | GUIDE_I |
PKD | Protein knockout detection (PKD) |
SITE | SITE |
GUIDE_II | GUIDE_II |
GUIDE_III | GUIDE_III |
HEK293T in "epigenetic_data" | HEK293T |
K562 in "epigenetic_data" | K562 |
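The ten-fold cross validation on CIRCLE appears to hold out the samples of one gRNA at a time, which matches the "k-th-grna-fold grna ... for train / for validation" lines in the training log of this README. A minimal sketch of such a grouped split follows, assuming each sample is a (gRNA, off-target, label) tuple. The function name `split_by_grna` and the toy records are hypothetical and are not taken from data_preprocessing_utils.py.

```python
from collections import defaultdict


def split_by_grna(samples, validation_fold):
    """Split samples into train/validation so that one gRNA is held out.

    samples: iterable of (grna, off_target, label) tuples (assumed layout).
    validation_fold: index of the gRNA (in first-seen order) used for validation.
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[0]].append(sample)  # group by the gRNA identifier

    grnas = list(groups)  # dicts keep insertion order, so folds are deterministic
    if not 0 <= validation_fold < len(grnas):
        raise IndexError("validation_fold must index an existing gRNA")

    train, validation = [], []
    for fold, grna in enumerate(grnas):
        target = validation if fold == validation_fold else train
        target.extend(groups[grna])
    return train, validation


# Toy records with made-up identifiers: three gRNAs, two samples each.
toy_samples = [
    ("GRNA_A", "site_1", 1), ("GRNA_A", "site_2", 0),
    ("GRNA_B", "site_3", 0), ("GRNA_B", "site_4", 0),
    ("GRNA_C", "site_5", 1), ("GRNA_C", "site_6", 0),
]
train_set, validation_set = split_by_grna(toy_samples, validation_fold=1)
print(len(train_set), len(validation_set))
```

Splitting by gRNA rather than by individual sample keeps every off-target site of the held-out gRNA out of training, so the validation score reflects performance on an unseen guide.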
Here are the experiments corresponding to each folder in /test. The /test directory contains the experiments, each housed in a dedicated folder. Here is an overview:
folder | usage | main program of CRISPR-M |
---|---|---|
1indel | Comparisons on Target Sites Containing Both Mismatches and Indels | \1indel\CRISPR-M\encoding_test.py |
2encoding_test | Comparisons of Encoding Schemes | \2encoding_test\mine\encoding_test.py |
3mismatch | Comparisons on Mismatches-only sgRNA-Target Prediction | \3mismatch\mine\encoding_test.py |
4multidataset | Comparisons with Complex Off-Target Site Datasets | \4multidataset\CRISPR-M\encoding_test.py |
6epigenetic | Comparisons with Epigenetic Features | \6epigenetic\CRISPR-M\encoding_test.py |
7visualization | Visual Analysis of CRISPR-M on the Off-Target Effect Prediction | \7visualization\encoding_test.py |
9random_seed_test | Impact of random seed on AUPRC results | \9random_seed_test\CRISPR_M_mismatch_test.py et al. |
10ablation | Ablation experiments | \10ablation\ablation_test.py |
other folders | discarded | |
Take the folder 2encoding_test as an example. encoding_test.py in the subfolder mine is the main program of the test; run it with python encoding_test.py. test_model.py contains the model architecture used for the test. The model defined in the function 'm81212_n13' of test_model.py is the final model of CRISPR-M. fig2.py in the subfolder fig2 is the visualization program that plots the results of several experiments.
Here is an output example of running the main program. The program will print the training process and the evaluation results of the model.
[INFO] ===== Start Loading dataset CIRCLE =====
[INFO] use 0-th-grna-fold grna (GTTGCCCCACAGGGCAGTAANGG) for train
[INFO] use 1-th-grna-fold grna (GTTGCCCCACAGGGCAGTAANGG) for train
[INFO] use 2-th-grna-fold grna (GTTGCCCCACAGGGCAGTAANGG) for train
[INFO] use 3-th-grna-fold grna (GTTGCCCCACAGGGCAGTAANGG) for train
[INFO] use 4-th-grna-fold grna (GTTGCCCCACAGGGCAGTAANGG) for train
[INFO] use 5-th-grna-fold grna (GTTGCCCCACAGGGCAGTAANGG) for validation
[INFO] use 6-th-grna-fold grna (GTTGCCCCACAGGGCAGTAANGG) for train
[INFO] use 7-th-grna-fold grna (GTTGCCCCACAGGGCAGTAANGG) for train
[INFO] use 8-th-grna-fold grna (GTTGCCCCACAGGGCAGTAANGG) for train
[INFO] use 9-th-grna-fold grna (GTTGCCCCACAGGGCAGTAANGG) for train
[INFO] train_features.shape = (560515, 24)
[INFO] train_feature_ont.shape = (560515, 24)
[INFO] train_feature_offt.shape = (560515, 24)
[INFO] train_labels.shape = (560515,), and positive samples number = 7185
[INFO] validation_features.shape = (24434, 24)
[INFO] validation_feature_ont.shape = (24434, 24)
[INFO] validation_feature_offt.shape = (24434, 24)
[INFO] validation_labels.shape = (24434,), and positive samples number = 186
[INFO] ===== End Loading dataset CIRCLE =====
[INFO] ===== Start train =====
Model: "model_n"
Total params: 1,706,040
Trainable params: 1,704,824
Non-trainable params: 1,216
Epoch 1/500
548/548 [==============================] - ETA: 0s - loss: 0.3811 - acc: 0.8585 - auroc: 0.5045 - auprc: 0.0130
Epoch 1: val_auprc improved from -inf to 0.00561, saving model to tcrispr_model.h5
548/548 [==============================] - 56s 77ms/step - loss: 0.3811 - acc: 0.8585 - auroc: 0.5045 - auprc: 0.0130 - val_loss: 0.0759 - val_acc: 0.9924 - val_auroc: 0.3886 - val_auprc: 0.0056 - lr: 0.0010
Epoch 2/500
547/548 [============================>.] - ETA: 0s - loss: 0.0846 - acc: 0.9866 - auroc: 0.5736 - auprc: 0.0191
Epoch 2: val_auprc improved from 0.00561 to 0.00761, saving model to tcrispr_model.h5
548/548 [==============================] - 39s 70ms/step - loss: 0.0846 - acc: 0.9866 - auroc: 0.5741 - auprc: 0.0192 - val_loss: 0.1405 - val_acc: 0.9924 - val_auroc: 0.5000 - val_auprc: 0.0076 - lr: 0.0010
...
Epoch 96/500
547/548 [============================>.] - ETA: 0s - loss: 0.0193 - acc: 0.9933 - auroc: 0.9905 - auprc: 0.7845
Epoch 96: val_auprc did not improve from 0.28008
548/548 [==============================] - 39s 71ms/step - loss: 0.0193 - acc: 0.9933 - auroc: 0.9905 - auprc: 0.7843 - val_loss: 0.0448 - val_acc: 0.9928 - val_auroc: 0.8442 - val_auprc: 0.2332 - lr: 7.3787e-06
Epoch 96: early stopping
[INFO] ===== End train =====
764/764 [==============================] - 18s 19ms/step - loss: 0.0419 - acc: 0.9925 - auroc: 0.9369 - auprc: 0.2801
764/764 [==============================] - 16s 16ms/step
accuracy=0.9925104362773185, precision=1.0, recall=0.016129032258064516, f1=0.031746031746031744, fbeta=0.02008032128514056 auroc=0.9468733481621808, auprc=0.28261133474683825, auroc_by_auc=0.9468733481621808, auprc_by_auc=0.2809769808292004, spearman_corr_by_pred_score=0.13454728586392598, spearman_corr_by_pred_labels=0.12652358677195874
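The threshold-based metrics in the last line of the log can be checked by hand. The sketch below assumes the confusion counts implied by the log: 186 positive samples among 24434 validation samples, recall 0.0161 = 3/186, and precision 1.0, hence TP = 3, FP = 0, FN = 183, and TN = 24248. The value beta = 2 for the F-beta score is inferred from the printed number rather than stated in the source, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn, beta=2.0):
    """Accuracy, precision, recall, F1 and F-beta from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    def f_score(b):
        # General F-measure; b = 1 gives F1, b > 1 weights recall more heavily.
        denominator = b * b * precision + recall
        return (1 + b * b) * precision * recall / denominator if denominator else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f_score(1.0),
        "fbeta": f_score(beta),
    }


# Confusion counts inferred from the log above (an assumption, see the lead-in).
metrics = classification_metrics(tp=3, fp=0, fn=183, tn=24248)
for name, value in metrics.items():
    print(f"{name}={value:.6f}")
```

These counts reproduce the logged accuracy, recall, f1, and fbeta values, and they illustrate how an accuracy above 0.99 can coexist with very low recall on such an imbalanced validation set.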
Here are the requirements of the running environment.
- Python 3.8
- tensorflow 2.9
- keras 2.9
- pandas 1.4
- numpy 1.22
- scikit-learn 1.1
- matplotlib 3.5
- seaborn 0.11
Here are the relations between visualization programs and pictures in experiments.
visualization program | experiment name | figure name |
---|---|---|
/test/1indel/mean_roc_prc.py | Comparisons on Target Sites Containing Both Mismatches and Indels | Fig. 1 |
/test/2encoding_test/fig2/fig2.py | Comparisons on Mismatches-only sgRNA-Target Prediction | Fig. 2 |
/test/2encoding_test/fig2/fig2.py | Comparisons with Complex Off-Target Site Datasets | Fig. 2 |
/test/2encoding_test/fig2/fig2.py | Comparisons of Encoding Schemes | Fig. 3 |
/test/2encoding_test/fig2/fig2.py | Comparisons with Epigenetic Features | Fig. 3 |
/test/7visualization/visual.py | Impact of random seed on AUPRC results, and results of ablation experiments. | Fig. 5 |
/test/9random_seed_test/draw_random_seed_LOGOCV.py | Visual Analysis of CRISPR-M on the Off-Target Effect Prediction | Fig. 4 |
This project is licensed under the terms of the MIT license.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software.
If you have any questions, please contact us by
Tel: (86) 22-85358850;
Fax: (86) 22-85358850;
Email: jianliu@nankai.edu.cn