Improved RNA homology detection and alignment by automatic iterative search in an expanded database
Hardware Requirements: A system with 64 GB RAM and 1.5 TB of disk space is recommended to support the in-memory operations for RNA sequences shorter than 500 nucleotides. Multiple CPU threads are also recommended, as generating the MSA is computationally expensive.
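As a rough pre-flight check, the sketch below compares the current machine against the recommended figures above. It assumes a Linux system with /proc/meminfo, df, and nproc available. The function names `meets_minimum` and `report` are illustrative and not part of RNAcmap2.

```shell
#!/usr/bin/env bash
# Sketch only: compares this machine with the recommended minimums
# (64 GB RAM, 1.5 TB disk). Helper names are hypothetical.

# Succeeds when the measured value ($1) is at least the required value ($2).
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# Print one status line: label, measured value, required value, unit.
report() {
    if meets_minimum "$2" "$3"; then
        echo "OK   $1: $2 $4 (recommended >= $3 $4)"
    else
        echo "LOW  $1: $2 $4 (recommended >= $3 $4)"
    fi
}

# Total RAM in GB from /proc/meminfo (Linux only). The kernel usually reports
# slightly less than the installed amount, so treat a near miss as a pass.
ram_gb=$(awk '/^MemTotal:/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo 2>/dev/null)
ram_gb=${ram_gb:-0}

# Free space in GB on the filesystem that holds the current directory.
disk_gb=$(df -Pk . | awk 'NR == 2 {printf "%d", $4 / 1024 / 1024}')
disk_gb=${disk_gb:-0}

report "RAM" "$ram_gb" 64 GB
report "Disk" "$disk_gb" 1500 GB
echo "CPU threads available: $(nproc)"
```

Run it from the directory where the database will be downloaded so that the disk figure refers to the right filesystem.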
Software Requirements:
RNAcmap2 has been tested on Ubuntu 14.04, 16.04, and 18.04 operating systems.
Clone RNAcmap2 github repo:
git clone https://github.com/jaswindersingh2/RNAcmap2.git && cd RNAcmap2
Run the following command to create the Conda virtual environment and install the Conda dependencies:
conda env create --file environment.yaml
conda activate venv_rnacmap2
For mfDCA and plmDCA:
pip install pydca
For PLMC:
git clone https://github.com/debbiemarkslab/plmc && cd plmc && make all-openmp && cd -
For GREMLIN:
git clone "https://github.com/sokrypton/GREMLIN_CPP" && cd GREMLIN_CPP && g++ -O3 -std=c++0x -o gremlin_cpp gremlin_cpp.cpp -fopenmp && cd ../
./db_download.sh
makeblastdb -in ./database/nt_metagenomics_database/nt_metagenomics2 -dbtype nucl
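Because formatting a database of this size takes a while, it can help to confirm that the index was actually written before starting a run. The sketch below is one way to do that. The function name `blastdb_exists` is illustrative. It assumes the usual makeblastdb output for nucleotide databases: `.nhr`, `.nin`, and `.nsq` files for a single volume, or a `.nal` alias file when the database is split into volumes.

```shell
#!/usr/bin/env bash
# Sketch only: checks for the index files that makeblastdb normally writes
# for a nucleotide database. The helper name is hypothetical.

# Succeeds if BLAST nucleotide index files exist for the given database prefix.
blastdb_exists() {
    local prefix=$1
    # Multi-volume databases are tied together by an alias file <prefix>.nal.
    [ -e "$prefix.nal" ] && return 0
    # Single-volume databases have header, index, and sequence files.
    [ -e "$prefix.nhr" ] && [ -e "$prefix.nin" ] && [ -e "$prefix.nsq" ]
}

db=./database/nt_metagenomics_database/nt_metagenomics2
if blastdb_exists "$db"; then
    echo "BLAST database index found at $db"
else
    echo "No BLAST index found at $db; rerun makeblastdb" >&2
fi
```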
To run RNAcmap2:
./run_rnacmap2.sh 6p2h_A.fasta mfdca ./database/nt_metagenomics_database/nt_metagenomics2
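Since the hardware recommendation is stated for sequences shorter than 500, it may be worth checking the query length before launching a run. The following is a minimal sketch. The function name `fasta_seq_length` is illustrative, and the warning threshold simply mirrors the hardware note rather than any limit enforced by RNAcmap2 itself.

```shell
#!/usr/bin/env bash
# Sketch only: reports the length of the query sequence before a run.
# The helper name is hypothetical.

# Print the total number of sequence characters in a FASTA file.
# Header lines (starting with ">") and whitespace are not counted.
fasta_seq_length() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

query=${1:-6p2h_A.fasta}
if [ -f "$query" ]; then
    len=$(fasta_seq_length "$query")
    echo "$query: $len nucleotides"
    if [ "$len" -ge 500 ]; then
        echo "Warning: the recommended hardware is stated for lengths below 500" >&2
    fi
else
    echo "Query file $query not found" >&2
fi
```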
Refer to the benchmarking folder of this repo.
- cmbuild, cmcalibrate, and cmsearch from INFERNAL tool version 1.1.4
- esl-reformat from easel tool version 0.48
- blastn and makeblastdb from BLAST tool version 2.11.0
- RNAfold from ViennaRNA version 2.4.18
- utils/reformat.pl from HHsuite-github-repo
- utils/getpssm.pl and utils/parse_blastn_local.pl from RNAsol standalone program
- utils/seqkit from seqkit toolkit
- PLMC from plmc-github-repo
- GREMLIN from gremlin-github-repo
- mfDCA and plmDCA from pydca-github-repo
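A quick way to see whether the command-line tools listed above are reachable is to look them up on PATH. The sketch below does this for the INFERNAL, easel, BLAST, and ViennaRNA executables. It assumes these are expected on PATH once the Conda environment is active, which this README does not state explicitly. The function name `missing_tools` is illustrative.

```shell
#!/usr/bin/env bash
# Sketch only: lists executables from the tool list that are not on PATH.
# The helper name is hypothetical.

# Print the name of every command in the argument list that cannot be found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools cmbuild cmcalibrate cmsearch esl-reformat blastn makeblastdb RNAfold)
if [ -z "$missing" ]; then
    echo "All listed executables were found on PATH"
else
    echo "Not found on PATH:"
    echo "$missing"
fi
```

The scripts under utils/ ship with the repository, so they are not included in this check.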
If you use RNAcmap2 for your research, please cite the following paper:
Jaswinder Singh, Kuldip Paliwal, Jaspreet Singh, Thomas Litfin, and Yaoqi Zhou. "Improved RNA homology detection and alignment by automatic iterative search in an expanded database."
If you use the RNAcmap2 pipeline, please consider citing the following papers:
BLAST-N:
[1] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), pp.3389-3402.
INFERNAL:
[2] Nawrocki, E.P. and Eddy, S.R., 2013. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics, 29(22), pp.2933-2935.
RNAfold:
[3] Lorenz, R., Bernhart, S.H., Zu Siederdissen, C.H., Tafer, H., Flamm, C., Stadler, P.F. and Hofacker, I.L., 2011. ViennaRNA Package 2.0. Algorithms for molecular biology, 6(1), pp.1-14.
RNAcmap Pipeline:
[4] Zhang, T., Singh, J., Litfin, T., Zhan, J., Paliwal, K. and Zhou, Y., 2021. RNAcmap: a fully automatic pipeline for predicting contact maps of RNAs by evolutionary coupling analysis. Bioinformatics.
PLMC:
[5] Hopf, T.A., Ingraham, J.B., Poelwijk, F.J., Schärfe, C.P., Springer, M., Sander, C. and Marks, D.S., 2017. Mutation effects predicted from sequence co-variation. Nature biotechnology, 35(2), pp.128-135.
GREMLIN:
[6] Kamisetty, H., Ovchinnikov, S. and Baker, D., 2013. Assessing the utility of coevolution-based residue–residue contact predictions in a sequence-and structure-rich era. Proceedings of the National Academy of Sciences, 110(39), pp.15674-15679.
mfDCA and plmDCA:
[7] Zerihun, M.B., Pucci, F., Peter, E.K. and Schug, A. pydca v1.0: a comprehensive software for direct coupling analysis of RNA and protein sequences. Bioinformatics, btz892, doi:10.1093/bioinformatics/btz892
[8] Morcos, F., Pagnani, A., Lunt, B., Bertolino, A., Marks, D.S., Sander, C., Zecchina, R., Onuchic, J.N., Hwa, T. and Weigt, M., 2011. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences, 108(49), pp.E1293-E1301, doi:10.1073/pnas.1111471108
[9] Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M. and Aurell, E., 2013. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Physical Review E, 87(1), p.012707, doi:10.1103/PhysRevE.87.012707
SeqKit:
[10] Shen, W., Le, S., Li, Y. and Hu, F., 2016. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PloS one, 11(10), p.e0163962.
If you use the RNAcmap2 datasets, please consider citing the following papers:
Protein Data Bank (PDB):
[11] Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N. and Bourne, P.E., 2000. The protein data bank. Nucleic acids research, 28(1), pp.235-242.
CD-HIT-EST:
[12] Fu, L., Niu, B., Zhu, Z., Wu, S. and Li, W., 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), pp.3150-3152.
Mozilla Public License 2.0
jaswinder.singh3@griffithuni.edu.au, yaoqi.zhou@griffith.edu.au