Update March 2024:
- Tutorials for RNA family clustering and RNA type classification & Tutorial video (in Chinese).
- mRNA-FM, a foundation model pre-trained on coding sequences (CDS) in mRNA, is now released! The model takes CDSs as input and represents them with contextual embeddings, benefiting mRNA- and protein-related tasks.
This repository contains code and pre-trained models for the RNA foundation model (RNA-FM). RNA-FM outperforms all tested single-sequence RNA language models across a variety of structure prediction tasks as well as several function-related tasks. You can find more details about RNA-FM in our paper, "Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions" (Chen et al., 2022).
Citation
@article{chen2022interpretable,
title={Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions},
author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},
journal={arXiv preprint arXiv:2204.00300},
year={2022}
}
First, download the repository and create the environment.
git clone https://github.com/ml4bio/RNA-FM.git
cd ./RNA-FM
conda env create -f environment.yml
Then, activate the "RNA-FM" environment and enter the workspace.
conda activate RNA-FM
cd ./redevelop
Download the pre-trained models from this gdrive link and place the .pth files in the pretrained folder.
python launch/predict.py --config="pretrained/extract_embedding.yml" \
    --data_path="./data/examples/example.fasta" --save_dir="./results" \
    --save_frequency 1 --save_embeddings
RNA-FM embeddings with shape (L, 640) will be saved in $save_dir/representations.
For mRNA-FM, pass an extra argument, MODEL.BACKBONE_NAME:
python launch/predict.py --config="pretrained/extract_embedding.yml" \
    --data_path="./data/examples/example.fasta" --save_dir="./results" \
    --save_frequency 1 --save_embeddings --save_embeddings_format raw MODEL.BACKBONE_NAME mrna-fm
python launch/predict.py --config="pretrained/ss_prediction.yml" \
    --data_path="./data/examples/example.fasta" --save_dir="./results" \
    --save_frequency 1
The predicted probability maps will be saved as .npy files, and the post-processed binary predictions will be saved as .ct files. You can find them in $save_dir/r-ss.
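A .ct file can be turned into a dot-bracket string for quick inspection. The sketch below assumes the files follow the common connectivity-table layout (a header line starting with the sequence length, then one line per nucleotide whose fifth column is the index of the pairing partner, or 0 if unpaired); this layout has not been checked against the files RNA-FM writes. The function name `ct_to_dot_bracket` and the example record are hypothetical, and pseudoknots are not distinguished from nested pairs.

```python
def ct_to_dot_bracket(ct_text):
    """Convert the text of a .ct file into (sequence, dot-bracket string).

    Assumes the common connectivity-table layout; pseudoknotted pairs
    would be written with plain parentheses as well.
    """
    lines = [ln for ln in ct_text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])  # header: sequence length, then a title
    seq, structure = [], []
    for ln in lines[1:n + 1]:
        fields = ln.split()
        idx, base, partner = int(fields[0]), fields[1], int(fields[4])
        seq.append(base)
        if partner == 0:
            structure.append(".")   # unpaired position
        elif partner > idx:
            structure.append("(")   # pair opens here
        else:
            structure.append(")")   # pair closes here
    return "".join(seq), "".join(structure)


# Hypothetical 8-nt hairpin, written by hand for illustration
example_ct = """8 demo_hairpin
1 G 0 2 8 1
2 C 1 3 7 2
3 G 2 4 0 3
4 A 3 5 0 4
5 A 4 6 0 5
6 A 5 7 0 6
7 G 6 8 2 7
8 C 7 0 1 8
"""
sequence, dot_bracket = ct_to_dot_bracket(example_ct)
print(sequence)     # GCGAAAGC
print(dot_bracket)  # ((....))
```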
If you have any trouble deploying RNA-FM locally, you can use its online version, the RNA-FM server. You can submit jobs on the server and download the results afterwards, without setting up an environment or using your own computational resources.
Python 3.8 (or possibly a higher version) and PyTorch are prerequisites that must be installed to use this repository.
If you just want to use the pre-trained language model, you can install rna-fm in your own environment with pip.
You can either install rna-fm from PyPI:
pip install rna-fm
or install rna-fm from GitHub:
cd ./RNA-FM
pip install .
After installation, you can load the RNA-FM and extract its embeddings with the following code:
import torch
import fm

# Load RNA-FM model
model, alphabet = fm.pretrained.rna_fm_t12()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disables dropout for deterministic results

# Prepare data
data = [
    ("RNA1", "GGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCU"),
    ("RNA2", "GGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
    ("RNA3", "CGAUUCNCGUUCCC--CCGCCUCCA"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Extract embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[12])
token_embeddings = results["representations"][12]
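Many downstream tasks need one vector per sequence rather than one per token. A minimal sketch of mean pooling follows. It assumes the batch converter prepends one special token to every sequence and pads shorter sequences at the end, as in the esm framework this code builds on; that assumption has not been verified for RNA-FM, so the `offset` parameter is left adjustable. The function name `mean_pool` is hypothetical, and a small stand-in tensor replaces the real `token_embeddings` so the block runs without the model.

```python
import torch


def mean_pool(token_embeddings, seq_lengths, offset=1):
    """Average per-token embeddings into one vector per sequence.

    token_embeddings: tensor of shape (batch, tokens, dim)
    seq_lengths: number of real tokens in each sequence
    offset: number of special tokens assumed to precede each sequence
    """
    pooled = []
    for i, length in enumerate(seq_lengths):
        # Keep only the real tokens: skip the assumed leading special
        # token(s) and everything after the sequence (end token, padding).
        pooled.append(token_embeddings[i, offset:offset + length].mean(dim=0))
    return torch.stack(pooled)


# Stand-in tensor laid out like results["representations"][12]
fake_embeddings = torch.arange(2 * 6 * 4, dtype=torch.float32).reshape(2, 6, 4)
seq_embeddings = mean_pool(fake_embeddings, seq_lengths=[4, 2])
print(seq_embeddings.shape)  # torch.Size([2, 4])
```

With base-level tokenization, the lengths for the example above could be taken as `[len(seq) for _, seq in data]`.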
More tutorials can be found at https://ml4bio.github.io/RNA-FM/. The related notebooks are stored in the tutorials folder.
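When the sequences live in a FASTA file rather than in a hand-written list, they first need to be converted into the `(name, sequence)` tuples that the batch converter receives in the example above. The reader below is a minimal sketch; `read_fasta` is a hypothetical helper, not part of the `fm` package. Converting T to U is an assumption based on the example sequences using the RNA alphabet, and can be switched off.

```python
def read_fasta(lines, to_rna=True):
    """Parse FASTA-formatted lines into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].strip()
            name = header.split()[0] if header else "unnamed"
            chunks = []
        elif name is not None:
            seq = line.upper()
            if to_rna:
                seq = seq.replace("T", "U")  # assumed: model expects U, not T
            chunks.append(seq)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = """>RNA1 demo record
GGGUGCGAUC
AUACC
>RNA2
ggtacg
""".splitlines()
data = read_fasta(example)
print(data)  # [('RNA1', 'GGGUGCGAUCAUACC'), ('RNA2', 'GGUACG')]
```

For a file on disk, the same function can be called on an open file handle, for example `with open(path) as handle: data = read_fasta(handle)`.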
As for mRNA-FM, the above code needs a slight revision. Note that the length of each input RNA sequence must be a multiple of 3 so that the sequence can be tokenized into a series of codons (3-mers).
import torch
import fm

# Load mRNA-FM model
model, alphabet = fm.pretrained.mrna_fm_t12()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disables dropout for deterministic results

# Prepare data
data = [
    ("CDS1", "AUGGGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCUA"),
    ("CDS2", "AUGGGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
    ("CDS3", "AUGCGAUUCNCGUUCCC--CCGCCUCC"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

# Extract embeddings (on CPU)
with torch.no_grad():
    results = model(batch_tokens, repr_layers=[12])
token_embeddings = results["representations"][12]
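The multiple-of-3 requirement can be checked before sequences are handed to the batch converter. The helper below is a minimal sketch that only illustrates the constraint; `split_codons` is a hypothetical name, and this is not the tokenizer that mRNA-FM uses internally.

```python
def split_codons(seq):
    """Split a CDS into 3-mers, raising if the length is not a multiple of 3."""
    if len(seq) % 3 != 0:
        raise ValueError(f"length {len(seq)} is not a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


codons = split_codons("AUGCGAUUCNCGUUCCC--CCGCCUCC")
print(len(codons), codons[:3])  # 9 ['AUG', 'CGA', 'UUC']
```

Running such a check over every entry of `data` up front gives a clear error message for a malformed CDS instead of a failure deeper in the pipeline.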
Shorthand | Code | Subject | Layers | Embed Dim | Max Length | Input | Token | Dataset | Description | Year | Publisher |
---|---|---|---|---|---|---|---|---|---|---|---|
RNA-FM | Yes | ncRNA | 12 | 640 | 1024 | Seq | base | RNAcentral 19 (23 million samples) | The first RNA language model for general purpose | 2022.04 | arxiv/bioRxiv |
RNABERT | Yes | ncRNA | 6 | 120 | 440 | Seq | base | RNAcentral (762370) & Rfam 14.3 dataset (trained with partial MSA) | Specialized in structural alignment and clustering | 2022.02 | NAR Genomics and Bioinformatics |
UNI-RNA | No | RNA | 24 | 1280 | | Seq | base | RNAcentral & nt & GWH (1 billion) | A general model with larger scale and datasets than RNA-FM | 2023.07 | bioRxiv |
RNA-MSM | Yes | ncRNA | 12 | 768 | 1024 | MSA | base | 4069 RNA families from Rfam 14.7 | A model that utilizes evolutionary information from MSA directly | 2023.03 | NAR |
SpliceBERT | Yes | pre-mRNA | 6 | 1024 | 512 | Seq | base | 2 million precursor messenger RNA (pre-mRNA) | Specialized in RNA splicing of pre-mRNA | 2023.05 | bioRxiv |
CodonBERT | No | mRNA CDS | 12 | 768 | 512*2 | Seq | codon (3mer) | 10 million mRNAs from NCBI | Focuses only on the CDS of mRNA, without UTRs | 2023.09 | bioRxiv |
UTR-LM | Yes | mRNA 5'UTR | 6 | 128 | | Seq | base | 700K 5'UTRs from Ensembl & eGFP & mCherry & Cao | Used for 5'UTR and mRNA expression related tasks | 2023.10 | bioRxiv |
3UTRBERT | Yes | mRNA 3'UTR | 12 | 768 | 512 | Seq | k-mer | 20,362 3'UTRs | Used for 3'UTR mediated gene regulation tasks | 2023.09 | bioRxiv |
BigRNA | No | DNA | - | - | - | Seq | - | thousands of genome-matched datasets | tissue-specific RNA expression, splicing, microRNA sites, and RNA binding protein | 2023.09 | bioRxiv |
If you find the models useful in your research, we ask that you cite the relevant paper:
For RNA-FM:
@article{chen2022interpretable,
title={Interpretable rna foundation model from unannotated data for highly accurate rna structure and function predictions},
author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},
journal={arXiv preprint arXiv:2204.00300},
year={2022}
}
The model code builds on the esm sequence modeling framework, and we use the fairseq sequence modeling framework to train our RNA language models. We greatly appreciate these two excellent works!
This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.