rna-llm-folding


Comprehensive benchmarking of Large Language Models for RNA secondary structure prediction

This repository contains the data and source code used in "Comprehensive benchmarking of Large Language Models for RNA secondary structure prediction," by L.I. Zablocki, L.A. Bugnon, M. Gerard, L. Di Persia, G. Stegmayer and D.H. Milone (under review), Research Institute for Signals, Systems and Computational Intelligence, sinc(i). See the preprint for details.

In the last three years, a number of RNA large language models (RNA-LLM) have appeared in the literature. We selected the models for benchmarking based on their open-access availability; their main features are summarized in the table below.

| LLM | Dim | Pre-training seqs | Pre-training databases | Architecture (number of layers) | Number of parameters | Source |
|---|---|---|---|---|---|---|
| RNABERT 2022 [1] | 120 | 70 k | RNAcentral | Transformer (6) | 500 k | Link |
| RNA-FM 2022 [2] | 640 | 23 M | RNAcentral | Transformer (12) | 100 M | Link |
| RNA-MSM 2024 [3] | 768 | 3 M | Rfam | Transformer (12) | 96 M | Link |
| ERNIE-RNA 2024 [4] | 768 | 20 M | RNAcentral | Transformer (12) | 86 M | Link |
| RNAErnie 2024 [5] | 768 | 23 M | RNAcentral | Transformer (12) | 105 M | Link |
| RiNALMo 2024 [6] | 1280 | 36 M | RNAcentral + Rfam + Ensembl | Transformer (33) | 650 M | Link |

Installation

These steps will guide you through training the RNA secondary structure predictor based on the RNA-LLM representations.

First, clone the repository:

git clone https://github.com/sinc-lab/rna-llm-folding
cd rna-llm-folding

With a working conda installation, run:

conda env create -f environment.yml

This should install all required dependencies. Then, activate the environment with:

conda activate rna-llm-folding
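
As a quick sanity check, the short Python snippet below reports whether a GPU is visible from the new environment (a minimal sketch, assuming the environment created from environment.yml includes PyTorch, which the training scripts rely on for GPU execution; the file name check_env.py is hypothetical):

# check_env.py -- hypothetical helper, not part of the repository
import torch  # assumed to be provided by the conda environment

# Report whether training will be able to run on a GPU.
if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected; training will fall back to CPU.")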

Reproduction of experiments

Scripts to train and evaluate an RNA-LLM for RNA secondary structure prediction are in the scripts folder.

For example, to use the one-hot embedding for the ArchiveII dataset, run:

python scripts/run_archiveII_famfold.py --emb one-hot_ArchiveII

The --emb option tells the script which LLM and dataset combination to use for training and testing. In this example, we used the one-hot embedding for ArchiveII, which is already available in data/embeddings. By default, training runs on GPU if one is available. Results are saved in results/<timestamp>/<dataset>/<llm>.
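
If you want to train with several LLM embeddings in one go, a small driver script like the one below can loop over them (a sketch only; run_sweep.py is hypothetical, and the <llm>_<dataset> names must match files you actually have under data/embeddings, following the one-hot_ArchiveII naming of the example above):

# run_sweep.py -- hypothetical convenience wrapper, not part of the repository
import subprocess

# Each entry is an <llm>_<dataset> combination expected under data/embeddings.
embeddings = ["one-hot_ArchiveII", "RNA-FM_ArchiveII", "RiNALMo_ArchiveII"]

for emb in embeddings:
    # Equivalent to: python scripts/run_archiveII_famfold.py --emb <emb>
    subprocess.run(
        ["python", "scripts/run_archiveII_famfold.py", "--emb", emb],
        check=True,
    )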

To run the experiments with other datasets, use scripts/run_bpRNA.py, scripts/run_bpRNA_new.py, scripts/run_pdb-rna.py and scripts/run_archiveII_kfold.py, which are invoked in the same way as the example above.

To use other embeddings and datasets, download the RNA-LLM embedding representations for the desired LLM-dataset combination from the following table, and save them in the data/embeddings directory.

| ArchiveII | bpRNA & bpRNA-new | PDB-RNA |
|---|---|---|
| one-hot | one-hot | one-hot |
| RNABERT | RNABERT | RNABERT |
| RNA-FM | RNA-FM | RNA-FM |
| RNA-MSM | RNA-MSM | RNA-MSM |
| ERNIE-RNA | ERNIE-RNA | ERNIE-RNA |
| RNAErnie | RNAErnie | RNAErnie |
| RiNALMo | RiNALMo | RiNALMo |

Note: Instructions to generate the RNA-LLM embeddings listed above are detailed in scripts/embeddings.
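
Before launching training, you can check that the downloaded embeddings actually landed in data/embeddings with a snippet like the following (hypothetical helper; the <llm>_<dataset> naming is inferred from the one-hot_ArchiveII example, and no file extension is assumed):

# check_embeddings.py -- hypothetical helper, not part of the repository
from pathlib import Path

emb_dir = Path("data/embeddings")
expected = ["RNA-FM_ArchiveII", "RiNALMo_ArchiveII"]  # adjust to what you downloaded

for name in expected:
    # Match any file whose name starts with the LLM-dataset combination.
    matches = list(emb_dir.glob(f"{name}*")) if emb_dir.exists() else []
    print(f"{name}: {'found' if matches else 'MISSING'}")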

Comparison results

  • Projection of RNA-LLM embeddings: The UMAP notebook uses UMAP to project the high-dimensional embeddings into a 3D space for visualization (see the sketch after this list).

  • Performance on datasets of increasing homology challenge: The violinplots notebook generates the performance plots for each RNA-LLM on the different datasets.

  • Cross-family benchmarks: We used the boxplots notebook to assess inter-family performance.

  • Non-canonical base pairs and structural motifs: This notebook compares performance on non-canonical base pairs and characterizes performance by structural motifs.
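
For reference, the kind of projection used in the UMAP notebook can be sketched in a few lines with umap-learn (illustrative only, not the notebook's actual code; the random array stands in for per-sequence RNA-LLM embeddings, e.g. 640-dimensional RNA-FM vectors, and the UMAP parameters may differ from those in the notebook):

# umap_projection.py -- illustrative sketch, not the notebook's code
import numpy as np
import umap  # from the umap-learn package

# Stand-in for per-sequence RNA-LLM embeddings (e.g. 640-dim RNA-FM vectors).
embeddings = np.random.rand(500, 640)

# Project the high-dimensional embeddings into a 3D space for visualization.
reducer = umap.UMAP(n_components=3, random_state=42)
embeddings_3d = reducer.fit_transform(embeddings)
print(embeddings_3d.shape)  # (500, 3)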