StrusiNet2 is a tool designed to generate embeddings from RNA secondary structures using a pre-trained Siamese Neural Network model. This project takes RNA sequences with their secondary structure in dot-bracket notation, processes them into contact matrices, and then feeds them into a neural network to obtain meaningful embeddings. These embeddings can be used for downstream tasks such as clustering, classification, or other forms of analysis.
The repository contains the following key components:
src/model/siamese_model.py
: Contains the Siamese neural network definition.src/model/utils.py
: Utility functions for processing RNA data.strusinet.py
: Main script for generating embeddings from RNA secondary structures.tsne_embedding_tool.py
: A script to visualize generated embeddings using t-SNE.tests/test_model.py
andtests/test_strusinet.py
: Unit tests for ensuring the model and the whole embedding generation process work correctly.
To run the StrusiNet2 project, you will need Python and the required dependencies installed.
git clone https://github.com/nicoaira/StrusiNet2.git
cd StrusiNet2
It is recommended to use a conda
environment to manage dependencies and ensure compatibility. Follow these steps to create a new conda
environment with Python 3.12.7 and install the necessary dependencies.
conda create --name strusi_env python=3.12.7
conda activate strusi_env
With the environment activated, install all necessary dependencies using:
pip install -r requirements.txt
Dependencies include:
torch==2.4.1
torchvision==0.19.1
pandas==2.2.3
numpy==2.1.2
scikit-learn==1.5.2
argparse==1.4.0
tqdm==4.66.5
Ensure that the versions match those in the requirements.txt
The pre-trained model file (ResNet-Secondary.pth
) is not included in this repository due to its size. Please download it using the command below:
# Download the model from Google Drive and save it in the 'saved_model' directory
mkdir -p saved_model
wget -O saved_model/ResNet-Secondary.pth "https://drive.google.com/uc?export=download&id=1ltrAQ2OfmvrRx8cKxeNKK_oebwVRClEW"
StrusiNet2 can generate embeddings from RNA sequences stored in a CSV file.
The input file should be a CSV containing at least one column with the RNA secondary structure in dot-bracket notation.
To generate embeddings from an RNA dataset:
python strusinet.py --input example_data/sample_dataset.csv --output example_data/sample_dataset_with_embeddings.tsv
Arguments:
--input
: Path to the input CSV/TSV file containing RNA secondary structures.--output
: Path to save the output TSV file with embeddings.--structure_column_name
: The column name containing RNA secondary structures (default: 'secondary_structure').--structure_column_num
: (Optional) Column number of RNA secondary structures (0-indexed). If both column name and number are provided, column number will be ignored.--model_path
: Path to the trained model file (default:saved_model/ResNet-Secondary.pth
).--device
: Device to run the model on (cpu
orcuda
, default:cpu
).--header
: Specify whether the input CSV has a header row (True
orFalse
, default:True
).
If your CSV doesn't have a header and the secondary structure is in the 6th column:
python strusinet.py --input example_data/sample_dataset.csv --output example_data/sample_dataset_with_embeddings.tsv --structure_column_num 6 --header False --device cuda
After generating the RNA embeddings, you can use the compute_tsne.py
script to visualize the embeddings using t-SNE.
python compute_tsne.py --input example_data/sample_dataset_with_embeddings.tsv --output example_data/sample_dataset_with_tsne.tsv --embedding_column_name embedding_vector --n_components 3
Arguments:
--input
: Path to the input TSV file containing embeddings (e.g.,example_data/sample_dataset_with_embeddings.tsv
).--output
: Path to save the output TSV file containing the t-SNE-transformed embeddings.--embedding_column_name
: The name of the column containing the embedding vectors.--n_components
: Number of components for t-SNE (e.g., 2 for 2D visualization, 3 for 3D visualization).
- The script reads the specified
embedding_column_name
from the input file and performs t-SNE transformation with the specified number of components. - The resulting transformed embeddings are saved to the output file, which can then be used for downstream visualization and analysis.
We have also developed an application using Dash to visualize the t-SNE embeddings interactively. You can find the app in the following repository: https://github.com/nicoaira/embeddings_app.
You can run the tests using:
python -m unittest discover tests
This will run both the unit tests for the model and the integration tests for the embedding generation pipeline.
- Ensure you have the correct PyTorch version installed that supports your GPU if you're using CUDA.
- If you encounter any issues with the pre-trained model, please make sure to check the Google Drive link and download it correctly.