RNAGEN

A generative adversarial network-based model to generate synthetic RNA sequences to target proteins

RNAGEN is a is a deep learning based model for novel piRNA generation and optimziation. We use the WGAN-GP architecture for the generative model, and the DeepBind models for optimizing binding of the generated piRNA sequences to the target protein. To find the closest relatives of a target protein to be used in optimization, we use the Prot2Vec model.

Deep Learning, WGAN-GP, DeepBind, Prot2Vec

Diagram of the generative model and the optimization procedure

Authors

Furkan Ozden, Sina Barazandeh, Dogus Akboga, Urartu Ozgur Safak Seker, A. Ercument Cicek

Questions & comments

[secondauthorname].[secondauthorsurname]@bilkent.edu.tr

[firstcorrespondingauthorfirstname]@bilkent.edu.tr

[secondcorrespondingauthorsurname]@cs.bilkent.edu.tr

Warning: Please note that the RNAGEN model is completely free for academic usage. However it is licenced for commercial usage. Please first refer to the License section for more info.

Installation
Features
File Description
Instructions Manual
Usage Examples
Citations
License

Installation

RNAGEN is easy to use and does not require installation. The scripts can be used if the requirements are installed.

Note: The implementation is using Tensorflow 1.15, but the provided code uses Tensorflow 2 for easier installation and use. Tensorflow 1 is available in Tensorflow 2 using the tf.compat.v1 module.

Requirements

For easy requirement handling, you can use RNAGEN.yml files to initialize conda environment with requirements installed:

$ conda env create --name rnagen -f RNAGEN.yml
$ conda activate rnagen

Note that the provided environment yml file is for Linux systems. For MacOS users, the corresponding versions of the packages might need to be changed.

Features

RNAGEN components are trained using GPUs and GPUs are used for the project. However, depending on the type of Tensorflow Tensorflow the model can run on both GPU and CPU. The run time on CPU is considerably longer compared to GPU.

File-Folder Description

./analysis/binding_plot.py : plots the binding score plots on the manuscript
./analysis/gan_validation.py : plots gan validation plots on the manuscript
./analysis/generate.py : generates the ./analysis/generated.txt sequences using the trained model
./analysis/gen_ham_dist.npy : saved array of distances between the generated and the natural set of sequences (regenerated if removed)
./analysis/rand_ham_dist.npy : saved array of distances between the random and the natural set of sequences (regenerated if removed)
./analysis/real_ham_dist.npy : saved array of non-zero distances between the set of natural sequences and itself (regenerated if removed)
./analysis/generated.txt : generated using the trained generator
./analysis/piRNAs.fa : set of natural sequences extracted from ./data/DASHR2_GEO_hg38_sequenceTable_export.csv
./data/model/ : the trained generator model
./data/DASHR2_GEO_hg38_sequenceTable_export.csv : the original dataset used for training and validation
./data/protVec_100d_3grams.csv : taken from Prot2Vec, used for calculating distances between proteins
./deepbind_models : taken directly from the deepbind directory
./figures/ : directory for the generated figures
./input/opt_input.csv : input sample for the ./optimize.py script
./input/proteins.csv : input sample for the ./distance_proteins.py script
./lib/ : utility files for the implementation
./output/ : output of the ./optimize.py are saved here (SOX2, SOX4 already exist).
./distance_proteins.py : calculates distances between sequences for the input csv file
./optimize.py : optimizes generated piRNAs based on the given input csv file
./train_rnagen.py : trains the RNAGEN generator using the natural sequences

Instructions Manual

Important notice: Please call the train_rnagen.py script from the root directory. The optimization can be performed using the optimize.py script. To analyze the generated seqeunces use the ./analysis/gan_validation.py script. This script will generate plots that are used in the manuscript to validate the GAN model.

To plot the optimization and initial binding score plots, you need to first run the optimization, and then use the ./analysis/binding_plot.py by giving the required arguments. These arguments will be used to find the optimization output directory for plotting.

Train teh GAN modes:

$ python ./train_rnagen.py

Optional Arguments

-mil, --min_length

Minimum length of the piRNAs used for training the model. Default: 26.

-mxl, --max_length

Maximum length of the piRNAs used for training the model. Default: 32.

-d, --dataset

The path to file including the piRNA samples. The default path used is './data/DASHR2_GEO_hg38_sequenceTable_export.csv'.

-lr, --learning_rate

The learning rate of the Adam optimizer used to optimize the model parameters. The default value is 1e-5. If 4 is provided, the learning rate will be 1e-4.

-bs, --batch_size

The batch size used to train the model, the default value is 64

-dim, --dimension

The dimension of the input noise, with the default value of 40.

-g, --gpu

The ID of the GPU that will be used for training. The default setting is used otherwise.

Output

The output of the training will be in './logs/gan_test/{timestamp}/'.

Find the distances of a list of proteins to the target protein using Prot2Vec:

$ python ./distance_proteins.py

Optional Arguments

-i

The path to a text file including protein names and sequences. The default path is './input/proteins.csv' and by editing that file, the script will use those sequences and names.

The file includes at least 2 proteins. The first one is the target protein, ones are the protines for which we want to calculate the distance.

Output

The output will be shown on the terminal. It will include the protein names and their distances to the target protein in the Prot2Vec space.

Optimize for a target Protein using 3 Relative Proteins:

$ python ./optimize.py

Optional Arguments

-i

The path to a text file including paths of the deepbind models of the relative proteins. The default path is './input/opt_input.csv' and by editing that file, the optimization will use those files and proteins.

The file includes 5 proteins. The first one is the target protein, the next three are the chosen relative proteins for optimization, and the last one is optional test protein in case you want to test the results on a different protein (disabled for now).

-t

The path to the tarined generator model. Default value is './data/model/trained_gan.ckpt.meta'.

-lr

The learning rate of the Adam optimizer used to optimize the model parameters. The default value is 3e-5.

-n

The number of iterations the optimization is performed. The default value is 3,000 iterations.

-gpu

The id of the gpu used for optimization. The default value will be used otherwise.

Output

The optimiztation outputs will be put in './output/'. The file names are descriptive.

Analysis Files:

Generate sequences using the trained generator

$ python ./analysis/generate.py

It will output the './analysis/generated.txt' file.

Analyze the GAN generated sequences

$ python ./analysis/gan_validation.py

It will output plots in the './figures/' folder

Binding score plots (Reproduce results in the manuscript)

$ python ./analysis/binding_plots.py

Optional Arguments

-p1

The name of the first protein (SOX4 as default).

-p2

The name of the first protein (SOX2 as default).

-n1

The number of optimization iterations for the first protein (3000 as default).

-n2

The number of optimization iterations for the second protein (3000 as default).

Output

It will output plots in the './figures/' folder

Usage Examples

Usage of RNAGEN is very simple. You need to install conda to install the specific environment and run the scripts.

Step-0: Install conda package management

This project uses conda package management software to create virtual environment and facilitate reproducability.
For Linux users:
Please take a look at the Anaconda repo archive page, and select an appropriate version that you'd like to install.
Replace this Anaconda3-version.num-Linux-x86_64.sh with your choice

$ wget -c https://repo.continuum.io/archive/Anaconda3-vers.num-Linux-x86_64.sh
$ bash Anaconda3-version.num-Linux-x86_64.sh

Step-1: Set Up your environment.

It is important to set up the conda environment which includes the necessary dependencies.
Please run the following lines to create and activate the environment:

$ conda env create --name rnagen -f RNAGEN.yml
$ conda activate rnagen

Citations

The study is available as a preprint on biorixv. https://www.biorxiv.org/content/10.1101/2023.07.11.548246v1

License

CC BY-NC-SA 2.0
For commercial usage, please contact.

ciceklab/RNAGEN

RNAGEN

A generative adversarial network-based model to generate synthetic RNA sequences to target proteins

Authors

Questions & comments

Table of Contents

Installation

Requirements

Note that the provided environment yml file is for Linux systems. For MacOS users, the corresponding versions of the packages might need to be changed.

Features

File-Folder Description

Instructions Manual

Train teh GAN modes:

Optional Arguments

-mil, --min_length

-mxl, --max_length

-d, --dataset

-lr, --learning_rate

-bs, --batch_size

-dim, --dimension

-g, --gpu

Output

Find the distances of a list of proteins to the target protein using Prot2Vec:

Optional Arguments

-i

Output

Optimize for a target Protein using 3 Relative Proteins:

Optional Arguments

-i

-t

-lr

-n

-gpu

Output

Analysis Files:

Generate sequences using the trained generator

Analyze the GAN generated sequences

Binding score plots (Reproduce results in the manuscript)

Optional Arguments

-p1

-p2

-n1

-n2

Output

Usage Examples

Step-0: Install conda package management

Step-1: Set Up your environment.

Citations

License