Wuming Gong, Byeong-Chan Kim, Juhyun Lee, Il-Youp Kwak (team Unlock_DNA)
The breakthrough high-throughput measurement of the cis-regulatory activity of millions of randomly generated promoters provides an unprecedented opportunity to systematically decode the cis-regulatory logic that determines expression values. In this DREAM challenge, we developed an end-to-end Transformer encoder architecture (Proformer) to predict expression values from DNA sequences. Proformer used a Macaron-like Transformer encoder architecture in which two half-step feed-forward (FFN) layers were placed at the beginning and the end of each encoder block, and a separable 1D convolution layer was inserted after the first FFN layer and in front of the multi-head attention layer. Sliding k-mers from the one-hot encoded sequences were mapped onto a continuous embedding and combined with a learned positional embedding and a strand embedding (forward strand vs. reverse complement strand) to form the sequence input. Proformer used multiple expression heads, each predicting an expression value, and took the mean of the per-head predictions as the final predicted expression value. We empirically found that this design performed significantly better than conventional designs, such as using a global pooling layer as the output layer for the regression task. We believe that Proformer provides a novel method for learning and characterizing how cis-regulatory sequences determine expression values. Proformer (team Unlock_DNA) ranked 3rd in the final standings of the DREAM challenge.
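To make the description above concrete, the block below is a minimal TensorFlow/Keras sketch of these ideas. Layer sizes, normalization placement, and the way the expression heads attach to the encoder output are illustrative assumptions, not the submitted configuration; the actual implementation is in the linked notebooks.

```python
# Minimal sketch of the Proformer building blocks described above.
# All hyperparameters and the pre-norm placement are assumptions.
import tensorflow as tf
from tensorflow.keras import layers


def half_step_ffn(d_model, d_ff):
    # Position-wise feed-forward sub-layer; its output is scaled by 0.5
    # ("half-step") in the encoder block below, as in Macaron-style Transformers.
    return tf.keras.Sequential([
        layers.Dense(d_ff, activation="relu"),
        layers.Dense(d_model),
    ])


class MacaronEncoderBlock(layers.Layer):
    """FFN/2 -> separable 1D conv -> multi-head attention -> FFN/2,
    each sub-layer with a residual connection and layer normalization."""

    def __init__(self, d_model=256, num_heads=8, d_ff=1024, kernel_size=7, **kwargs):
        super().__init__(**kwargs)
        self.ffn1 = half_step_ffn(d_model, d_ff)
        self.conv = layers.SeparableConv1D(d_model, kernel_size, padding="same")
        self.attn = layers.MultiHeadAttention(num_heads=num_heads,
                                              key_dim=d_model // num_heads)
        self.ffn2 = half_step_ffn(d_model, d_ff)
        self.norms = [layers.LayerNormalization() for _ in range(4)]

    def call(self, x, training=False):
        x = x + 0.5 * self.ffn1(self.norms[0](x))    # first half-step FFN
        x = x + self.conv(self.norms[1](x))          # separable 1D convolution
        h = self.norms[2](x)
        x = x + self.attn(h, h, training=training)   # multi-head self-attention
        x = x + 0.5 * self.ffn2(self.norms[3](x))    # second half-step FFN
        return x


def embed_input(kmer_tokens, strand_ids, vocab_size, max_len, d_model=256):
    # Sum of k-mer token embedding, learned positional embedding, and strand
    # embedding (forward vs. reverse complement), forming the encoder input.
    kmer_emb = layers.Embedding(vocab_size, d_model)(kmer_tokens)
    positions = tf.range(tf.shape(kmer_tokens)[1])
    pos_emb = layers.Embedding(max_len, d_model)(positions)
    strand_emb = layers.Embedding(2, d_model)(strand_ids)
    return kmer_emb + pos_emb[tf.newaxis, :, :] + strand_emb[:, tf.newaxis, :]


def multi_head_expression(x, n_heads=8):
    # One scalar prediction per head; here each head reads one of the first
    # n_heads encoder output positions (an assumption), and the final
    # prediction is the mean over heads.
    preds = [layers.Dense(1)(x[:, i, :]) for i in range(n_heads)]
    return tf.reduce_mean(tf.concat(preds, axis=-1), axis=-1)
```

Averaging several independently parameterized heads acts as a lightweight ensemble over the output, which is one plausible reason this design outperforms a single pooled regression head.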
- Model checkpoint
https://s3.msi.umn.edu/gongx030/projects/dream_PGE/notebooks_msi/m20220727e/tf_ckpts.tar
- Notebook for converting training sequences to a tf.data object
https://github.com/gongx030/dream_PGE/blob/main/prepare_tfdatasets.ipynb
- Notebook for model training and prediction
https://github.com/gongx030/dream_PGE/blob/main/mode_training.ipynb
- The Conda environment file:
https://github.com/gongx030/dream_PGE/blob/main/tf26_py37_a100.yml
- The JSON file for prediction:
https://s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/pred.json
- The TSV file for prediction:
https://s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/pred.tsv
- Final report
https://github.com/gongx030/dream_PGE/blob/main/report.pdf
- Set up the hardware and the Conda environment according to the yml file.
- Run the notebook prepare_tfdatasets.ipynb to generate a tf.data file for all training data. The resulting tf.data file can be found at ./s3.msi.umn.edu/gongx030/projects/dream_PGE/training_data/pct_ds=1/.
- Run the notebook mode_training.ipynb to train the model on the training data and make predictions on the testing data (a minimal loading/prediction sketch follows this list). The model was originally trained on a machine with four A100 GPUs with CUDA 11.7.
- The checkpoint should be found at ./s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/tf_ckpts.
- The final output file should be found at ./s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/pred.tsv.
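For orientation, the snippet below is a minimal sketch (not the notebook code) of loading a saved tf.data dataset and restoring the trained checkpoint to run prediction. The local directory names, batch size, and the build_model() helper are hypothetical; the actual pipeline lives in prepare_tfdatasets.ipynb and mode_training.ipynb.

```python
# Minimal sketch, assuming the dataset was written with tf.data.experimental.save;
# paths, batch size, and build_model() below are hypothetical placeholders.
import tensorflow as tf

# TF 2.6 exposes loading as tf.data.experimental.load (newer releases use
# tf.data.Dataset.load); element_spec may be required on older TF versions.
ds = tf.data.experimental.load("training_data/pct_ds=1")
ds = ds.batch(64)

model = build_model()  # hypothetical builder mirroring the architecture above

# Restore the latest weights from the tf_ckpts directory and run prediction.
ckpt = tf.train.Checkpoint(model=model)
ckpt.restore(tf.train.latest_checkpoint("tf_ckpts")).expect_partial()
preds = model.predict(ds)
```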