Wuming Gong, Byeong-Chan Kim, Juhyun Lee, Il-Youp Kwak (team Unlock_DNA)
The breakthrough high-throughput measurement of the cis-regulatory activity of millions of randomly generated promoters provides an unprecedented opportunity to systematically decode the cis-regulatory logic that determines expression values. In this DREAM challenge, we developed an end-to-end Transformer encoder architecture (Proformer) to predict expression values from DNA sequences. Proformer used a Macaron-like Transformer encoder architecture in which two half-step feed-forward (FFN) layers were placed at the beginning and the end of each encoder block, and a separable 1D convolution layer was inserted after the first FFN layer and in front of the multi-head attention layer. Sliding k-mers from the one-hot encoded sequences were mapped onto a continuous embedding and combined with a learned positional embedding and a strand embedding (forward strand vs. reverse complement strand) to form the sequence input. Proformer used multiple expression heads, each predicting an expression value, and took the mean of the per-head predictions as the final predicted expression value. We empirically found that this design performed significantly better than conventional designs, such as using a global pooling layer as the output layer for the regression task. We believe that Proformer provides a novel method for learning and characterizing how cis-regulatory sequences determine expression values. Proformer (team Unlock_DNA) ranked 3rd in the final standings of the DREAM challenge.
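To make the description above concrete, the block below is a minimal TensorFlow/Keras sketch of these ideas. Layer sizes, normalization placement, and the way the expression heads attach to the encoder output are illustrative assumptions, not the submitted configuration; the actual implementation is in the linked notebooks.

```python
# Minimal sketch of the Proformer building blocks described above.
# All hyperparameters and the pre-norm placement are assumptions.
import tensorflow as tf
from tensorflow.keras import layers


def half_step_ffn(d_model, d_ff):
    # Position-wise feed-forward sub-layer; its output is scaled by 0.5
    # ("half-step") in the encoder block below, as in Macaron-style Transformers.
    return tf.keras.Sequential([
        layers.Dense(d_ff, activation="relu"),
        layers.Dense(d_model),
    ])


class MacaronEncoderBlock(layers.Layer):
    """FFN/2 -> separable 1D conv -> multi-head attention -> FFN/2,
    each sub-layer with a residual connection and layer normalization."""

    def __init__(self, d_model=256, num_heads=8, d_ff=1024, kernel_size=7, **kwargs):
        super().__init__(**kwargs)
        self.ffn1 = half_step_ffn(d_model, d_ff)
        self.conv = layers.SeparableConv1D(d_model, kernel_size, padding="same")
        self.attn = layers.MultiHeadAttention(num_heads=num_heads,
                                              key_dim=d_model // num_heads)
        self.ffn2 = half_step_ffn(d_model, d_ff)
        self.norms = [layers.LayerNormalization() for _ in range(4)]

    def call(self, x, training=False):
        x = x + 0.5 * self.ffn1(self.norms[0](x))    # first half-step FFN
        x = x + self.conv(self.norms[1](x))          # separable 1D convolution
        h = self.norms[2](x)
        x = x + self.attn(h, h, training=training)   # multi-head self-attention
        x = x + 0.5 * self.ffn2(self.norms[3](x))    # second half-step FFN
        return x


def embed_input(kmer_tokens, strand_ids, vocab_size, max_len, d_model=256):
    # Sum of k-mer token embedding, learned positional embedding, and strand
    # embedding (forward vs. reverse complement), forming the encoder input.
    kmer_emb = layers.Embedding(vocab_size, d_model)(kmer_tokens)
    positions = tf.range(tf.shape(kmer_tokens)[1])
    pos_emb = layers.Embedding(max_len, d_model)(positions)
    strand_emb = layers.Embedding(2, d_model)(strand_ids)
    return kmer_emb + pos_emb[tf.newaxis, :, :] + strand_emb[:, tf.newaxis, :]


def multi_head_expression(x, n_heads=8):
    # One scalar prediction per head; here each head reads one of the first
    # n_heads encoder output positions (an assumption), and the final
    # prediction is the mean over heads.
    preds = [layers.Dense(1)(x[:, i, :]) for i in range(n_heads)]
    return tf.reduce_mean(tf.concat(preds, axis=-1), axis=-1)
```

Averaging several independently parameterized heads acts as a lightweight ensemble over the output, which is one plausible reason this design outperforms a single pooled regression head.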
- Model checkpoint
https://s3.msi.umn.edu/gongx030/projects/dream_PGE/notebooks_msi/m20220727e/tf_ckpts.tar
- Notebook for converting training sequences to a tf.data object
https://github.com/gongx030/dream_PGE/blob/main/prepare_tfdatasets.ipynb
- Notebook for model training and prediction
https://github.com/gongx030/dream_PGE/blob/main/mode_training.ipynb
- The Conda environment file:
https://github.com/gongx030/dream_PGE/blob/main/tf26_py37_a100.yml
- The JSON file for prediction:
https://s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/pred.json
- The TSV file for prediction:
https://s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/pred.tsv
- Final report
https://github.com/gongx030/dream_PGE/blob/main/report.pdf
- Set up the hardware and the Conda environment according to the yml file.
- Run the notebook prepare_tfdatasets.ipynb to generate a tf.data file for all training data. The resulting tf.data file can be found at ./s3.msi.umn.edu/gongx030/projects/dream_PGE/training_data/pct_ds=1/.
- Run the notebook mode_training.ipynb to train the model on the training data and make predictions on the testing data (a minimal loading/prediction sketch follows this list). The model was originally trained on a machine with four A100 GPUs with CUDA 11.7.
- The checkpoint should be found at ./s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/tf_ckpts.
- The final output file should be found at ./s3.msi.umn.edu/gongx030/projects/dream_PGE/predictions/m20220727e/pred.tsv.
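For orientation, the snippet below is a minimal sketch (not the notebook code) of loading a saved tf.data dataset and restoring the trained checkpoint to run prediction. The local directory names, batch size, and the build_model() helper are hypothetical; the actual pipeline lives in prepare_tfdatasets.ipynb and mode_training.ipynb.

```python
# Minimal sketch, assuming the dataset was written with tf.data.experimental.save;
# paths, batch size, and build_model() below are hypothetical placeholders.
import tensorflow as tf

# TF 2.6 exposes loading as tf.data.experimental.load (newer releases use
# tf.data.Dataset.load); element_spec may be required on older TF versions.
ds = tf.data.experimental.load("training_data/pct_ds=1")
ds = ds.batch(64)

model = build_model()  # hypothetical builder mirroring the architecture above

# Restore the latest weights from the tf_ckpts directory and run prediction.
ckpt = tf.train.Checkpoint(model=model)
ckpt.restore(tf.train.latest_checkpoint("tf_ckpts")).expect_partial()
preds = model.predict(ds)
```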