License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause)

DNALongBench: A Benchmark Suite for Long-Range DNA Prediction Tasks

Introduction

DNALongBench is a benchmark of realistic, biologically meaningful genomic DNA prediction tasks that require long-range sequence input and involve long-range dependencies. It comprises five tasks.

Data Download

| LR Tasks | LR Type | Input Length (bp) | Output Shape | # Samples | Metric |
| --- | --- | --- | --- | --- | --- |
| Enhancer-target Gene | Binary Classification | 450,000 | 1 | 2,602 | AUROC |
| eQTL | Binary Classification | 450,000 | 1 | 31,282 | AUROC |
| Contact Map | Binned (2048bp) 2D Regression | 1,048,576 | 99,681 | 7,840 | SCC & PCC |
| Regulatory Sequence Activity | Binned (128bp) 1D Regression | 196,608 | Human: (896, 5,313); Mouse: (896, 1,643) | Human: 38,171; Mouse: 33,521 | PCC |
| Transcription Initiation Signal | Nucleotide-wise 1D Regression | 100,000 | (100,000, 10) | 100,000* | PCC |
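As a rough illustration of two of the metrics listed in the table, the sketch below computes AUROC (for the binary classification tasks) and PCC (for the regression tasks) in plain Python. The function names are hypothetical and this is not the benchmark's own evaluation code; SCC is not covered here.

```python
import math


def pearson_corr(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant input
    return cov / math.sqrt(var_x * var_y)


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is quadratic in
    the number of samples, so it is only meant for small examples.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For real evaluations, an established implementation such as those in scikit-learn or SciPy is likely the better choice.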

The data for each task can be downloaded via the links below. Alternatively, you can download the data from Box.

Regulatory Sequence Activity Prediction

Data Link

The data can be downloaded at Regulatory Sequence Activity Prediction.

Data Details

We provide the sequences.bed, statistics.json, hg38.ml.fa.fai, and hg38.ml.fa.gz files, and have converted the data into train*/valid*/test*.tfr files. The only files you need are the corresponding .tfr files.
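To check that a download is complete, the .tfr files can be grouped by split using the train*/valid*/test* naming pattern mentioned above. This is a minimal sketch: the helper name is hypothetical, and searching subdirectories recursively is an assumption about how the download is laid out.

```python
from pathlib import Path


def list_tfr_files(data_dir):
    """Group .tfr files under data_dir by split (train, valid, test)."""
    data_dir = Path(data_dir)
    return {
        # rglob searches data_dir and all of its subdirectories
        split: sorted(data_dir.rglob(f"{split}*.tfr"))
        for split in ("train", "valid", "test")
    }
```

Calling `list_tfr_files("/path/to/download")` (a placeholder path) returns a dictionary with one sorted list of paths per split, so an empty list signals a missing split.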

Transcription Initiation Signal Prediction

Data Link

The data can be downloaded at Transcription Initiation Signal Prediction.

Data Details

We provide all the corresponding bed files.

Enhancer-Target Gene Prediction

Data Link

The data can be downloaded at Enhancer-Target Gene Prediction.

Data Details

The sequences, fa and metrics data are provided.

Contact Map Prediction

Data Link

The data can be downloaded at Contact Map Prediction.

Data Details

We provide the well-split train/valid/test files.

eQTL Prediction

Data Link

The data can be downloaded at eQTL.

Data Details

The corresponding bed files are provided.

Experiments

We report the performance of three types of models: the Expert Model, a lightweight CNN baseline, and fine-tuned DNA foundation models (HyenaDNA, Caduceus-Ph, and Caduceus-PS). Below, we describe how to run these models, using the Enhancer-Target Gene Prediction (ETGP) task as an example.

| Model | Expert Model | CNN | HyenaDNA | Caduceus-Ph | Caduceus-PS |
| --- | --- | --- | --- | --- | --- |
| ETGP | 0.926 | 0.797 | 0.828 | 0.826 | 0.821 |

Use the commands below to create the environment, download our code, and install it:

conda create -n dnalongbench python=3.9 -y
conda activate dnalongbench

git clone https://github.com/wenduocheng/DNALongBench.git
cd DNALongBench
pip install .

Use the following Python code to load data for a specific task:

import dnalongbench
from dnalongbench.utils import load_data

root = '/path/to/data'  # placeholder: directory containing the downloaded data
train_loader, valid_loader, test_loader = load_data(root=root, task_name='contact_map_prediction', subset='HFF', batch_size=16)

We also provide data loaders for each task in scripts/data_loaders.ipynb.
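Before training, it can help to look at what a single batch contains. The sketch below pulls only the first batch from a loader and reports the shape of each element. Both helper names are hypothetical, and the batch layout (tuple, list, dict, or a single object) is an assumption, since it depends on the task.

```python
def describe_batch(batch):
    """Return the shape (or the type name, if there is no shape) of each batch element."""
    if isinstance(batch, dict):
        items = list(batch.values())
    elif isinstance(batch, (tuple, list)):
        items = list(batch)
    else:
        items = [batch]
    summary = []
    for item in items:
        shape = getattr(item, "shape", None)
        summary.append(tuple(shape) if shape is not None else type(item).__name__)
    return summary


def peek(loader):
    """Fetch only the first batch from an iterable loader and summarize it."""
    first_batch = next(iter(loader))
    return describe_batch(first_batch)
```

With the loaders returned by `load_data`, a call such as `peek(train_loader)` would list one entry per element of the first training batch.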

CNN

Please refer to experiments/CNN/README.md.

HyenaDNA

Environment Setup

We used the official HyenaDNA code. The environment setup instructions can be found at HyenaDNA Environment Setup.

Take care if you want to use flash attention, as installing it sometimes fails. We recommend first setting up the environment, then activating it, and only then installing flash attention inside it.
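One way to confirm that the installation landed in the active environment is to check whether the relevant modules are importable. This is a minimal sketch: the helper name is hypothetical, and it assumes the flash attention package is importable as `flash_attn` and that PyTorch (`torch`) is part of the environment.

```python
import importlib.util


def check_modules(module_names):
    """Map each top-level module name to True if it is importable here."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }
```

Running `check_modules(["torch", "flash_attn"])` inside the activated environment shows which of the two is still missing, without actually importing either package.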

Training & Inference

For the RSAP (Regulatory Sequence Activity Prediction) and TISP (Transcription Initiation Signal Prediction) tasks, please refer to experiments/HyenaDNA/HyenaDNA_RSAP_TISP/README.md.

For the CMP (Contact Map Prediction), eQTLP (eQTL Prediction), and ETGP tasks, please refer to experiments/HyenaDNA/HyenaDNA_ETGP_CMP_eQTLP/README.md.

Caduceus

Environment Setup

We used the official environment provided by Caduceus.

To get started, create a conda environment containing the required dependencies.

cd experiments/Caduceus/Caduceus_CMP_eQTLP_ETGP

conda env create -f caduceus_env.yml

Activate the environment.

conda activate caduceus_env

Training & Inference

For RSAP and TISP tasks, please refer to experiments/Caduceus_RSAP_TISP/README.md

For CMP, eQTLP, and ETGP tasks, please refer to experiments/Caduceus/Caduceus_CMP_eQTLP_ETGP/README.md.

GENERator

Environment Setup

We used the official environment setup of GENERator: https://github.com/GenerTeam/GENERator.git.

git clone https://github.com/GenerTeam/GENERator.git
cd GENERator
pip install -r requirements.txt

Training & Inference

Please refer to experiments/GENERator/README.md.

Evo2

Environment Setup

We used the official environment setup of Evo2.

Evo 2 is based on StripedHyena 2 which requires python>=3.11. Evo 2 uses Transformer Engine FP8 for some layers which requires an H100 (or other GPU with compute capability ≥8.9). We are actively investigating ways to avoid this requirement.
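The two requirements above (python>=3.11 and a GPU with compute capability ≥8.9) can be checked before installing. The sketch below is an illustration only: the function names are hypothetical, and GPU detection assumes PyTorch is already installed, returning None otherwise.

```python
import sys

MIN_PYTHON = (3, 11)
MIN_COMPUTE_CAPABILITY = (8, 9)


def meets_requirements(python_version, compute_capability):
    """Check a Python version and a GPU compute capability against the minimums."""
    py_ok = tuple(python_version[:2]) >= MIN_PYTHON
    gpu_ok = (
        compute_capability is not None
        and tuple(compute_capability) >= MIN_COMPUTE_CAPABILITY
    )
    return py_ok and gpu_ok


def detect_compute_capability():
    """Return (major, minor) for the first CUDA GPU, or None if unavailable."""
    try:
        import torch  # assumption: PyTorch is installed in this environment
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)
```

A call such as `meets_requirements(sys.version_info, detect_compute_capability())` then returns True only when both conditions hold on the current machine.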

To install Evo 2 for inference or generation, please clone and install from GitHub. We recommend using a new conda environment with python>=3.11.

git clone --recurse-submodules git@github.com:ArcInstitute/evo2.git
cd evo2
pip install .

Training & Inference

Please refer to experiments/Evo2/README.md.

Citation

If you find our work helpful, please consider citing our paper:

@inproceedings{chengdna,
  title={DNALongBench: A Benchmark Suite for Long-Range DNA Prediction Tasks},
  author={Cheng, Wenduo and Song, Zhenqiao and Zhang, Yang and Wang, Shike and Wang, Danqing and Yang, Muyu and Li, Lei and Ma, Jian}
}