Parameter-Efficient Fine-Tuning (PEFT) has become the de facto approach to fine-tuning pre-trained foundation models (PFMs) while reducing computational costs. Current PEFT methods include:
- Prefix tuning methods, e.g., Prefix-Tuning: Optimizing Continuous Prompts for Generation and P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
- Prompt Tuning methods, e.g., The Power of Scale for Parameter-Efficient Prompt Tuning
- Low-rank adaptation methods, e.g., LoRA: Low-Rank Adaptation of Large Language Models and AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
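For reference, low-rank adapters of this kind can be attached with the Hugging Face peft library. The snippet below is a minimal, illustrative sketch; the model name, target modules, and hyperparameters are placeholders, not the settings used in this repository.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative only: wrap a small OPT classifier with LoRA adapters.
base_model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/opt-125m", num_labels=2
)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                  # low-rank dimension of the adapter matrices
    lora_alpha=16,                        # scaling factor applied to the low-rank update
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # attention projections that receive adapters
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # only the adapter parameters are trainable
```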
Among these methods, we opt for adaptive rank sampling to address the data heterogeneity issue, and for LINGO (Language prefix fINe-tuning for GenOmes) to leverage the in-context learning ability of LLMs. The framework is as follows:
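As a toy illustration of the language-prefix idea, a DNA sequence can be wrapped in a short natural-language instruction before tokenization. The instruction wording below is hypothetical and is not the prompt template used by LINGO.

```python
# Hypothetical sketch of prepending a language prefix to a DNA sequence.
# The instruction text is illustrative only.
def build_prefixed_input(dna_sequence: str) -> str:
    instruction = "Determine whether the following DNA sequence is a promoter."
    return f"{instruction}\nSequence: {dna_sequence}\nAnswer:"

print(build_prefixed_input("TATAAAAGGCCGTAG"))
```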
The repository is organized as follows:
- dataset/: the directory of datasets. We applied our adaptive rank sampling to a comprehensive set of genome understanding tasks on various LLMs, i.e., promoter detection and epigenetic marks prediction in yeast and in multiple human cell types. The link is here.
- finetune/: fine-tuning LLMs and pre-trained DNA foundation models on single-label and multi-label tasks using DSP with BBPE-tokenized embeddings and one-hot embeddings.
- peftnew/: coupling adaptive rank sampling (RS) with the AdaLoRA method.
- scripts/: SLURM batch script to run the .py files.
- demos/: Some minimal demos to run AdaLoRA + RS with DSP on OPT and 4-bit quantized Llama. See llama_dna_sequential_finetune_QLoRA.ipynb
- In addition, this link contains two fine-tuned checkpoints (see link). Replace "/path/to/your/local/model" with the actual path to the saved model on your local system:
```python
model_name_or_path: Optional[str] = field(default="/path/to/your/local/model")
```
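A minimal, hypothetical way to load one of these checkpoints with the Hugging Face transformers API, assuming the checkpoint directory follows the standard transformers format; the path and model class are placeholders.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path: point this at the directory of the downloaded checkpoint.
local_path = "/path/to/your/local/model"

tokenizer = AutoTokenizer.from_pretrained(local_path)
model = AutoModelForSequenceClassification.from_pretrained(local_path)

# Optional, in the spirit of the 4-bit QLoRA demo (assumed configuration):
# import torch
# from transformers import BitsAndBytesConfig
# quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
# model = AutoModelForSequenceClassification.from_pretrained(local_path, quantization_config=quant_config)
```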
Typically, the setup process on a standard PC takes several tens of minutes. Create the conda environment:

```bash
conda env create -f dna_llm.yml
```

Then submit a fine-tuning job with the path to your data:

```bash
sbatch run_llm_lora.sh data_path
```
Find models that are supported out of the box below.
Model | LoRA | AdaLoRA | Adaptive rank sampling | LINGO + one-hot | LINGO + BBPE |
---|---|---|---|---|---|
1000G-500M | ✅ | ✅ | ✅ | | |
DNABERT-2 | ✅ | ✅ | ✅ | | |
OPT | ✅ | ✅ | ✅ | ✅ | ✅ |
LLaMA | ✅ | ✅ | ✅ | | |
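The LINGO + one-hot and LINGO + BBPE columns refer to the two input representations mentioned above: raw nucleotides mapped to one-hot vectors versus byte-level BPE tokens. Below is a minimal sketch of one-hot encoding over the A/C/G/T vocabulary; the handling of ambiguous bases is an assumption, not necessarily what the repository does.

```python
import numpy as np

# Minimal sketch: one-hot encode a DNA sequence over the A/C/G/T vocabulary.
# Ambiguous bases (e.g. N) map to an all-zero row here; this is an assumption.
NUCLEOTIDE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Return a (sequence_length, 4) one-hot matrix for a DNA string."""
    encoding = np.zeros((len(sequence), 4), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        index = NUCLEOTIDE_INDEX.get(base)
        if index is not None:
            encoding[position, index] = 1.0
    return encoding

print(one_hot_encode("ACGTN"))  # last row is all zeros for the ambiguous base
```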
```bibtex
@inproceedings{zhan2023parameter,
  title={Parameter-Efficient Fine-Tune on Open Pre-trained Transformers for Genomic Sequence},
  author={Zhan, Huixin and Zhang, Zijun Frank},
  booktitle={NeurIPS 2023 Generative AI and Biology (GenBio) Workshop},
  year={2023}
}
```