Parameter-Efficient Fine-Tuning (PEFT) has become the de facto approach to fine-tuning pre-trained foundation models (PFMs) while reducing computational costs. Current PEFT methods include:
- Prefix tuning methods, e.g., Prefix-Tuning: Optimizing Continuous Prompts for Generation and P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
- Prompt Tuning methods, e.g., The Power of Scale for Parameter-Efficient Prompt Tuning
- Low-rank adaptation methods, e.g., LoRA: Low-Rank Adaptation of Large Language Models and AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
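For reference, low-rank adapters of this kind can be attached with the Hugging Face peft library. The snippet below is a minimal, illustrative sketch; the model name, target modules, and hyperparameters are placeholders, not the settings used in this repository.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

# Illustrative only: wrap a small OPT classifier with LoRA adapters.
base_model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/opt-125m", num_labels=2
)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                  # low-rank dimension of the adapter matrices
    lora_alpha=16,                        # scaling factor applied to the low-rank update
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # attention projections that receive adapters
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # only the adapter parameters are trainable
```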
Among these methods, we opt for adaptive rank sampling to address the data heterogeneity issue, and for LINGO (Language prefix fINe-tuning for GenOmes) to leverage the in-context learning ability of LLMs. The framework is as follows:
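As a toy illustration of the language-prefix idea, a DNA sequence can be wrapped in a short natural-language instruction before tokenization. The instruction wording below is hypothetical and is not the prompt template used by LINGO.

```python
# Hypothetical sketch of prepending a language prefix to a DNA sequence.
# The instruction text is illustrative only.
def build_prefixed_input(dna_sequence: str) -> str:
    instruction = "Determine whether the following DNA sequence is a promoter."
    return f"{instruction}\nSequence: {dna_sequence}\nAnswer:"

print(build_prefixed_input("TATAAAAGGCCGTAG"))
```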
The repository is organized as follows:
- dataset/: the directory of datasets. We applied our adaptive rank sampling to a comprehensive set of genome understanding tasks on various LLMs, i.e., promoter detection and epigenetic marks prediction in yeast and in multiple human cell types. The link is here.
- finetune/: fine-tuning LLMs and pre-trained DNA foundation models on single-label and multi-label tasks using DSP with BBPE-tokenized embeddings and one-hot embeddings.
- peftnew/: coupling adaptive rank sampling (RS) with the AdaLoRA method.
- scripts/: SLURM batch script to run the .py files.
- demos/: Some minimal demos to run AdaLoRA + RS with DSP on OPT and 4-bit quantized Llama. See llama_dna_sequential_finetune_QLoRA.ipynb
- In addition, this link contains two fine-tuned checkpoints (see link). Replace "/path/to/your/local/model" with the actual path to the saved model on your local system:
```python
model_name_or_path: Optional[str] = field(default="/path/to/your/local/model")
```
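A minimal, hypothetical way to load one of these checkpoints with the Hugging Face transformers API, assuming the checkpoint directory follows the standard transformers format; the path and model class are placeholders.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path: point this at the directory of the downloaded checkpoint.
local_path = "/path/to/your/local/model"

tokenizer = AutoTokenizer.from_pretrained(local_path)
model = AutoModelForSequenceClassification.from_pretrained(local_path)

# Optional, in the spirit of the 4-bit QLoRA demo (assumed configuration):
# import torch
# from transformers import BitsAndBytesConfig
# quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
# model = AutoModelForSequenceClassification.from_pretrained(local_path, quantization_config=quant_config)
```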
Typically, the setup process on a standard PC takes several tens of minutes. Create the conda environment:

```bash
conda env create -f dna_llm.yml
```

Then submit a fine-tuning job with the path to your data:

```bash
sbatch run_llm_lora.sh data_path
```
Find models that are supported out of the box below.
Model | LoRA | AdaLoRA | Adaptive rank sampling | LINGO + one-hot | LINGO + BBPE |
---|---|---|---|---|---|
1000G-500M | ✅ | ✅ | ✅ | | |
DNABERT-2 | ✅ | ✅ | ✅ | | |
OPT | ✅ | ✅ | ✅ | ✅ | ✅ |
LLaMA | ✅ | ✅ | ✅ | | |
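The LINGO + one-hot and LINGO + BBPE columns refer to the two input representations mentioned above: raw nucleotides mapped to one-hot vectors versus byte-level BPE tokens. Below is a minimal sketch of one-hot encoding over the A/C/G/T vocabulary; the handling of ambiguous bases is an assumption, not necessarily what the repository does.

```python
import numpy as np

# Minimal sketch: one-hot encode a DNA sequence over the A/C/G/T vocabulary.
# Ambiguous bases (e.g. N) map to an all-zero row here; this is an assumption.
NUCLEOTIDE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(sequence: str) -> np.ndarray:
    """Return a (sequence_length, 4) one-hot matrix for a DNA string."""
    encoding = np.zeros((len(sequence), 4), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        index = NUCLEOTIDE_INDEX.get(base)
        if index is not None:
            encoding[position, index] = 1.0
    return encoding

print(one_hot_encode("ACGTN"))  # last row is all zeros for the ambiguous base
```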
```bibtex
@inproceedings{zhan2023parameter,
  title={Parameter-Efficient Fine-Tune on Open Pre-trained Transformers for Genomic Sequence},
  author={Zhan, Huixin and Zhang, Zijun Frank},
  booktitle={NeurIPS 2023 Generative AI and Biology (GenBio) Workshop},
  year={2023}
}
```