
training run: Lora Ichigo Qwen2.5 32B


Goal

  • Evaluate the capabilities of Qwen2.5, which surpasses Llama 3.1 across all English benchmarks and shows particularly strong performance on Asian languages such as Vietnamese, as well as Singlish.
  • Assess the effectiveness of LoRA training in teaching the model to recognize and process sound tokens.

Methodology

  • Change the base model of Ichigo from Llama 3.1 to Qwen2.5 32B
  • Due to model size constraints, we employed LoRA adapters across all linear layers for both the continued-pretraining and supervised fine-tuning steps, while fully fine-tuning the embedding and LM head layers to accommodate the 513 new sound tokens. Following Qwen's methodology, we integrated the control tokens into the embedding layer without modifying the tokenizer (as discussed in Qwen2.5 Issue #29). Since the Qwen authors size these layers for matrix-multiplication efficiency (see NVIDIA's guidelines: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#), we padded the embedding layer and LM head to a multiple of 128 rows, giving a final embedding size of 152,192 tokens. A configuration sketch follows this list. The LoRA target modules were:
    ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"]
    
  • For pretraining, we used only the English subset of the Pretrain Data; for SFT, we used the v0.4 dataset.
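
Below is a minimal sketch of this setup with Hugging Face Transformers and PEFT. It is not the team's exact training script: the model ID, dtype, and the reading of "Lora-256-512" as rank 256 / alpha 512 are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed base checkpoint; the issue only says "Qwen2.5 32B".
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct", torch_dtype=torch.bfloat16
)

# Grow the embedding matrix and LM head to 152,192 rows (a multiple of 128, per
# the NVIDIA matmul guidelines above) so the 513 sound tokens map onto newly
# reserved rows without touching the tokenizer files.
model.resize_token_embeddings(152_192)

lora_config = LoraConfig(
    r=256,                 # assuming "Lora-256-512" = rank 256 / alpha 512
    lora_alpha=512,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "down_proj", "up_proj"],
    # Fully fine-tune the embedding and LM head alongside the adapters.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```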

Experiments

| Run ID | Date | Model Config | Dataset | Sequence Length | Learning Rate | Batch Size | Steps | Loss | Hardware | MMLU | MMLU-Pro | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| exp1-pretrain | 2024-11-23 | Lora-256-512 | Pretrain v0.1 | 512 | 1.5e-4 | 384 | 6302 | 1.9 | ~100 hours on 8xA6000 | - | - | old dataset |
| exp1-sft | 2024-11-27 | Lora-256-512 | SFT | 1024 | 3e-4 | 384 | 2500 | 1.2 | 60/100 hours on 6xA6000 | - | - | stopped early to prepare next run |
| exp2-pretrain | 2024-11-26 | Lora-256-512 | Pretrain v0.2 | 512 | 1.5e-4 | 512 | 4726 | 1.9 | ~100 hours on 8xA6000 | - | - | new dataset v0.2 |
| exp2-sft | 2024-12-01 | Lora-256-512 | SFT data | 4096 | 3e-4 | 256 | 8020 | updated soon | ~350 hours on 8xA6000 | - | - | ongoing run |

Learnings

Quicklinks

Running on 8 x A6000 in Taiwan

Discussion from today:

  • How long would it take on an H100?
  • Encoder issue?
  • Repetition?
  • We are preparing to release version 0.1 of Lora Ichigo Qwen 2.5 32B, which was trained using an earlier WhisperVQ checkpoint. I am running evaluations on benchmarks such as MMLU and MMLU-Pro.
  • We finished pretraining on the pretrain data on 8xA6000.
  • Fine-tuning phase is currently ongoing:
    • Using the mixed instruction speech WhisperVQ v4
    • Timeline has exceeded initial estimates
    • Current projected duration: approximately 350-400 hours
    • We are trying to find the optimal learning rate, but we can't do this on the 8xA6000.

cc @dan-homebrew @tikikun


What went wrong:

  • The model is too big to be trained on our internal cluster.
  • Evaluating the model takes a lot of time on 8xA6000.
  • Our current checkpoint was stopped early at step 2000.
  • We trained on the wrong (old) dataset.
  • We used the wrong WhisperVQ checkpoint.

Decision:

  • We will not release the model.

Key learning:

  • We can train a larger model (32B) with a much smaller number of trainable parameters (~3B) than fully fine-tuning an 8B model; see the rough breakdown below.
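
As a sanity check on that figure, here is a rough back-of-envelope count. It assumes rank-256 LoRA on all listed linear layers and the public Qwen2.5-32B dimensions (hidden size 5120, intermediate size 27648, 64 layers, 8 KV heads with head dim 128); the exact total depends on whether the LM head is counted separately from the embedding, so treat it as an estimate rather than the run's official number.

```python
# Back-of-envelope trainable-parameter count (assumptions: rank-256 LoRA on all
# linear layers, public Qwen2.5-32B dims; not an official figure from the run).
r = 256
hidden, inter, layers = 5120, 27648, 64
kv_dim = 8 * 128        # 8 KV heads x head_dim 128
vocab_padded = 152_192  # embedding rows after padding to a multiple of 128

# A LoRA adapter on a (d_in -> d_out) linear layer adds r * (d_in + d_out) params.
per_layer = (
    r * (hidden + hidden) * 2    # q_proj, o_proj
    + r * (hidden + kv_dim) * 2  # k_proj, v_proj
    + r * (hidden + inter) * 2   # gate_proj, up_proj
    + r * (inter + hidden)       # down_proj
)
lora_total = layers * per_layer      # ~2.1B adapter parameters
embed_total = vocab_padded * hidden  # ~0.8B per fully trained embedding matrix

print(f"LoRA adapters: {lora_total / 1e9:.2f}B")
print(f"embed_tokens:  {embed_total / 1e9:.2f}B (lm_head adds the same again if untied)")
```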