Training run: LoRA Ichigo Qwen2.5 32B
bachvudinh commented
Goal
- Evaluate Qwen2.5, which surpasses LLaMA 3.1 across all English benchmarks and shows particularly strong performance on Asian languages such as Vietnamese and Singlish.
- Assess the effectiveness of LoRA training in teaching the model to recognize and process sound tokens.
Methodology
- Change the base model of Ichigo from LLaMA 3.1 to Qwen2.5 32B.
- Due to model size constraints, we employed LoRA adapters across all linear layers for both continued pretraining and supervised fine-tuning, while fully fine-tuning the embedding and LM head layers to accommodate 513 new sound tokens. Following Qwen's methodology, we integrated the control tokens into the embedding layer without modifying the tokenizer (as discussed in Qwen2.5 Issue #29), which, according to the Qwen authors, helps optimize training performance (see NVIDIA's matrix multiplication guidelines: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html). Accordingly, we padded the embedding layer and LM head dimensions to a multiple of 128, resulting in a final embedding size of 152,192 tokens. A configuration sketch follows this list.
  - LoRA target modules: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"]`
- For pretraining, I used only the English subset of the pretrain data; for SFT, I used the v0.4 dataset.
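Below is a minimal sketch of how this setup could be expressed with Hugging Face `transformers` and `peft`. It is not the actual training script: the base checkpoint name, the sound-token strings, and the LoRA rank/alpha (read off the `Lora-256-512` config label in the table below) are assumptions.

```python
# Minimal sketch (assumed setup, not the actual training script) of the recipe above:
# LoRA on all linear projections, full fine-tuning of embeddings + LM head,
# 513 new sound tokens, and embedding size padded to a multiple of 128.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "Qwen/Qwen2.5-32B"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Add the 513 sound tokens (hypothetical token strings), then pad the embedding
# and LM head rows to a multiple of 128 per NVIDIA's matmul guidelines; with
# Qwen2.5's tokenizer this should land on the 152,192 figure quoted above.
sound_tokens = [f"<|sound_{i:04d}|>" for i in range(513)]
tokenizer.add_tokens(sound_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=128)

# LoRA adapters on every linear projection; embed_tokens and lm_head are kept
# fully trainable via modules_to_save so the new sound tokens can be learned.
lora_config = LoraConfig(
    r=256,            # assumed from the "Lora-256-512" config label
    lora_alpha=512,   # assumed from the same label
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "down_proj", "up_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```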
Experiments
Run ID | Date | Model Config | Dataset | Sequence Length | Learning Rate | Batch Size | Steps | Loss | Hardware | MMLU | MMLU-Pro | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|
exp1-pretrain | 2024-11-23 | Lora-256-512 | Pretrain v0.1 | 512 | 1.5e-4 | 384 | 6302 | 1.9 | ~100 hours on 8x A6000 | - | - | old dataset |
exp1-sft | 2024-11-27 | Lora-256-512 | SFT | 1024 | 3e-4 | 384 | 2500 | 1.2 | 60/100 hours on 6x A6000 | - | - | stopped early to prepare next run |
exp2-pretrain | 2024-11-26 | Lora-256-512 | Pretrain v0.2 | 512 | 1.5e-4 | 512 | 4726 | 1.9 | ~100 hours on 8x A6000 | - | - | new dataset v0.2 |
exp2-sft | 2024-12-01 | Lora-256-512 | SFT data | 4096 | 3e-4 | 256 | 8020 | to be updated | ~350 hours on 8x A6000 | - | - | ongoing run |
Learnings
Quicklinks
- Old Pretrain data v0.1: https://huggingface.co/datasets/homebrewltd/raw-speech-whispervq-v1
- Pretrain data v0.2: https://huggingface.co/datasets/homebrewltd/raw-speech-whispervq-v2
- SFT Data: https://huggingface.co/datasets/homebrewltd/mixed-instruction-speech-whispervq-v4
- Checkpoints:
dan-homebrew commented
Running on 8 x A6000 in Taiwan
dan-homebrew commented
Discussion from today:
- How long would it take on an H100?
- Encoder issue?
- Repetition?
bachvudinh commented
- We are preparing to release version 0.1 of LoRA Ichigo Qwen2.5 32B, which was trained using an earlier WhisperVQ checkpoint. I am running evaluations on benchmarks such as MMLU and MMLU-Pro.
- We finished pretraining on the pretrain data on 8x A6000.
- The fine-tuning phase is currently ongoing:
  - Using the mixed-instruction-speech WhisperVQ v4 dataset
  - Timeline has exceeded initial estimates
  - Current projected duration: approximately 350-400 hours
  - Still trying to find the optimal learning rate
---> We can't complete this run on the 8x A6000 cluster.
bachvudinh commented
What went wrong:
- The model is too big to be trained on our internal cluster.
- Evaluating the model takes a lot of time on 8x A6000.
- Our current checkpoint was stopped early at step 2000.
- We trained on the wrong (old) dataset.
- We used the wrong WhisperVQ checkpoint.
Decision:
- We will not release the model.
Key learning:
- We can train a larger model (32B) with a much smaller number of trainable parameters (~3B) than a fully fine-tuned 8B model (rough count sketched below).
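As a hedged sanity check on that figure, the sketch below estimates the trainable-parameter count, assuming Qwen2.5-32B's dimensions (hidden size 5120, 64 layers, MLP size 27648, 8 KV heads with head dim 128) and the r=256 LoRA configuration sketched earlier; the exact number depends on the real config and on whether the fully trained embedding/LM head are counted.

```python
# Rough, assumption-laden estimate of trainable parameters for the setup above.
# Model dims are assumed from Qwen2.5-32B's config; LoRA rank r=256 is assumed
# from the "Lora-256-512" label.
r = 256
hidden, layers, mlp, kv_dim = 5120, 64, 27648, 8 * 128
vocab = 152_192  # padded embedding size from the methodology section

# A LoRA adapter on a (d_in x d_out) linear layer adds r * (d_in + d_out) params.
proj_shapes = [
    (hidden, hidden),  # q_proj
    (hidden, kv_dim),  # k_proj
    (hidden, kv_dim),  # v_proj
    (hidden, hidden),  # o_proj
    (hidden, mlp),     # gate_proj
    (hidden, mlp),     # up_proj
    (mlp, hidden),     # down_proj
]
lora_params = layers * sum(r * (d_in + d_out) for d_in, d_out in proj_shapes)
embed_params = 2 * vocab * hidden  # fully trained embed_tokens + lm_head

print(f"LoRA adapters: {lora_params / 1e9:.2f}B")
print(f"Embeddings + LM head: {embed_params / 1e9:.2f}B")
print(f"Total trainable: {(lora_params + embed_params) / 1e9:.2f}B")
```

Under these assumptions the adapters alone come to roughly 2B parameters and the fully trained embedding plus LM head add about another 1.5B, so the trainable slice is only a small fraction of the 32B base and of the same order as the ~3B figure above.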