
training run: Lora Ichigo Qwen2.5 32B


Goal

  • Evaluate the capabilities of Qwen2.5, which surpasses Llama 3.1 across all English benchmarks and shows particularly strong performance on Asian languages such as Vietnamese, as well as Singlish.
  • Assess the effectiveness of LoRA training in teaching the model to recognize and process sound tokens.

Methodology

  • Change the base model of Ichigo from Llama 3.1 to Qwen2.5 32B
  • Due to model size constraints, we employed LoRA adapters across all linear layers for both the continued-pretraining and supervised fine-tuning steps, while fully fine-tuning the embedding and LM head layers to accommodate the 513 new sound tokens. Following Qwen's methodology, we integrated the control tokens into the embedding layer without modifying the tokenizer (as discussed in Qwen2.5 Issue #29). Since the Qwen authors size these layers for matrix-multiplication efficiency (see NVIDIA's guidelines: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#), we padded the embedding layer and LM head to a multiple of 128 rows, giving a final embedding size of 152,192 tokens. A configuration sketch follows this list. The LoRA target modules were:
    ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"]
    
  • For pretraining, we used only the English subset of the Pretrain Data; for SFT, we used the v0.4 dataset.
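
Below is a minimal sketch of this setup with Hugging Face Transformers and PEFT. It is not the team's exact training script: the model ID, dtype, and the reading of "Lora-256-512" as rank 256 / alpha 512 are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed base checkpoint; the issue only says "Qwen2.5 32B".
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct", torch_dtype=torch.bfloat16
)

# Grow the embedding matrix and LM head to 152,192 rows (a multiple of 128, per
# the NVIDIA matmul guidelines above) so the 513 sound tokens map onto newly
# reserved rows without touching the tokenizer files.
model.resize_token_embeddings(152_192)

lora_config = LoraConfig(
    r=256,                 # assuming "Lora-256-512" = rank 256 / alpha 512
    lora_alpha=512,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "down_proj", "up_proj"],
    # Fully fine-tune the embedding and LM head alongside the adapters.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```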

Experiments

| Run ID | Date | Model Config | Dataset | Sequence Length | Learning Rate | Batch Size | Steps | Loss | Hardware | MMLU | MMLU-Pro | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| exp1-pretrain | 2024-11-23 | Lora-256-512 | Pretrain v0.1 | 512 | 1.5e-4 | 384 | 6302 | 1.9 | ~100 hours on 8xA6000 | - | - | old dataset |
| exp1-sft | 2024-11-27 | Lora-256-512 | SFT | 1024 | 3e-4 | 384 | 2500 | 1.2 | 60/100 hours on 6xA6000 | - | - | stopped early to prepare next run |
| exp2-pretrain | 2024-11-26 | Lora-256-512 | Pretrain v0.2 | 512 | 1.5e-4 | 512 | 4726 | 1.9 | ~100 hours on 8xA6000 | - | - | new dataset v0.2 |
| exp2-sft | 2024-12-01 | Lora-256-512 | SFT data | 4096 | 3e-4 | 256 | 8020 | updated soon | ~350 hours on 8xA6000 | - | - | ongoing run |

Learnings

Quicklinks

Running on 8 x A6000 in Taiwan

Discussion from today:

  • How long would it take on an H100?
  • Encoder issue?
  • Repetition?
  • We are preparing to release version 0.1 of Lora Ichigo Qwen 2.5 32B, which was trained using an earlier WhisperVQ checkpoint. I am running evaluations on benchmarks such as MMLU and MMLU-Pro.
  • We finished pretraining on the pretrain data on 8xA6000.
  • Fine-tuning phase is currently ongoing:
    • Using the mixed instruction speech WhisperVQ v4
    • Timeline has exceeded initial estimates
    • Current projected duration: approximately 350-400 hours
    • We are trying to find the optimal learning rate, but we can't do this on the 8xA6000.

cc @dan-homebrew @tikikun


What went wrong:

  • The model is too big to be trained on our internal cluster.
  • Evaluating the model takes a lot of time on 8xA6000.
  • Our current checkpoint was stopped early at step 2000.
  • We trained on the wrong (old) dataset.
  • We used the wrong WhisperVQ checkpoint.

Decision:

  • We will not release the model.

Key learning:

  • We can train a larger model (32B) with a much smaller number of trainable parameters (~3B) than fully fine-tuning an 8B model; see the rough breakdown below.
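
As a sanity check on that figure, here is a rough back-of-envelope count. It assumes rank-256 LoRA on all listed linear layers and the public Qwen2.5-32B dimensions (hidden size 5120, intermediate size 27648, 64 layers, 8 KV heads with head dim 128); the exact total depends on whether the LM head is counted separately from the embedding, so treat it as an estimate rather than the run's official number.

```python
# Back-of-envelope trainable-parameter count (assumptions: rank-256 LoRA on all
# linear layers, public Qwen2.5-32B dims; not an official figure from the run).
r = 256
hidden, inter, layers = 5120, 27648, 64
kv_dim = 8 * 128        # 8 KV heads x head_dim 128
vocab_padded = 152_192  # embedding rows after padding to a multiple of 128

# A LoRA adapter on a (d_in -> d_out) linear layer adds r * (d_in + d_out) params.
per_layer = (
    r * (hidden + hidden) * 2    # q_proj, o_proj
    + r * (hidden + kv_dim) * 2  # k_proj, v_proj
    + r * (hidden + inter) * 2   # gate_proj, up_proj
    + r * (inter + hidden)       # down_proj
)
lora_total = layers * per_layer      # ~2.1B adapter parameters
embed_total = vocab_padded * hidden  # ~0.8B per fully trained embedding matrix

print(f"LoRA adapters: {lora_total / 1e9:.2f}B")
print(f"embed_tokens:  {embed_total / 1e9:.2f}B (lm_head adds the same again if untied)")
```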