Qwen3 LoRA Training Has Significantly Lower Trainable Parameters Than Expected
When fine-tuning Qwen3 models with LoRA, the percentage of trainable parameters is far lower than in comparable PyTorch implementations. This severely limits the adapter's capacity and results in suboptimal fine-tuning performance.
Steps to Reproduce:
1. Load a Qwen3 model (e.g., Qwen/Qwen3-14B-MLX-bf16).
2. Configure LoRA training with the default settings.
3. Observe the reported percentage of trainable parameters (a sketch of one way to measure this follows below).
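For reference, here is a minimal sketch of one way to measure the trainable fraction from Python. It assumes `mlx_lm.load` and the `linear_to_lora_layers` helper from mlx_lm/tuner/utils.py, plus placeholder hyperparameters (rank 8, 16 layers); the exact signature and config keys may differ between mlx-lm versions.

```python
# Minimal sketch: measure the trainable fraction after applying the
# default LoRA configuration (no explicit "keys", so the per-model
# defaults from mlx_lm/tuner/utils.py apply).
from mlx.utils import tree_flatten
from mlx_lm import load
from mlx_lm.tuner.utils import linear_to_lora_layers

model, tokenizer = load("Qwen/Qwen3-14B-MLX-bf16")

# Count the base model parameters before any LoRA conversion.
total = sum(p.size for _, p in tree_flatten(model.parameters()))

# Freeze the base weights, then swap the selected linear layers for LoRA layers.
# Hyperparameters here are placeholders; adjust to match your training setup.
model.freeze()
linear_to_lora_layers(model, 16, {"rank": 8, "scale": 20.0, "dropout": 0.0})

trainable = sum(p.size for _, p in tree_flatten(model.trainable_parameters()))
print(f"trainable: {trainable / total:.2%}")  # small, since only q_proj/v_proj are adapted
```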
Expected Behavior:
For a 14B parameter Qwen3 model, LoRA should train a reasonable percentage of parameters (typically 3-5% in PyTorch implementations), including attention layers (q_proj, k_proj, v_proj, o_proj) and MLP layers (gate_proj, up_proj, down_proj).
Actual Behavior:
MLX LoRA trains only about 0.28% of the parameters for Qwen3 models, which corresponds to adapting just the self_attn.q_proj and self_attn.v_proj modules. This is insufficient for effective fine-tuning.
Root Cause:
In mlx_lm/tuner/utils.py, the default LoRA keys for Qwen3 models are overly restrictive:
keys = {"self_attn.q_proj", "self_attn.v_proj"}
This should be expanded to include all relevant attention and MLP modules for proper LoRA training.
Suggested Fix:
Update the LoRA keys for Qwen3 models in mlx_lm/tuner/utils.py:
keys = {"self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj", "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"}
Environment:
MLX version: 0.17.1
Model: Qwen/Qwen3-14B-MLX-bf16
Python: 3.13
Platform: macOS (Apple Silicon)
Additional Context:
This issue prevents MLX from achieving fine-tuning performance comparable to PyTorch implementations. With the suggested fix, the trainable-parameter fraction increases from ~0.28% to ~3.48%, in line with typical PyTorch LoRA setups.
@yuemingruoan you can easily configure the layers you want to train (and the number of blocks) using a config file; see e.g. https://github.com/ml-explore/mlx-lm/blob/main/mlx_lm/examples/lora_config.yaml.
The defaults are somewhat arbitrary (and it may make sense to change them), but in the meantime, if you want to train all the linear layers, you can do so by specifying them in the config.
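For reference, a minimal sketch of the relevant portion of such a config, modeled on the linked lora_config.yaml; the field names (e.g. num_layers, lora_parameters) and the config flag of mlx_lm.lora may differ between mlx-lm versions.

```yaml
# Sketch of a LoRA training config, modeled on mlx_lm/examples/lora_config.yaml;
# verify the field names against the example file in your mlx-lm version.
model: "Qwen/Qwen3-14B-MLX-bf16"
train: true
num_layers: 16
lora_parameters:
  # Adapt all attention and MLP projections instead of only q_proj/v_proj.
  keys:
    - "self_attn.q_proj"
    - "self_attn.k_proj"
    - "self_attn.v_proj"
    - "self_attn.o_proj"
    - "mlp.gate_proj"
    - "mlp.up_proj"
    - "mlp.down_proj"
  rank: 8
  scale: 20.0
  dropout: 0.0
```

With a config like this passed to the LoRA training entry point, all of the listed attention and MLP projections are adapted rather than only q_proj and v_proj.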