Train a model for a new language

Question

Opened this issue a month ago · 7 comments

I want to teach a model a new programming language. Without fine-tuning, the model cannot produce usable output at all, because the language is an internal front-end framework and open-source models have no corresponding corpus. I want to fine-tune Qwen-2.5-Coder-32B so that the component code it generates complies with the specifications in the internal framework documentation, and so that it can write code in the framework. How should I use Qwen-2.5-Coder-32B for training? Do we need to pretrain, or is fine-tuning Qwen-2.5-Coder-32B enough?

Answer 1 · 2024-12-05T08:11:32.000Z

https://github.com/QwenLM/Qwen2.5-Coder/tree/main/finetuning

Here are our fine-tuning scripts; you can try them.

Whether to pretrain depends on your requirements and resources. We advise you to try it first. We hope to hear about your successful implementation on Qwen-Coder :)

Answer 2 · 2024-12-05T12:16:50.000Z

Thank you for your reply. If I want to fine-tune Qwen-Coder, can I do it in two steps? The first step would fine-tune on the basic grammar of the new language, and the second step would be instruction fine-tuning. Could you give me some training suggestions for this? Thanks.

Answer 3 · 2024-12-10T10:05:52.000Z

You can try using lower-quality data in the first stage and high-quality data in the second (SFT) stage. This may bring more improvement (https://arxiv.org/abs/2412.05210).
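A minimal sketch of the two-stage split described above, assuming each sample already carries a quality score. The `quality` field, the threshold, and the function name `split_by_quality` are hypothetical and are not part of the Qwen2.5-Coder scripts; how to score quality is left open.

```python
from typing import Dict, List, Tuple

Sample = Dict[str, object]


def split_by_quality(
    samples: List[Sample], threshold: float
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (stage1, stage2) by a hypothetical quality score.

    Samples scoring below the threshold go to stage 1 (lower-quality data);
    samples at or above it go to stage 2 (high-quality SFT data).
    """
    stage1: List[Sample] = []
    stage2: List[Sample] = []
    for sample in samples:
        if float(sample["quality"]) >= threshold:
            stage2.append(sample)
        else:
            stage1.append(sample)
    return stage1, stage2


# Toy data; the texts and scores are made up for illustration.
samples = [
    {"text": "scraped component snippet", "quality": 0.3},
    {"text": "reviewed instruction/answer pair", "quality": 0.9},
    {"text": "auto-generated doc example", "quality": 0.5},
]
stage1_data, stage2_data = split_by_quality(samples, threshold=0.8)
```

Each stage would then be passed to a separate training run, with the second run starting from the checkpoint produced by the first.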

Answer 4 · 2024-12-11T07:20:24.000Z

Okay, thank you for your suggestion. I have another question: should we use full-parameter fine-tuning or LoRA fine-tuning? Our GPU resources are currently limited, so I plan to use LoRA, but I'm not sure how it will perform.

Answer 5 · 2024-12-16T03:36:13.000Z

Either way is OK. I'm not sure either :(

Waiting for your feedback~
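As a rough illustration of why LoRA suits a limited GPU budget: for a weight matrix with `d_in` inputs and `d_out` outputs, LoRA with rank `r` trains two small matrices holding `r * (d_in + d_out)` parameters in total, instead of the full `d_in * d_out`. The dimensions and rank below are illustrative assumptions, not measured values for Qwen-2.5-Coder-32B, and the arithmetic says nothing about final output quality.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative sizes only (assumed, not read from any model config).
d_in, d_out, rank = 5120, 5120, 16

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
ratio = lora / full

print(f"full: {full:,}  lora: {lora:,}  ratio: {ratio:.4%}")
```

With these assumed sizes the adapter trains well under 1% of the parameters of that one matrix. Whether that is enough to learn a new language is exactly the open question in this thread.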

Answer 6 · 2024-12-16T07:21:10.000Z

OK, thank you. If I do pre-training, is the SFT fine-tuning script suitable for it? I see that the repository only provides the fine-tuning script. Can this script be used for pre-training?

Answer 7 · 2024-12-24T12:50:44.000Z

You need to modify the script yourself, such as turning off the ChatML format, packing the corpus, and so on.
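A minimal sketch of what "packing the corpus" can look like, assuming the raw text has already been tokenized into lists of token IDs without any ChatML wrapping. The function name `pack_sequences` and the parameters `block_size` and `eos_id` are hypothetical and do not come from the Qwen2.5-Coder fine-tuning script; this is one common approach, not necessarily the one the maintainers use.

```python
from typing import Iterable, Iterator, List


def pack_sequences(
    tokenized_docs: Iterable[List[int]], block_size: int, eos_id: int
) -> Iterator[List[int]]:
    """Concatenate documents, separated by eos_id, and cut fixed-length blocks.

    Leftover tokens that do not fill a whole block are dropped.
    """
    buffer: List[int] = []
    for doc in tokenized_docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]


# Toy token IDs; a real run would use the model's tokenizer and its EOS ID.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = list(pack_sequences(docs, block_size=4, eos_id=0))
print(blocks)
```

In pre-training-style training, the labels for each block are typically the block's own token IDs, so every token contributes to the loss, whereas SFT usually masks the prompt part of each ChatML conversation.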