[doc] Best Practice
Opened this issue · 2 comments
tpoisonooo commented
Result
Here is our final training result; 50% of the SFT data comes from GraphGen.
| Domain | Dataset | our-7B-model | Qwen2.5-7B-Instruct |
| --- | --- | --- | --- |
| Plant | SeedBench | 65.9 | 51.5 |
| Common | CMMLU | 73.6 | 75.8 |
| Logic | GPQA-Diamond | 40.0 | 33.3 |
| Math | AIME24 | 20.6 | 16.7 |
| Math | AIME25 | 22.7 | 7.2 |
Garbage in, garbage out
First, it is essential to ensure that the input chunks are of high quality.
- Positive example: A complete, self-contained story segment
- Negative example: A fragment of a paper citation that only contains the title and lacks real information
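A minimal pre-filter sketch for dropping such low-quality chunks before synthesis. The length threshold and the citation heuristic are illustrative assumptions, not part of GraphGen:

```python
import re

MIN_CHARS = 50  # illustrative threshold; tune for your corpus
CITATION_PATTERN = re.compile(r"^\[\d+\]|\bet al\.|\bpp\.\s*\d+", re.IGNORECASE)

def is_good_chunk(chunk: str) -> bool:
    """Heuristic filter: keep self-contained passages, drop citation fragments."""
    text = chunk.strip()
    if len(text) < MIN_CHARS:          # too short to carry a complete idea
        return False
    if CITATION_PATTERN.search(text):  # looks like a reference-list fragment
        return False
    return True

chunks = [
    "Once upon a time, a farmer noticed that seedlings near the river grew faster than those on the hill...",
    "[12] Smith et al., pp. 45-47.",
]
good_chunks = [c for c in chunks if is_good_chunk(c)]  # keeps only the story chunk
```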
Second, filter the QA pairs according to business needs. The synthetic QA data is organized around entity words, but not every entity word is worth keeping.
- Positive example: The glorious deeds of the company's boss
- Negative example: Meaningless coreference results, e.g. "fig 5.1", "it"
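A hedged sketch of such a post-filter over synthesized QA pairs. The `qa_pairs` structure and the blocklist below are assumptions for illustration, not GraphGen's actual output schema:

```python
import re

# Entities that are artefacts of coreference or figure references rather than real concepts.
MEANINGLESS = {"it", "this", "that", "they"}
FIGURE_REF = re.compile(r"^(fig|figure|table|eq)\.?\s*[\d.]+$", re.IGNORECASE)

def keep_qa(pair: dict) -> bool:
    """Drop QA pairs whose central entity is a pronoun or a figure/table reference."""
    entity = pair["entity"].strip().lower()
    if entity in MEANINGLESS:
        return False
    if FIGURE_REF.match(entity):
        return False
    return True

qa_pairs = [
    {"entity": "drought resistance", "question": "...", "answer": "..."},
    {"entity": "fig 5.1", "question": "...", "answer": "..."},
]
filtered = [p for p in qa_pairs if keep_qa(p)]  # keeps only the first pair
```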
API usage
- Make sure the LLM API supports `logprobs` (e.g. `vllm serve` with `v0.6.6post1`) and enable the Trainee Model for hard-case mining; a quick verification sketch follows this list. Note that SiliconCloud on the OpenXLab web page is only a free trial; real production use would not be free.
- Use a larger synthesizer model, and ensure that the synthesizer and the trainee are of the same origin (i.e. from the same model family).
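A quick way to check that the endpoint actually returns `logprobs` before wiring it into hard-case mining. This sketch assumes an OpenAI-compatible server (such as one started with `vllm serve`) running locally; adjust the base URL and model name to your deployment:

```python
from openai import OpenAI

# Assumes an OpenAI-compatible endpoint, e.g. started with:
#   vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "1 + 1 = ?"}],
    logprobs=True,       # required for Trainee Model hard-case mining
    top_logprobs=5,
    max_tokens=8,
)

# If the server does not support logprobs, this field will be missing or None.
print(resp.choices[0].logprobs.content[0].top_logprobs)
```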
LuletterSoul commented
Hi, thanks for your great tool. I want to train a smaller VLM; can I use a large VLM to generate the data?
tpoisonooo commented
> Hi, thanks for your great tool. I want to train a smaller VLM; can I use a large VLM to generate the data?

As far as I know, text-image alignment is the key point for VLM training. You may have to add image nodes into the knowledge graph in GraphGen.
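A rough sketch of what adding an image node to a knowledge graph could look like, using `networkx` purely for illustration; GraphGen's internal graph schema may differ, so treat the node and edge attributes below as assumptions:

```python
import networkx as nx

kg = nx.Graph()

# Text entity extracted from a chunk.
kg.add_node("rice seedling", type="entity", source_chunk="chunk_042")

# Hypothetical image node: stores a path/URL plus a caption used for text-image alignment.
kg.add_node(
    "img_007",
    type="image",
    uri="data/images/rice_field.jpg",
    caption="Rice seedlings three weeks after transplanting",
)

# Link the image to the entity it depicts so QA synthesis can pair text with the image.
kg.add_edge("rice seedling", "img_007", relation="depicted_in")
```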