open-sciencelab/GraphGen

[doc] Best Practice

Opened this issue · 2 comments

Result

Here is our final training result; 50% of the SFT data comes from GraphGen.

Domain   Dataset        our-7B-model   Qwen2.5-7B-Instruct
Plant    SeedBench      65.9           51.5
Common   CMMLU          73.6           75.8
Logic    GPQA-Diamond   40.0           33.3
Math     AIME24         20.6           16.7
Math     AIME25         22.7           7.2

Garbage in, garbage out

First, it is essential to ensure the high quality of the input chunks.

  • Positive example: A complete small story segment
  • Negative example: A fragment of a paper citation that contains only the title and lacks context
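A chunk-quality check like the one above can be automated with simple heuristics. Below is a minimal, illustrative sketch (the thresholds and rules are my own assumptions, not GraphGen defaults): keep passages that look complete and self-contained, drop short fragments such as a bare citation title.

```python
def is_good_chunk(text: str, min_chars: int = 80) -> bool:
    """Heuristic input-chunk filter (illustrative, not a GraphGen API):
    keep complete, self-contained passages; drop short fragments."""
    text = text.strip()
    if len(text) < min_chars:
        return False
    # A complete passage usually ends with sentence punctuation
    # (Latin or CJK full stops).
    return text[-1] in ".!?。！？"

good = ("The old gardener planted the last seed at dusk, watered it, "
        "and waited through the winter for the first green shoot.")
bad = "Deep Residual Learning for Image Recognition"  # title-only fragment
```

In practice you would tune `min_chars` to your chunking strategy and add domain-specific rules (e.g. rejecting reference-list lines).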

Second, filter the QA pairs according to business needs. The synthetic QA data contains entity words, but not every entity is worth keeping.

  • Positive example: The glorious deeds of the company's boss
  • Negative example: Meaningless coreference artifacts such as "fig 5.1" or "it"
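This kind of post-filter can also be scripted. The sketch below is a hypothetical example (the patterns and the `keep_qa` helper are mine, not part of GraphGen): drop QA pairs whose question hinges on an unresolved reference like "fig 5.1" or a bare pronoun instead of a real entity.

```python
import re

# Hypothetical reject patterns for questions built on unresolved references.
BAD_PATTERNS = [
    re.compile(r"\bfig(?:ure)?\s*\d+(\.\d+)?\b", re.IGNORECASE),  # "fig 5.1"
    re.compile(r"^\s*(it|this|that)\b", re.IGNORECASE),           # bare pronoun
]

def keep_qa(question: str) -> bool:
    """Return True if the question stands on its own, False if it
    depends on a dangling reference."""
    return not any(p.search(question) for p in BAD_PATTERNS)
```

Real business filtering would go further (entity allow-lists, an LLM judge), but even cheap regex rules catch the most obvious coreference garbage.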

API usage

  1. Make sure the LLM API supports logprobs (e.g. vllm serve with v0.6.6.post1) and enable the Trainee Model for hard-case mining. SiliconCloud on the OpenXLab web page is just for a free trial; real production use would not be free.


  2. Use a bigger synthesizer model, and ensure that the synthesizer and the trainee are of the same origin (e.g. the same model family).

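Why logprobs matter for hard-case mining: the trainee model's per-token logprobs can be turned into a confidence score, and low-confidence answers are the hard cases worth keeping. Here is a minimal sketch, assuming you already have a list of per-token logprobs from an OpenAI-compatible response (the function name and thresholds are illustrative, not GraphGen's actual implementation):

```python
import math

def answer_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability of an answer, in (0, 1].
    Lower values suggest the trainee finds the question harder."""
    if not token_logprobs:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

# Illustrative numbers: a confident answer vs. an uncertain one.
easy = answer_confidence([-0.05, -0.10, -0.02])
hard = answer_confidence([-1.20, -2.50, -0.90])
```

A mining loop would then keep QA pairs whose confidence falls below some cutoff, which is why the serving backend must expose logprobs at all.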

Hi, thanks for your great tools. I want to train a smaller vlm model, can i use a large vlm model to generate data?

As far as I know, text-image alignment is the key point for VLM training. You may have to add image nodes into the knowledge graph in GraphGen.