open-sciencelab/GraphGen

[doc] Best Practice

Opened this issue · 2 comments

Result

Here is our final training result; 50% of the SFT data comes from GraphGen.

Domain   Dataset        our-7B-model   Qwen2.5-7B-Instruct
Plant    SeedBench      65.9           51.5
Common   CMMLU          73.6           75.8
Logic    GPQA-Diamond   40.0           33.3
Math     AIME24         20.6           16.7
Math     AIME25         22.7           7.2

Garbage in, garbage out

First, it is essential to ensure the high quality of the input chunks.

  • Positive example: A complete small story segment
  • Negative example: A fragment of a paper citation that contains only the title and lacks context
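A chunk-quality check like the one above can be automated with simple heuristics. Below is a minimal, illustrative sketch (the thresholds and rules are my own assumptions, not GraphGen defaults): keep passages that look complete and self-contained, drop short fragments such as a bare citation title.

```python
def is_good_chunk(text: str, min_chars: int = 80) -> bool:
    """Heuristic input-chunk filter (illustrative, not a GraphGen API):
    keep complete, self-contained passages; drop short fragments."""
    text = text.strip()
    if len(text) < min_chars:
        return False
    # A complete passage usually ends with sentence punctuation
    # (Latin or CJK full stops).
    return text[-1] in ".!?。！？"

good = ("The old gardener planted the last seed at dusk, watered it, "
        "and waited through the winter for the first green shoot.")
bad = "Deep Residual Learning for Image Recognition"  # title-only fragment
```

In practice you would tune `min_chars` to your chunking strategy and add domain-specific rules (e.g. rejecting reference-list lines).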

Second, filter the QA pairs according to business needs. The synthetic QA data contains entity words, but not every entity is worth keeping.

  • Positive example: The glorious deeds of the company's boss
  • Negative example: Meaningless coreference artifacts such as "fig 5.1" or "it"
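This kind of post-filter can also be scripted. The sketch below is a hypothetical example (the patterns and the `keep_qa` helper are mine, not part of GraphGen): drop QA pairs whose question hinges on an unresolved reference like "fig 5.1" or a bare pronoun instead of a real entity.

```python
import re

# Hypothetical reject patterns for questions built on unresolved references.
BAD_PATTERNS = [
    re.compile(r"\bfig(?:ure)?\s*\d+(\.\d+)?\b", re.IGNORECASE),  # "fig 5.1"
    re.compile(r"^\s*(it|this|that)\b", re.IGNORECASE),           # bare pronoun
]

def keep_qa(question: str) -> bool:
    """Return True if the question stands on its own, False if it
    depends on a dangling reference."""
    return not any(p.search(question) for p in BAD_PATTERNS)
```

Real business filtering would go further (entity allow-lists, an LLM judge), but even cheap regex rules catch the most obvious coreference garbage.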

API usage

  1. Make sure the LLM API supports logprobs (e.g. vllm serve with v0.6.6.post1) and enable the Trainee Model for hard-case mining. SiliconCloud on the OpenXLab web page is just for a free trial; real production use would not be free.


  2. Use a bigger synthesizer model, and ensure that the synthesizer and the trainee are of the same origin (e.g. the same model family).

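Why logprobs matter for hard-case mining: the trainee model's per-token logprobs can be turned into a confidence score, and low-confidence answers are the hard cases worth keeping. Here is a minimal sketch, assuming you already have a list of per-token logprobs from an OpenAI-compatible response (the function name and thresholds are illustrative, not GraphGen's actual implementation):

```python
import math

def answer_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability of an answer, in (0, 1].
    Lower values suggest the trainee finds the question harder."""
    if not token_logprobs:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

# Illustrative numbers: a confident answer vs. an uncertain one.
easy = answer_confidence([-0.05, -0.10, -0.02])
hard = answer_confidence([-1.20, -2.50, -0.90])
```

A mining loop would then keep QA pairs whose confidence falls below some cutoff, which is why the serving backend must expose logprobs at all.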

Hi, thanks for your great tools. I want to train a smaller vlm model, can i use a large vlm model to generate data?

As far as I know, text-image alignment is the key point for VLM training. You may have to add image nodes into the knowledge graph in GraphGen.