huggingface/smol-course

Finish Synthetic Datasets module

Closed this issue · 1 comment

The synthetic datasets module is not complete. It requires a finalised structure, some more information, and exercises.

Structure
Here is a basic proposal for a structure:

  • what synthetic data is
  • synthetic data for instruction tuning + adding seed knowledge (Magpie, SelfInstruct) — see the sketch after this list
  • synthetic data for preference tuning + LLM evals + adding seed knowledge (same as above + response generation + UltraFeedback)
  • improving synthetic data (injecting diversity, evolving/DEITA)
  • evaluating synthetic data (quality classifiers, LLMs as judges, filtering/DEITA)
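As a rough illustration of the instruction-tuning part, here is a minimal SelfInstruct-style sketch: start from a few seed tasks, prompt an LLM for new instructions, and keep only candidates that are not near-duplicates of the pool. The `generate` helper and the `self_instruct` function are hypothetical placeholders, not the module's actual API; whatever backend the course settles on (e.g. distilabel or a transformers pipeline) would slot in behind `generate`.

```python
# Minimal SelfInstruct-style sketch: grow an instruction pool from seed tasks.
# `generate` is a hypothetical placeholder for the module's LLM backend.
import random
from difflib import SequenceMatcher

def generate(prompt: str) -> str:
    """Placeholder: call the actual LLM backend here."""
    raise NotImplementedError

def self_instruct(seed_instructions: list[str], target_size: int = 50, max_attempts: int = 500) -> list[str]:
    pool = list(seed_instructions)
    attempts = 0
    while len(pool) < target_size and attempts < max_attempts:
        attempts += 1
        # Show the model a few existing instructions and ask for a new one.
        examples = "\n".join(f"- {i}" for i in random.sample(pool, k=min(4, len(pool))))
        prompt = (
            "Here are some example tasks:\n"
            f"{examples}\n"
            "Write one new, different task:"
        )
        candidate = generate(prompt).strip()
        # Crude similarity filter to keep the pool diverse (stand-in for ROUGE-based filtering).
        if candidate and all(SequenceMatcher(None, candidate, i).ratio() < 0.7 for i in pool):
            pool.append(candidate)
    return pool
```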

Project

  • create an SFT dataset
  • transform the SFT dataset into a preference dataset — see the sketch after this list
  • improve the preference dataset
  • evaluate and compare the improved and basic datasets
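For the SFT-to-preference step, one UltraFeedback-style way to do it is to sample a second completion per prompt and let a judge rank the two, keeping the higher-scored one as "chosen". This is a sketch under those assumptions; `generate`, `judge_score`, and `to_preference_dataset` are illustrative names, not the project's actual functions.

```python
# Sketch: turn an SFT dataset of {prompt, response} rows into a DPO-style
# preference dataset of {prompt, chosen, rejected} rows.
# `generate` and `judge_score` are hypothetical placeholders.

def generate(prompt: str) -> str:
    raise NotImplementedError  # e.g. sample a second completion from a model

def judge_score(prompt: str, response: str) -> float:
    raise NotImplementedError  # e.g. an LLM judge returning a numeric rating

def to_preference_dataset(sft_rows: list[dict]) -> list[dict]:
    preference_rows = []
    for row in sft_rows:
        prompt, original = row["prompt"], row["response"]
        alternative = generate(prompt)
        # Rank the original and the alternative response with the judge.
        ranked = sorted(
            [original, alternative],
            key=lambda r: judge_score(prompt, r),
            reverse=True,
        )
        preference_rows.append(
            {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[1]}
        )
    return preference_rows
```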

Comments

Great. Thanks for outlining this.

The material you outlined sounds good. For now, I would focus on getting the core material for synthetic data in place and structuring it like the other modules, which would be something like this:

  • README
  • instruction_datasets.md
    • Magpie
    • SelfInstruct
  • preference_datasets.md
    • UltraFeedback
  • notebooks/
    • SFT dataset project
    • DPO dataset project

I would say this is the minimum which aligns with the previous modules.

improving synthetic data (injecting diversity, evolving/DEITA)
evaluating synthetic data (quality classifiers, LLMs as judges, filtering/DEITA)

I would say that these are good extras, which we can come back to if we have time. A rough sketch of the evaluation/filtering idea is included below for reference.
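If we do get to the evaluation extra, the core loop could be as simple as rating every row with an LLM judge, dropping low-scoring rows, and comparing dataset variants by mean score. The sketch below assumes a hypothetical `judge_score` helper standing in for an LLM-as-judge call; `filter_by_quality` and `compare_datasets` are illustrative names only.

```python
# Sketch of LLM-as-judge evaluation and filtering for a synthetic dataset.
# `judge_score` is a hypothetical placeholder for a judge call that returns
# a numeric quality rating for a (prompt, response) pair.
from statistics import mean

def judge_score(prompt: str, response: str) -> float:
    raise NotImplementedError  # e.g. parse a 1-5 rating out of a judge model's reply

def filter_by_quality(rows: list[dict], threshold: float = 3.0) -> list[dict]:
    # Keep only rows the judge rates at or above the threshold.
    return [r for r in rows if judge_score(r["prompt"], r["response"]) >= threshold]

def compare_datasets(basic: list[dict], improved: list[dict]) -> dict:
    # Compare the basic and improved datasets by their mean judge score.
    def mean_score(rows: list[dict]) -> float:
        return mean(judge_score(r["prompt"], r["response"]) for r in rows)

    return {"basic": mean_score(basic), "improved": mean_score(improved)}
```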