huggingface/smol-course

Finish Synthetic Datasets module

Closed this issue · 1 comment

The synthetic datasets module is not complete. It requires a finalised structure, some more information, and exercises.

Structure
Here is a basic proposal for a structure:

  • what synthetic data is
  • synthetic data for instruction tuning + adding seed knowledge (Magpie, SelfInstruct) — see the sketch after this list
  • synthetic data for preference tuning + LLM evals + adding seed knowledge (same as above + response generation + UltraFeedback)
  • improving synthetic data (injecting diversity, evolving/DEITA)
  • evaluating synthetic data (quality classifiers, LLMs as judges, filtering/DEITA)
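As a rough illustration of the instruction-tuning part, here is a minimal SelfInstruct-style sketch: start from a few seed tasks, prompt an LLM for new instructions, and keep only candidates that are not near-duplicates of the pool. The `generate` helper and the `self_instruct` function are hypothetical placeholders, not the module's actual API; whatever backend the course settles on (e.g. distilabel or a transformers pipeline) would slot in behind `generate`.

```python
# Minimal SelfInstruct-style sketch: grow an instruction pool from seed tasks.
# `generate` is a hypothetical placeholder for the module's LLM backend.
import random
from difflib import SequenceMatcher

def generate(prompt: str) -> str:
    """Placeholder: call the actual LLM backend here."""
    raise NotImplementedError

def self_instruct(seed_instructions: list[str], target_size: int = 50, max_attempts: int = 500) -> list[str]:
    pool = list(seed_instructions)
    attempts = 0
    while len(pool) < target_size and attempts < max_attempts:
        attempts += 1
        # Show the model a few existing instructions and ask for a new one.
        examples = "\n".join(f"- {i}" for i in random.sample(pool, k=min(4, len(pool))))
        prompt = (
            "Here are some example tasks:\n"
            f"{examples}\n"
            "Write one new, different task:"
        )
        candidate = generate(prompt).strip()
        # Crude similarity filter to keep the pool diverse (stand-in for ROUGE-based filtering).
        if candidate and all(SequenceMatcher(None, candidate, i).ratio() < 0.7 for i in pool):
            pool.append(candidate)
    return pool
```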

Project

  • create an SFT dataset
  • transform the SFT dataset into a preference dataset — see the sketch after this list
  • improve the preference dataset
  • evaluate and compare the improved and basic datasets
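For the SFT-to-preference step, one UltraFeedback-style way to do it is to sample a second completion per prompt and let a judge rank the two, keeping the higher-scored one as "chosen". This is a sketch under those assumptions; `generate`, `judge_score`, and `to_preference_dataset` are illustrative names, not the project's actual functions.

```python
# Sketch: turn an SFT dataset of {prompt, response} rows into a DPO-style
# preference dataset of {prompt, chosen, rejected} rows.
# `generate` and `judge_score` are hypothetical placeholders.

def generate(prompt: str) -> str:
    raise NotImplementedError  # e.g. sample a second completion from a model

def judge_score(prompt: str, response: str) -> float:
    raise NotImplementedError  # e.g. an LLM judge returning a numeric rating

def to_preference_dataset(sft_rows: list[dict]) -> list[dict]:
    preference_rows = []
    for row in sft_rows:
        prompt, original = row["prompt"], row["response"]
        alternative = generate(prompt)
        # Rank the original and the alternative response with the judge.
        ranked = sorted(
            [original, alternative],
            key=lambda r: judge_score(prompt, r),
            reverse=True,
        )
        preference_rows.append(
            {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[1]}
        )
    return preference_rows
```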

Comments

Great. Thanks for outlining this.

The material you outlined sounds good. For now, I would focus on getting the core material for synthetic data in place and structuring it like the other modules, which would be something like this:

  • README
  • instruction_datasets.md
    • Magpie
    • SelfInstruct
  • preference_datasets.md
    • UltraFeedback
  • notebooks/
    • SFT dataset project
    • DPO dataset project

I would say this is the minimum which aligns with the previous modules.

improving synthetic data (injecting diversity, evolving/DEITA)
evaluating synthetic data (quality classifiers, LLMs as judges, filtering/DEITA)

I would say that these are good extras, which we can come back to if we have time. A rough sketch of the evaluation/filtering idea is included below for reference.
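If we do get to the evaluation extra, the core loop could be as simple as rating every row with an LLM judge, dropping low-scoring rows, and comparing dataset variants by mean score. The sketch below assumes a hypothetical `judge_score` helper standing in for an LLM-as-judge call; `filter_by_quality` and `compare_datasets` are illustrative names only.

```python
# Sketch of LLM-as-judge evaluation and filtering for a synthetic dataset.
# `judge_score` is a hypothetical placeholder for a judge call that returns
# a numeric quality rating for a (prompt, response) pair.
from statistics import mean

def judge_score(prompt: str, response: str) -> float:
    raise NotImplementedError  # e.g. parse a 1-5 rating out of a judge model's reply

def filter_by_quality(rows: list[dict], threshold: float = 3.0) -> list[dict]:
    # Keep only rows the judge rates at or above the threshold.
    return [r for r in rows if judge_score(r["prompt"], r["response"]) >= threshold]

def compare_datasets(basic: list[dict], improved: list[dict]) -> dict:
    # Compare the basic and improved datasets by their mean judge score.
    def mean_score(rows: list[dict]) -> float:
        return mean(judge_score(r["prompt"], r["response"]) for r in rows)

    return {"basic": mean_score(basic), "improved": mean_score(improved)}
```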