Finish Synthetic Datasets module
Closed this issue · 1 comment
davidberenstein1957 commented
The synthetic datasets module is not complete. It needs a finalised structure, some more information, and exercises.
Structure
Here is a basic proposal for a structure:
- what is synthetic data
- synthetic data for instruction tuning + adding seed knowledge (Magpie, SelfInstruct)
- synthetic data for preference tuning + LLM evals + adding seed knowledge (idem + response generation + UltraFeedback)
- improving synthetic data (injecting diversity, evolving/DEITA)
- evaluating synthetic data (quality classifiers, LLMs as judges, filtering/DEITA)
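To make the instruction-tuning bullet concrete, here is a minimal sketch of a SelfInstruct-style bootstrapping round. The `call_llm` function is a stand-in for any real model API, and the prompt and dedup filter are simplified illustrations (the paper uses ROUGE-L overlap for filtering), not the actual SelfInstruct implementation:

```python
import random

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned instruction here."""
    return "Summarise the following article in two sentences."

def self_instruct_round(seed_tasks: list[str], num_new: int = 2) -> list[str]:
    """One SelfInstruct-style round: show the model a few sampled seed
    tasks, ask for a new instruction, keep it only if it is not a duplicate."""
    generated = []
    for _ in range(num_new):
        examples = random.sample(seed_tasks, k=min(3, len(seed_tasks)))
        prompt = (
            "Here are some example tasks:\n"
            + "\n".join(f"- {t}" for t in examples)
            + "\nWrite one new, different task:"
        )
        candidate = call_llm(prompt).strip()
        # Crude exact-match dedup; real SelfInstruct filters near-duplicates too.
        if candidate not in seed_tasks and candidate not in generated:
            generated.append(candidate)
    return generated

seeds = ["Translate this sentence to French.", "List three uses of a paperclip."]
new_tasks = self_instruct_round(seeds)
```

In a real run the generated instructions are fed back into the seed pool, which is how the "adding seed knowledge" step grows the dataset.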
Project
- create an SFT dataset
- transform the SFT dataset into a preference dataset
- improve the preference dataset
- evaluate and compare the improved and basic datasets
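The SFT-to-preference step of the project could look roughly like this: sample two responses per prompt and rank them with an LLM judge, as in UltraFeedback. Both `generate_response` and `judge` are stand-ins for real model calls, and the model names are invented for illustration:

```python
def generate_response(model_name: str, prompt: str) -> str:
    """Stand-in for sampling a completion from a named model."""
    canned = {
        "strong-model": "Paris is the capital of France.",
        "weak-model": "I think it might be Lyon.",
    }
    return canned[model_name]

def judge(prompt: str, response: str) -> int:
    """Stand-in for an UltraFeedback-style LLM judge returning a 1-5 score."""
    return 5 if "Paris" in response else 2

def to_preference_pair(sft_row: dict) -> dict:
    """Turn one SFT row into a DPO-style (chosen, rejected) pair by
    sampling two responses and ranking them with the judge."""
    prompt = sft_row["prompt"]
    responses = [generate_response(m, prompt) for m in ("strong-model", "weak-model")]
    scored = sorted(responses, key=lambda r: judge(prompt, r), reverse=True)
    return {"prompt": prompt, "chosen": scored[0], "rejected": scored[1]}

pair = to_preference_pair({"prompt": "What is the capital of France?",
                           "completion": "Paris."})
```

The resulting rows have the `prompt`/`chosen`/`rejected` schema that DPO training expects.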
Comments
burtenshaw commented
Great. Thanks for outlining this.
The material you outlined sounds good. For now, I would focus on getting the core material for synthetic data in place and structuring it like the other modules, which would look something like this:
- README
- instruction_datasets.md
  - Magpie
  - SelfInstruct
- preference_datasets.md
  - UltraFeedback
- notebooks/
  - SFT dataset project
  - DPO dataset project
I would say this is the minimum that aligns with the previous modules.
> - improving synthetic data (injecting diversity, evolving/DEITA)
> - evaluating synthetic data (quality classifiers, LLMs as judges, filtering/DEITA)
I would say that these are good extras, which we can come back to if we have time.
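If the "evaluating synthetic data" extra does make it in, the filtering step can be sketched very simply: score each row with a judge and keep those above a threshold. The `judge_quality` heuristic below is a toy stand-in (real DEITA combines learned complexity and quality scorers, not response length):

```python
def judge_quality(sample: dict) -> float:
    """Stand-in for an LLM judge or quality classifier scoring a sample 0-1.
    Toy heuristic: longer responses score higher, capped at 1.0."""
    return min(len(sample["response"]) / 50, 1.0)

def filter_dataset(rows: list[dict], threshold: float = 0.5) -> list[dict]:
    """DEITA-style filtering step: keep only rows the judge scores
    at or above the threshold."""
    return [r for r in rows if judge_quality(r) >= threshold]

rows = [
    {"response": "Yes."},
    {"response": "The capital of France is Paris, a city on the Seine."},
]
kept = filter_dataset(rows)
```

Swapping the heuristic for a real judge call is the only change needed to turn this into the module's "quality classifiers / LLMs as judges" exercise.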