About the spec of instruction tuning dataset
HuangChiEn opened this issue · 2 comments
Thanks for releasing this amazing work.
Since both training datasets are currently unavailable on Hugging Face due to license concerns, could you please provide the spec of the instruction tuning dataset?
We want to find an alternative Traditional Chinese dataset that matches the same spec.
Spec:
1. the number of instruction samples (* K)
2. the number of seed tasks used to generate the tasks.
Thanks for your interest!
For instruction fine-tuning (IFT), v1.0 was trained on ~500k examples (all in Mandarin), including manually written examples and examples from proprietary models. I also wrote ~100 seed QA pairs, which were then paraphrased using model-based approaches.
Lots of interesting Mandarin instruction sets have been released on Hugging Face by the community. Please check them out :)
Btw, I have re-listed our IFT dataset on Hugging Face: https://huggingface.co/datasets/yentinglin/traditional_mandarin_instructions
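For anyone comparing a candidate replacement dataset against the numbers above (~500k examples, ~100 seed QA pairs), here is a minimal sketch of how the two spec items could be counted. The record layout and the `seed_id` field are hypothetical, used only for illustration; the actual columns of any given dataset may differ, so check its dataset card first.

```python
def summarize_spec(records, seed_key="seed_id"):
    """Count instruction samples and distinct seed tasks in a list of dict records.

    `seed_key` is a hypothetical field name; records without it are
    still counted as samples but do not contribute a seed task.
    """
    seeds = {r[seed_key] for r in records if r.get(seed_key) is not None}
    return {
        "num_samples": len(records),
        "num_samples_k": len(records) / 1000,  # the "* K" figure from the spec
        "num_seed_tasks": len(seeds),
    }


# Toy records standing in for a real dataset (assumed layout, not the real schema).
toy_records = [
    {"instruction": "Translate to English: 你好", "output": "Hello", "seed_id": 0},
    {"instruction": "Translate to English: 謝謝", "output": "Thank you", "seed_id": 0},
    {"instruction": "Summarize the passage.", "output": "A short summary.", "seed_id": 1},
]

spec = summarize_spec(toy_records)
print(spec)
```

With a real dataset, the records would typically come from the Hugging Face `datasets` library (`load_dataset(...)`) and the result could be compared directly against the figures quoted in this thread.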