About the spec of instruction tuning dataset

Question

HuangChiEn opened this issue a year ago · 2 comments

Thanks for releasing this amazing work.
Both training datasets are currently unavailable on Hugging Face due to license concerns.

Could you please provide the spec of the instruction tuning dataset?

We want to find an alternative Traditional Chinese dataset with the same spec.

Spec:
1. The number of instruction samples (* K)
2. The number of seed tasks used to generate the tasks

Answer 1 · 2023-10-06T13:45:58.000Z

Thanks for your interest!

For IFT, v1.0 was trained on ~500k examples (all in Mandarin), including manually written examples and examples from proprietary models. I also wrote ~100 seed QA pairs, which were then paraphrased using model-based approaches.

Lots of interesting Mandarin instruction sets have been released on Hugging Face by the community. Please check them out :)

Answer 2 · 2023-10-06T13:51:27.000Z

By the way, I have re-listed our IFT dataset on Hugging Face: https://huggingface.co/datasets/yentinglin/traditional_mandarin_instructions
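
For anyone who wants to check the first spec item (sample count) against the re-listed dataset, here is a minimal sketch. It assumes the Hugging Face `datasets` library is installed and that the repository loads with its default configuration, which is not confirmed in this thread. The helper names `summarize_split_sizes` and `load_split_sizes` are hypothetical, made up for illustration.

```python
from typing import Dict


def summarize_split_sizes(split_sizes: Dict[str, int]) -> Dict[str, str]:
    """Format per-split example counts in thousands, e.g. 500000 -> '500.0K'."""
    return {name: f"{count / 1000:.1f}K" for name, count in split_sizes.items()}


def load_split_sizes(
    repo_id: str = "yentinglin/traditional_mandarin_instructions",
) -> Dict[str, int]:
    """Download the dataset and return the number of rows in each split.

    Assumption: the repo loads with its default configuration. Split names
    are taken from whatever the repository provides, not hard-coded here.
    Requires `pip install datasets` and network access.
    """
    # Imported lazily so summarize_split_sizes works without the library.
    from datasets import load_dataset

    dataset_dict = load_dataset(repo_id)
    return {name: split.num_rows for name, split in dataset_dict.items()}
```

Calling `summarize_split_sizes(load_split_sizes())` should then report each split's size in K, which can be compared with the size of a candidate replacement dataset. The seed-task count (item 2) cannot be read off the dataset this way; the ~100 figure comes from the answer above.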