Awesome Japanese Self-Instruct

This is the repo for Awesome Japanese Self-Instruct, which aims to share high-quality data generated by GPT-4 for building instruction-following LLMs in Japanese with supervised learning and reinforcement learning. The repo contains:

  • Japanese Instruction-Following Data generated by GPT-4 using Alpaca prompts for fine-tuning LLMs.
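
For reference, a record in Alpaca-format data pairs an instruction with an optional input and a target output. The minimal loading sketch below is illustrative only: the file name japanese_alpaca_gpt4.json and the exact field names are assumptions, not taken from this repo.

import json

# Hypothetical file name; each record is assumed to follow the
# standard Alpaca schema with instruction/input/output fields.
with open("japanese_alpaca_gpt4.json", encoding="utf-8") as f:
    data = json.load(f)

example = data[0]
print(example["instruction"])  # the Japanese task description
print(example["input"])        # optional context; may be empty
print(example["output"])       # GPT-4's Japanese response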

The approach used to generate this dataset is described in the paper: Rapidly Developing High-quality Instruction Data and Evaluation Benchmark for Large Language Models with Minimal Human Effort: A Case Study on Japanese.

License Notices: The dataset is released under CC BY-NC 4.0 (non-commercial use only), and models trained with the dataset must not be used outside of research purposes.

Process

We revisited the original self-instruct method, which bootstraps instruction data from a small set of human-written seed tasks.

Our method translates those seed tasks into Japanese and manually post-edits them to achieve native-level quality. We then use GPT-4 to generate high-quality Japanese instruction data directly from these Japanese seeds.
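
The authoritative prompts and filtering steps are those in the paper and the released data. As a rough illustration of the generation step only, here is a minimal sketch assuming the OpenAI Python client, Alpaca-style instruction/input/output seed records, and a hypothetical seed_tasks_ja.jsonl file:

import json
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def load_seed_tasks(path="seed_tasks_ja.jsonl"):
    # Hypothetical file: the post-edited Japanese seed tasks, one JSON
    # object per line with Alpaca-style instruction/input/output keys.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def build_prompt(seeds, num_demos=3):
    # Self-instruct-style prompt: a few seed tasks serve as
    # demonstrations, and the model is asked to continue the list.
    demos = random.sample(seeds, num_demos)
    lines = ["Following the style of the examples below, write new, "
             "diverse instruction-following tasks in Japanese."]
    for i, task in enumerate(demos, 1):
        lines.append(f"{i}. Instruction: {task['instruction']}")
        lines.append(f"   Input: {task.get('input', '')}")
        lines.append(f"   Output: {task['output']}")
    lines.append(f"{len(demos) + 1}. Instruction:")
    return "\n".join(lines)

def generate_batch(seeds):
    # One GPT-4 call yields a batch of candidate Japanese tasks,
    # which would then be parsed and filtered before use.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt(seeds)}],
        temperature=1.0,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_batch(load_seed_tasks()))

Sampling a fresh set of demonstrations for each call encourages diversity in the generated tasks, mirroring the original self-instruct recipe.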

References

Stanford Alpaca

Citation

If you use our code in your research, please cite our work:

@inproceedings{sun2024rapidly,
   title={Rapidly Developing High-quality Instruction Data and Evaluation Benchmark for Large Language Models with Minimal Human Effort: A Case Study on Japanese},
   author={Sun, Yikun and Wan, Zhen and Ueda, Nobuhiro and Yahata, Sakiko and Cheng, Fei and Chu, Chenhui and Kurohashi, Sadao},
   booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
   year={2024}
}