Resources toward Japanese LLMs

  • Below is a list of resources that I find useful for building Large Language Models (LLMs) in Japanese.
    • The list is not complete; there are other relevant resources that I have not listed yet.
  • Each resource in the list may include my comments on it.

Table of Contents

General

  • A Survey of Large Language Models [arXiv]

Pre-training datasets

The Pile

The ROOTS

RedPajama-Data [github]

Downstream tasks

Question Answering

  • Better Question-Answering Models on a Budget [arXiv]
    • From a brief look it seems interesting, but are their models compared only against the OPT models?

Tokenization

  • Word segmentation by MeCab+UniDic + subword tokenization by SentencePiece
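
The two-stage idea above (word segmentation first, subword splitting second) can be sketched as follows. This is only a minimal illustration, not a definitive recipe: `two_stage_tokenize`, `toy_segment_words`, `toy_split_subwords`, and `TOY_VOCAB` are hypothetical names introduced here, and the toy functions are stand-ins (whitespace splitting and greedy longest-match) so that the sketch runs without MeCab, UniDic, or SentencePiece installed. The commented lines show how the real libraries might be plugged in, assuming `mecab-python3` with a UniDic dictionary and a trained SentencePiece model are available.

```python
from typing import Callable, List


def two_stage_tokenize(
    text: str,
    segment_words: Callable[[str], List[str]],
    split_subwords: Callable[[str], List[str]],
) -> List[str]:
    """Segment text into words, then split each word into subword pieces.

    Subword splitting is applied word by word, so no piece can cross
    a word boundary chosen by the segmenter.
    """
    pieces: List[str] = []
    for word in segment_words(text):
        pieces.extend(split_subwords(word))
    return pieces


# --- Toy stand-ins (assumptions, NOT the real MeCab/SentencePiece behavior) ---

TOY_VOCAB = {"日本", "語", "の", "言語", "モデル"}  # hypothetical subword vocabulary


def toy_segment_words(text: str) -> List[str]:
    # Assumes the input is already space-separated,
    # similar in shape to the output of `mecab -Owakati`.
    return text.split()


def toy_split_subwords(word: str) -> List[str]:
    # Greedy longest-match against TOY_VOCAB; characters not covered by
    # the vocabulary fall back to single-character pieces.
    # This is a simple substitute, not SentencePiece's algorithm.
    out: List[str] = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in TOY_VOCAB or j == i + 1:
                out.append(word[i:j])
                i = j
                break
    return out


tokens = two_stage_tokenize("日本語 の 言語モデル", toy_segment_words, toy_split_subwords)
print(tokens)

# With the real libraries (assumed installed; the model path is hypothetical):
#   import MeCab
#   import sentencepiece as spm
#   tagger = MeCab.Tagger("-Owakati")
#   sp = spm.SentencePieceProcessor(model_file="spm.model")
#   two_stage_tokenize(
#       text,
#       lambda t: tagger.parse(t).split(),
#       lambda w: sp.encode(w, out_type=str),
#   )
```

Passing the segmenter and the subword splitter as callables keeps the pipeline shape visible and makes it easy to swap in another dictionary or subword model. In practice, the SentencePiece model is often trained on text that was pre-segmented in the same way, so that training and inference see the same word boundaries.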

Models

Model Architecture

Training

Fine-tuning

Alignment

  • Learning to summarize from human feedback [arXiv]
  • Training language models to follow instructions with human feedback [arXiv]

RLHF

Evaluation

Reports

  • Performance report of rinna/japanese-gpt-1b fine-tuned on a Japanese translation of the Dolly dataset [tweet]
    • [Screenshot, 2023-04-29]

Practical

  • Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond [arXiv]

Misc.

What I've developed