fujiki-1emon/toward-japanese-llms

Resources toward Japanese LLMs

Below is a list of resources that I find useful for building Large Language Models (LLMs) in Japanese.
- The list is not complete; there are other relevant resources that I have not listed yet.
Some resources in the list are annotated with my comments.

Table of Contents

General
Pre-training datasets
Downstream tasks
Tokenization
Models
Model Architecture
Training
Evaluation

General

A Survey of Large Language Models [arXiv]

Pre-training datasets

The Pile

The Pile: An 800GB Dataset of Diverse Text for Language Modeling [arXiv]
- noted: Pile-CC
About 0.07% of the Pile (approx. 900M characters) is Japanese text, as estimated from character types.
- cf. The composition of the Pile (why can Cerebras-GPT be used for Japanese?)
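
An estimate "based on the types of the characters" can be sketched as follows. This is only a minimal illustration, not the method used in the linked article: the function names and the choice of Unicode ranges are my assumptions, and because kanji share code points with Chinese, counting them as Japanese can over-estimate the share.

```python
def is_japanese_char(ch: str) -> bool:
    """Return True for hiragana, katakana, or CJK unified ideographs.

    Assumption: these three Unicode blocks are treated as "Japanese".
    Kanji are shared with Chinese, so this is a rough heuristic.
    """
    cp = ord(ch)
    return (
        0x3040 <= cp <= 0x309F      # hiragana
        or 0x30A0 <= cp <= 0x30FF   # katakana
        or 0x4E00 <= cp <= 0x9FFF   # CJK unified ideographs
    )


def japanese_char_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look Japanese."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_japanese_char(ch) for ch in chars) / len(chars)
```

Applied to a sample of each Pile component and weighted by component size, a ratio like this would give a corpus-level estimate comparable in spirit to the figure above.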

The ROOTS

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

RedPajama-Data [github]

However, the Pile still seems to be better than RedPajama-Data (cf. https://twitter.com/BlancheMinerva/status/1652899628356960256?s=20).

Downstream tasks

Question Answering

Better Question-Answering Models on a Budget [arXiv]
- From a brief look it seems interesting, but the authors appear to compare their models only against OPT models?

Tokenization

LINE DistilBERT

Word segmentation by MeCab+UniDic + subword tokenization by SentencePiece
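
The two-stage idea (word segmentation first, subword splitting second) can be illustrated with a toy sketch. Everything here is a stand-in: the word list is hand-written where MeCab+UniDic output would go, and a greedy longest-match over a tiny hypothetical vocabulary replaces SentencePiece, which in practice learns its segmentation model from data. The sketch only shows that subword pieces never cross word boundaries.

```python
def split_into_subwords(word, vocab):
    """Greedy longest-match split of a single word (toy stand-in for SentencePiece).

    Characters not covered by the vocabulary become single-character pieces.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first; fall back to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def tokenize(words, vocab):
    """Split each pre-segmented word independently, so pieces stay inside words."""
    return [piece for word in words for piece in split_into_subwords(word, vocab)]


# Hand-written segmentation (assumed to be what a morphological analyzer returns).
words = ["大規模", "言語", "モデル"]
# Hypothetical subword vocabulary; "語モ" spans a word boundary and is never used.
vocab = {"大", "規模", "言語", "モデル", "語モ"}
tokens = tokenize(words, vocab)
```

Here `tokens` is `["大", "規模", "言語", "モデル"]`: the first word is split into two pieces, and the cross-boundary entry `"語モ"` is never matched because each word is processed on its own.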

Models

abeja/gpt-neox-japanese-2.7b

The large-scale GPT model built at ABEJA and the road to it

Model Architecture

Architectural refinements in the ABEJA GPT model

Training

Setting up a multi-node parallel training environment for GPT-NeoX with DeepSpeed

Fine-tuning

[Intern report] LoRA tuning of a 6.7B Japanese model
Self-Instruct: Aligning Language Model with Self Generated Instructions [arXiv]
- https://twitter.com/rasbt/status/1650866140892069892?s=20
INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models [arXiv] [related tweet]

Alignment

Learning to summarize from human feedback [arXiv]
Training language models to follow instructions with human feedback [arXiv]

RLHF

RLHF works because it's rating full sentences
- https://twitter.com/savvyRL/status/1651255588813443073?s=20
- https://twitter.com/mr_bay_area/status/1651594421551644678?s=20
  - Sequence Level Training with Recurrent Neural Networks [arXiv]

Evaluation

Reports

Performance report of rinna/japanese-gpt-1b fine-tuned on Japanese version (translation) of Dolly dataset [tweet]

Practical

Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond [arXiv]

misc.

How well does ChatGPT speak Japanese?

What I've developed