KoGPT2-train

Train your own GPT2!


This repository is NOT actively maintained. However, issues and security alerts will be monitored and potentially fixed. This code is not directly compatible with HuggingFace Transformers (or models based on it, incl. Kakao GPT-3). I do NOT provide active support for TF-to-PyTorch model conversion requests.

GPT2 Training code



  • A training script with proper TPU support (<10% TPU idle time)
  • Fast tokenizer powered by HuggingFace/tokenizers (see the sketch after this list)
  • Live demo (currently unavailable)
  • 1.5B-parameter GPT-2 model pretrained on Korean (~40 GB corpus)
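
The fast tokenizer listed above is built with the huggingface/tokenizers library. Below is a minimal sketch of training a byte-level BPE tokenizer with that library; the corpus path, vocabulary size, and special tokens are illustrative assumptions, not the settings used for the released model.

# Sketch only: train a byte-level BPE tokenizer with huggingface/tokenizers.
# The corpus file, vocab size and special tokens below are assumptions.
import os
from tokenizers import ByteLevelBPETokenizer

os.makedirs("tokenizer", exist_ok=True)   # save_model expects an existing directory

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus/korean_web.txt"],      # hypothetical corpus file
    vocab_size=50000,                     # assumed; actual size not documented here
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("tokenizer")         # writes vocab.json and merges.txt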

Pretrained Model

GPT-2 Small through GPT-2 XL have been tested. Larger models are not guaranteed to work.

Google Colab

[Colab Notebook]

Train

cd KoGPT2-train
export PYTHONPATH=.
python3 train/train_tpu.py \
  --input_file gs://kogpt2/datasets/WEB* \
  --output_dir gs://kogpt2/models/large \
  --max_seq_length 2048 \
  --save_checkpoints_steps 5000 \
  --use_tpu true \
  --tpu_name v3-2 \
  --train_batch_size 16 \
  --config_file configs/large.json \
  --iterations_per_loop 1000 \
  --learning_rate 1e-4
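
The --input_file argument points to TFRecord shards of pre-tokenized text on GCS. The exact record schema expected by train/train_tpu.py is not shown here; the sketch below assumes fixed-length windows of max_seq_length token ids stored under a single int64 feature, and the feature name "input_ids" is a hypothetical placeholder.

# Sketch only: pack token ids into TFRecord shards for --input_file.
# The "input_ids" feature name and fixed 2048-token windows are assumptions;
# check train/train_tpu.py for the schema it actually expects.
import tensorflow as tf

MAX_SEQ_LENGTH = 2048  # matches --max_seq_length above

def write_tfrecord(token_id_stream, path):
    """token_id_stream: iterable of lists of token ids (one list per document)."""
    buffer = []
    with tf.io.TFRecordWriter(path) as writer:
        for ids in token_id_stream:
            buffer.extend(ids)
            while len(buffer) >= MAX_SEQ_LENGTH:
                window, buffer = buffer[:MAX_SEQ_LENGTH], buffer[MAX_SEQ_LENGTH:]
                feature = {"input_ids": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=window))}
                writer.write(tf.train.Example(
                    features=tf.train.Features(feature=feature)).SerializeToString())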

Disclaimer

The contents of this repository are for academic research purposes, and we do not provide any conclusive remarks. Currently, the underlying model is the same as GPT-2. I'm working on the alternating layers.

If you want plain GPT-2, just change the context length from 2048 tokens to 1024 and the model is practically the same. Refer to the original paper for the specific hyperparameter settings.

Acknowledgements

This research wouldn't have been possible without the TFRC program and NIPA's HPC Support Program.

Citation

@misc{KoGPT3,
  author = {Seungjae Kim},
  title = {KoGPT3: Pretrained for Korean},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ksjae/KoGPT}},
}

Reference

Code based on https://github.com/imcaspar/gpt2-ml

https://github.com/google-research/bert

https://github.com/rowanz/grover

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)