A few words

Hi there! 😃 Like you, I'm an NLP researcher looking for a way to run the GPT-2 project on TensorFlow 2.x. As you may already know, GPT-2 was developed by the OpenAI team and uses TensorFlow 1.x as its base framework. Unfortunately, TensorFlow 1.x and 2.x are very different, and upgrading from 1.x to 2.x is somewhat hard. TensorFlow 2.x removed some libraries and modules that were frequently used in 1.x, such as tf.contrib, and moved or modified a few others, such as "hparams". As a result, some parts of the original GPT-2 source code are broken under 2.x.

To use GPT-2 in a private project, I first planned to rewrite all of the source code in TensorFlow 2.0, both to fit my existing code and to understand the GPT-2 model better. But I realized that my deadlines were killing me 😛 and that a new implementation might not match the performance of the original GPT-2 anyway. So I decided to change the GPT-2 source code as little as possible in order to keep its capabilities. To do that, I cloned a project from a super nice guy, Awesome GPT-2 with training script, and made some improvements. This is what I have changed:

  • Added an Hparams class to replace "tf.contrib.training.HParams"; you can find it in the src/hparams.py file.
  • Added the "graph_def_editor" module to replace the "graph_editor" module. This is an awesome project written by CODAIT. I will place the path of this project in here.
  • Added a new option to nshepperd's training script for choosing a GPU or CPU device. You can find more details in the train.py file.
  • And a bunch of other minor changes ...
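To give a feel for what replacing "tf.contrib.training.HParams" involves, here is a minimal sketch of such a container. It is not the code in src/hparams.py: the class body is an assumption, and the hyperparameter values are examples only. The method names override_from_dict and values mirror methods the old TensorFlow 1.x class offered, which is roughly the surface that GPT-2 style code tends to rely on.

```python
class HParams:
    """Tiny attribute container; a hypothetical stand-in, not src/hparams.py."""

    def __init__(self, **kwargs):
        for name, value in kwargs.items():
            setattr(self, name, value)

    def override_from_dict(self, new_values):
        """Overwrite existing hyperparameters and reject unknown names."""
        for name, value in new_values.items():
            if name not in vars(self):
                raise KeyError(f"unknown hyperparameter: {name}")
            setattr(self, name, value)
        return self

    def values(self):
        """Return the hyperparameters as a plain dict."""
        return dict(vars(self))


# Example values only; a real model would load them from its own config file.
hparams = HParams(n_vocab=0, n_ctx=1024, n_embd=768, n_head=12, n_layer=12)
hparams.override_from_dict({"n_vocab": 50257})
print(hparams.n_vocab, hparams.values()["n_layer"])
```

Rejecting unknown names in override_from_dict is a design choice of this sketch: it makes a typo in a config file fail loudly instead of silently adding a new attribute.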

Testing project

To test this project, you can use pip or whatever you like to set up your environment. If, like me, you are an Anaconda user 😃, I have prepared an environment file to make your life easier. Just follow the steps below:

  1. Install Anaconda
  2. cd ./src
  3. conda env create -f ./environment.yml -p ./.env
  4. conda activate ./.env
  5. To train and test, follow the instructions from the nshepperd project, which I have added below 😃. Remember, you can choose which device your model runs on with the '--device' flag 😛.
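As an illustration of how a '--device' style flag can be turned into something TensorFlow understands, here is a small sketch. The accepted values ("cpu" and "gpu"), the default, and the helper device_name are assumptions made for this example; check train.py for what the flag really accepts. Only the standard library is used, so the sketch runs without TensorFlow installed.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Device selection sketch")
    # Hypothetical choices and default; train.py may accept other values.
    parser.add_argument("--device", choices=["cpu", "gpu"], default="gpu")
    return parser


def device_name(kind, index=0):
    """Build a TensorFlow-style device string such as "/GPU:0"."""
    return f"/{kind.upper()}:{index}"


# Parse an explicit argument list so the example runs on its own.
args = build_parser().parse_args(["--device", "cpu"])
name = device_name(args.device)
print(name)
# With TensorFlow 2.x available, the model could then be built inside
# a "with tf.device(name):" block.
```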

The descriptions of the nshepperd and OpenAI projects are below

Code from the paper "Language Models are Unsupervised Multitask Learners".

We have currently released small (117M parameter) and medium (345M parameter) versions of GPT-2. While we have not released the larger models, we have released a dataset for researchers to study their behaviors.

This repository is meant to be a starting point for researchers and engineers to experiment with GPT-2.

Some caveats

  • GPT-2 models' robustness and worst case behaviors are not well-understood. As with any machine-learned model, carefully evaluate GPT-2 for your use case, especially if used without fine-tuning or in safety-critical applications where reliability is important.
  • The dataset our GPT-2 models were trained on contains many texts with biases and factual inaccuracies, and thus GPT-2 models are likely to be biased and inaccurate as well.
  • To avoid having samples mistaken as human-written, we recommend clearly labeling samples as synthetic before wide dissemination. Our models are often incoherent or inaccurate in subtle ways, which takes more than a quick read for a human to notice.

Fine tuning on custom datasets

To retrain the GPT-2 117M model on a custom text dataset:

PYTHONPATH=src ./train.py --dataset <file|directory|glob>

If you want to precompute the dataset's encoding for multiple runs, you can instead use:

PYTHONPATH=src ./encode.py <file|directory|glob> /path/to/encoded.npz
PYTHONPATH=src ./train.py --dataset /path/to/encoded.npz

Make sure cuDNN is installed. Some have reported that train.py runs without it but uses more memory and might run out of memory (OOM).

Gradient Checkpointing

https://github.com/openai/gradient-checkpointing is included to reduce the memory requirements of the model, and can be enabled with --memory_saving_gradients. The checkpoints are currently chosen manually (poorly) by just adding layer 10 to the 'checkpoints' collection in model.py. --memory_saving_gradients is enabled by default for training the 345M model.

Validation loss

Set --val_every to a number of steps N > 0, and a "validation" loss against a fixed sample of the dataset will be calculated every N steps to give a better sense of training progress. An N of around 200 is suggested. You can set --val_dataset to choose a separate validation dataset; otherwise it defaults to a sample from the training dataset (so it is not a real cross-validation loss!).
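The periodic evaluation described above can be sketched with a toy loop. This is not the logic of train.py: the function names, the sample size of 4, and the toy "loss" are all assumptions, picked only to show a fixed validation sample being scored every N steps.

```python
import random


def fixed_validation_sample(dataset, size, seed=0):
    """Pick the validation items once so every evaluation sees the same data."""
    rng = random.Random(seed)
    return rng.sample(dataset, size)


def train(dataset, steps, val_every, loss_fn, val_dataset=None):
    # Without a separate validation set, sample from the training data,
    # which is why the result is not a real cross-validation loss.
    source = val_dataset if val_dataset is not None else dataset
    val_sample = fixed_validation_sample(source, min(4, len(source)))
    history = []
    for step in range(1, steps + 1):
        # ... one optimizer update on a training batch would happen here ...
        if val_every > 0 and step % val_every == 0:
            val_loss = sum(loss_fn(x) for x in val_sample) / len(val_sample)
            history.append((step, val_loss))
    return history


# Toy stand-ins: the "documents" are numbers and the "loss" is their magnitude.
history = train(list(range(10)), steps=1000, val_every=200, loss_fn=abs)
print([step for step, _ in history])  # [200, 400, 600, 800, 1000]
```

Because the sample is drawn once with a fixed seed, changes in the reported loss reflect changes in the model and not changes in which examples were scored.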


Optimizer

You can use SGD instead of Adam with --optimizer sgd. This also helps conserve memory when training the 345M model. Note: the learning rate needs to be adjusted for SGD, due to not having Adam's gradient normalization (0.0006 seems to be a good number from some experiments).
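To see why the learning rate has to be retuned, it helps to compare the size of a single update under both rules. The sketch below uses the standard textbook update formulas for one scalar parameter, not this repository's optimizer code, and the learning rate of 0.001 is an arbitrary illustration value. On its first step, Adam divides the gradient by an estimate of its own magnitude, so the update is close to the learning rate whatever the gradient scale, while the SGD update grows in proportion to the gradient.

```python
import math


def sgd_step(grad, lr):
    """Plain SGD: the update is the gradient scaled by the learning rate."""
    return lr * grad


def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's first update (t = 1) for a single scalar parameter."""
    m = (1 - beta1) * grad           # first-moment estimate
    v = (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1)          # bias correction at t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


lr = 0.001  # arbitrary value for illustration
for grad in (0.01, 1.0, 100.0):
    print(grad, sgd_step(grad, lr), adam_first_step(grad, lr))
```

Across gradients that differ by four orders of magnitude, the Adam column stays near 0.001 while the SGD column spans the same four orders of magnitude, so a rate tuned for one optimizer does not transfer to the other.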

Multi-GPU (out of date)

To run distributed training on multiple GPUs or machines using Horovod:

mpirun -np 4 \
    -H localhost:4 \
    -bind-to none -map-by slot \
    -x PYTHONPATH=src \
    -mca pml ob1 -mca btl ^openib \
    /home/jovyan/gpt-2/train-horovod.py --dataset encoded.npz

GPT-2 samples

WARNING: Samples are unfiltered and may contain offensive content.


Future work

We may release code for evaluating the models on various benchmarks.

We are still considering release of the larger models.
