/gpt-2

Code for the paper "Language Models are Unsupervised Multitask Learners"

Primary LanguagePythonMIT LicenseMIT

To fine-tune your model using Google Colab go to:

Step-by-step guide on how to train GPT-2 on books using Google Colab

gpt-2

Code from the paper "Language Models are Unsupervised Multitask Learners".

See more details in our blog post.

Usage

This repository is meant to be a starting point for researchers and engineers to experiment with GPT-2.

The repository is forked from nshepperd who contributed some cool addition to the openai repo (e.g train.py). I have added a method called conditional_model() which is the same as the interactive conditional model except it is a method and return a dictionary which may be easier to work with if you have mutlipule sentences you need to the model to predict on.

Some caveats

  • GPT-2 models' robustness and worst case behaviors are not well-understood. As with any machine-learned model, carefully evaluate GPT-2 for your use case, especially if used without fine-tuning or in safety-critical applications where reliability is important.
  • The dataset our GPT-2 models were trained on contains many texts with biases and factual inaccuracies, and thus GPT-2 models are likely to be biased and inaccurate as well.
  • To avoid having samples mistaken as human-written, we recommend clearly labeling samples as synthetic before wide dissemination. Our models are often incoherent or inaccurate in subtle ways, which takes more than a quick read for a human to notice.

Work with us

Please let us know if you’re doing interesting research with or working on applications of GPT-2! We’re especially interested in hearing from and potentially working with those who are studying

  • Potential malicious use cases and defenses against them (e.g. the detectability of synthetic text)
  • The extent of problematic content (e.g. bias) being baked into the models and effective mitigations

Development

See DEVELOPERS.md

Contributors

See CONTRIBUTORS.md

Fine tuning on custom datasets

To retrain GPT-2 117M model on a custom text dataset:

PYTHONPATH=src ./train.py --dataset <file|directory|glob>

If you want to precompute the dataset's encoding for multiple runs, you can instead use:

PYTHONPATH=src ./encode.py <file|directory|glob> /path/to/encoded.npz
PYTHONPATH=src ./train.py --dataset /path/to/encoded.npz

Make sure cudnn is installed. Some have reported that train.py runs without it but has worse memory usage and might OOM.

Gradient Checkpointing

https://github.com/openai/gradient-checkpointing is included to reduce the memory requirements of the model, and can be enabled by --memory_saving_gradients. The checkpoints are currently chosen manually (poorly) by just adding layer 10 to the 'checkpoints' collection in model.py. --memory_saving_gradients is enabled by default for training the 345M model.

Validation loss

Set --val_every to a number of steps N > 0, and "validation" loss against a fixed sample of the dataset will be calculated every N steps to get a better sense of training progress. N around 200 suggested. You can set --val_dataset to choose a separate validation dataset, otherwise it defaults to a sample from the train dataset (so not a real cross-validation loss!).

Optimizer

You can use SGD instead of Adam with --optimizer sgd. This also helps conserve memory when training the 345M model. Note: the learning rate needs to be adjusted for SGD, due to not having Adam's gradient normalization (0.0006 seems to be a good number from some experiments).

Multi gpu (out of date)

To do distributed on multiple GPUs or machines using Horovod:

mpirun -np 4 \
    -H localhost:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -x PYTHONPATH=src \
    -mca pml ob1 -mca btl ^openib \
    /home/jovyan/gpt-2/train-horovod.py --dataset encoded.npz

GPT-2 samples

WARNING: Samples are unfiltered and may contain offensive content.

While we have not yet released GPT-2 itself, you can see some samples from it in the gpt-2-samples folder. We show unconditional samples with default settings (temperature 1 and no truncation), with temperature 0.7, and with truncation with top_k 40. We show conditional samples, with contexts drawn from WebText's test set, with default settings (temperature 1 and no truncation), with temperature 0.7, and with truncation with top_k 40.

Citation

Please use the following bibtex entry:

@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}

Future work

We may release code for evaluating the models on various benchmarks.

We are still considering release of the larger models.

License

MIT