This repository contains a dataset collected for NLI classification using GPT-3. Thanks to Greg Brockman at OpenAI for giving me access to the API.
Why was this dataset created? The goals of this project can be found in my blog post on Notion (https://www.notion.so/GPT3-Dataset-Task-Model-b97a267d6f5f44e688ba4f7ec85c00cc).
The dataset (data/dataset.jsonl) contains 30,000 examples in total. All of these examples are 'fake' and were generated by GPT-3. I used these to fine-tune a BERT model with moderate success, as you'll see if you read my post. I also included the output of each stage of my dataset creation process in data/.
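Since the dataset is stored as JSONL (one JSON object per line), it can be loaded with the standard library alone. A minimal sketch is below; the field names shown in the comment are assumptions, so check data/dataset.jsonl for the actual schema.

```python
import json

def load_examples(path="data/dataset.jsonl"):
    """Read a JSONL file into a list of dicts, one per example.

    Each line is a standalone JSON object; typical NLI fields would be
    something like "premise", "hypothesis", and "label" (assumed here,
    not confirmed -- inspect the file for the real keys).
    """
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip any blank lines
                examples.append(json.loads(line))
    return examples
```
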
Disclaimer: this dataset has not been filtered in any way. If you notice any offensive text, let me know and I'll be happy to remove it from the data. This repository is not owned by or associated with OpenAI.
If you use this dataset, please include a link to my blog post and this GitHub repository. Feel free to contact me at kgoel [at] cs [dot] stanford [dot] edu.