Scraping Harry Potter fan fic for machine learning content generation


Harry Potter GPT-2 fanfiction generator

,-_/,.                   .-,--.     .  .
' |_|/ ,-. ,-. ,-. . .    '|__/ ,-. |- |- ,-. ,-.
 /| |  ,-| |   |   | |    ,|    | | |  |  |-' |      **       *
 `' `' `-^ '   '   `-|    `'    `-' `' `' `-' '       **     **                 zz
                    /|                             **   **  **    ****         zz
                   `-'                               ***   *  *****           zzzzzz
,---. .-,--. ,--,--'                                                             zz
|  -'  '|__/ `- |    ,-,   ," ,-. ,-. ," . ,-.           xx  ****               zz
|  ,-' ,|     , | --  /    |- ,-| | | |- | |            xx      ***            zz
`---|  `'     `-'    '-`   |  `-^ ' ' |  ' `-'         xx
 ,-.|                      '          '               xx                oooooo          oooooo
 `-+'                                                xx                oo    oo        oo    oo
                        .                           xx                 o      oo ***** o      o
,-. ,-. ,-. ,-. ,-. ,-. |- ,-. ,-.                 xx                  o       o**   **o      o
| | |-' | | |-' |   ,-| |  | | |                  xx                   oo     oo       oo     o
`-| `-' ' ' `-' '   `-^ `' `-' '                 xx                     oooooo          oooooo
 ,|
 `'

Generate your own Harry Potter fanfiction with a pre-trained GPT-2 generative text model using Hugging Face's transformers library.

This project has two parts: a scraper and a text generation model. The scraper fetches stories from fanfiction.net and creates text files ready for training.

The text generation part lets you generate new fanfiction. We pre-trained a model on the ~100 most popular HP fanfiction stories, but you can scrape a different set of stories and train your own model.

Examples

Faced with a pandemic of new wizarding disease, there wasn't much to do. The Ministry was struggling and the public had been sickened by Harry's expulsion from Hogwarts on his eleventh birthday – all things considered."Well," Dumbledore said finally as he looked at Pansy Parkinson in confusion once more; "I...


Faced with a pandemic of Death Eaters, it was hard not to be surprised that the Department for Regulation and Control (DCR) had been heavily infiltrated. Harry knew from his research on magical Britain's Ministry over its political alignment during World War Two how much influence there really were in wizarding society...


Hufflepuff lost by a hundred points to Slytherin, and the Quidditch team's seeker, Cedric Diggory took down his dragon. The game resumed soon after that as Harry made an appearance in front of everyone."Well I guess it was well deserved," Draco commented. "I don't think...


Hufflepuff won the House Cup by a landslide, and I don't think that's fair!" He said. "I know you can see it in your eyes! But what about him? You're letting him win because he was hurt?"A few of his friends looked alarmed at this statement."We should make sure...

Requirements

A computer with Git, Python 3 and pip installed.

Getting started

First, clone the repo:

$ git clone git@github.com:ceostroff/harry-potter-gpt2-fanfiction.git

Now, navigate to the folder and install the dependencies (we recommend setting up a virtualenv):

$ pip3 install -r requirements.txt

You should have everything you need to run the scraper and the model.

Generating fanfiction

To run the text generation script, first make it executable:

$ chmod +x text_generation.sh

Now you can generate text using the default settings and the pre-trained model:

$ ./text_generation.sh

The script will ask you for a prompt: this is the initial text the model will use to start a story. If you edit the file you can adjust the temperature (the 'randomness' of the generated text, 0.7 by default), the length, and the number of stories (by default 10 blocks of 60 words each). See text_generation.sh itself for all of the available options.
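Temperature deserves a closer look. The stdlib-only sketch below (not the project's actual sampling code) shows how a temperature value reshapes a next-token probability distribution: dividing the logits by a temperature below 1 sharpens the distribution toward the likeliest token, while a temperature above 1 flattens it toward uniform.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits into probabilities, scaled by temperature.

    Lower temperature -> sharper (less random) distribution;
    higher temperature -> flatter (more random) distribution.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                           # toy next-token scores
cool = softmax_with_temperature(logits, 0.7)       # closer to greedy decoding
hot = softmax_with_temperature(logits, 1.5)        # more varied output
```

With these toy logits, the top token's probability is noticeably higher at temperature 0.7 than at 1.5, which is why lowering the default makes stories more repetitive but more coherent.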

Getting data

The scripts that collect the text from fanfiction stories are in the scrapers folder. The first scraper, fetchData.py, collects links to the 1,000 highest-rated Harry Potter fanfiction stories and writes them, along with other data about each story, to a CSV called HPSummary.csv.
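HPSummary.csv's exact columns aren't documented here; purely as an illustration, a per-story metadata CSV like it can be written with Python's stdlib csv module (the field names and values below are hypothetical, not the scraper's real schema):

```python
import csv

# Hypothetical per-story metadata; the real columns in HPSummary.csv may differ.
stories = [
    {"story_id": "12345", "title": "Example Fic",
     "url": "https://www.fanfiction.net/s/12345/"},
]

with open("HPSummary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["story_id", "title", "url"])
    writer.writeheader()          # first row names the columns
    writer.writerows(stories)     # one row per scraped story
```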

The script fetchContent.py reads each story link from the CSV and writes the story's contents to a local text file under the data folder, named by the story ID. The script sortData.py then combines those files with the needed delimiter and splits them into a larger training.txt file and a smaller test.txt file.
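That combine-and-split step can be sketched with the stdlib alone; the delimiter (GPT-2's <|endoftext|> marker) and the 90/10 split used here are assumptions, and the real sortData.py may differ:

```python
from pathlib import Path

# GPT-2's end-of-text token, assumed here as the story delimiter.
DELIMITER = "\n<|endoftext|>\n"

def combine_and_split(data_dir, out_dir, test_fraction=0.1):
    """Join the per-story text files and split them into train/test sets."""
    files = sorted(Path(data_dir).glob("*.txt"))
    n_test = max(1, int(len(files) * test_fraction))
    train_files, test_files = files[:-n_test], files[-n_test:]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, group in [("training.txt", train_files), ("test.txt", test_files)]:
        text = DELIMITER.join(p.read_text(encoding="utf-8") for p in group)
        (out / name).write_text(text, encoding="utf-8")
```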

Train your own model with AWS

You can train the model on your own computer, but unless you have a desktop with a dedicated graphics card and plenty of memory it will take a long time. That's why we opted to train the model on an Ubuntu AWS instance with dedicated machine-learning hardware.

Choose an instance type like g4dn.4xlarge (priced at ~$1 per hour). With those specs you can expect to train a model on ~100 MB of data in about 7 hours (roughly $7 of compute). Disk space fills up quickly, so choose at least 64 GB. When the instance is running, SSH in and run the following commands to train the model:

  1. Open the terminal, clone this repo and navigate to the folder:
$ git clone git@github.com:ceostroff/harry-potter-gpt2-fanfiction.git
  2. Create a virtualenv if you don't already have one, then activate it:
$ source env/bin/activate
  3. Install pip:
$ sudo apt-get update && sudo apt-get install python3-pip

And the requirements:

$ pip3 install -r requirements.txt
  4. Install CUDA. Follow NVIDIA's instructions, make sure to add CUDA to the PATH, then reboot the instance.

  5. Now you can train the model. Navigate to this folder again and run train_model.sh with:

$ ./train_model.sh

It is important to set --per_device_train_batch_size=1 and likewise --per_device_eval_batch_size=1. If you increase these values there is a very high chance that training will run out of memory.
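If you want a larger effective batch size without using more GPU memory, run_clm.py (the Hugging Face language-modeling example this project's script wraps) also accepts gradient accumulation; the fragment below is a sketch, and the step count of 8 is an arbitrary example:

```shell
# Effective train batch size = per-device size x accumulation steps (1 x 8 = 8);
# gradients are summed over 8 small steps before each optimizer update.
python run_clm.py \
    --per_device_train_batch_size=1 \
    --per_device_eval_batch_size=1 \
    --gradient_accumulation_steps=8 \
    ...
```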

  6. If the training crashes or runs out of memory you can resume from the last checkpoint. To do that, edit train_model.sh and change --output_dir to the path of the last model checkpoint, then run train_model.sh again:
python run_clm.py \
    --output_dir model/checkpoint-1000 \
    ...
  7. After the training is finished, open a terminal on your own computer and copy the model files from the server like this:
$ rsync -av ubuntu@YOUR_AWS_IP:~/harry-potter-gpt2-fanfiction/model/* .
  8. Voilà! Now you can turn off the AWS instance and run the text generation step on your own computer with:
$ ./text_generation.sh

Python cheatsheet

To activate the virtual environment:

$ source env/bin/activate

To update the requirements:

$ pip freeze > requirements.txt

To install the dependencies:

$ pip install -r requirements.txt