Story generation project using Seq2Seq networks
Note: to install pymeteor you must use the Test PyPI server:
pip install --index-url https://test.pypi.org/simple/ pymeteor
Tested with Python 3.6.6, PyTorch 0.4.1, and CUDA 9.0.176
Download the Wikipedia 2014 + Gigaword 5 pre-trained vectors (glove.6B) from the GloVe project page: https://nlp.stanford.edu/projects/glove/
Unzip the text files to the location `data/glove.6B/`
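Each unzipped file is plain text with one token per line followed by its vector components, so it can be read without any special tooling. A minimal sketch of loading one file into a dictionary (the 100d file choice is an assumption; glove.6B ships 50d/100d/200d/300d variants):

```python
import numpy as np

def load_glove(path):
    """Read a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            # First field is the token; the rest are the vector components
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

glove = load_glove('data/glove.6B/glove.6B.100d.txt')
```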
python3 story_generation.py
or ./story_generation.py
All command line arguments are optional, and any combination (besides `-h`/`--help`) can be used.
Arguments:
- `-h, --help`: provides help on command line parameters
- `--epoch <epoch_value>`: specify an epoch value to train the model for or load a checkpoint from
- `--embedding <embedding_type>`: specify an embedding to use from `[glove, sg, cbow]`
- `--loss <loss_dir>`: specify a directory to load loss values from (requires files `loss.dat` and `validation.dat`)
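For example, to train for 500 epochs with the GloVe embedding: `./story_generation.py --epoch 500 --embedding glove`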
Along with `story_generation.py`, several other files can be executed as standalone scripts:
- `perplexity_study.py` allows the user to gather perplexity results from the best model saved in the `obj` directory, or from a specific embedding type using the `--embedding` parameter
- `storygen/book.py` provides utilities to parse or filter standalone text into new files; run `./book.py -h` for more information
- `util/display_loss.py` allows the user to display the loss values for select word embeddings, with or without validation values; run `./display_loss.py -h` for more information
- `util/loss_analysis.py` allows the user to view min/max loss values of a given file, or find loss values at a specific epoch; run `./loss_analysis.py -h` for more information
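For example, to gather perplexity results for the Skip-Gram embedding: `python3 perplexity_study.py --embedding sg`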
Determined that loading from checkpoints was not working correctly and was causing spikes in loss values when a model was loaded.
Trained a Continuous Bag of Words word2vec embedding for 500 epochs. With this word2vec embedding and the pre-trained GloVe embedding, trained two models for 500 epochs without loading any checkpoints.
The GloVe embedding seems to heavily outperform the word2vec embedding, with a minimum loss value of 2.983 at the 250th epoch and a final loss value of 3.083.
The results for each model are below:
A perplexity study was also run on the model with the GloVe embedding, with the results in the table below:
Epochs | Sentence type | Training data | Testing data
---|---|---|---
250 | Actual sentences | 42.9893 | 736176.1818
250 | Random words | 146455354.9 | 147674868.5
250 | Random sentences | 295889.4025 | 299692.5997
500 | Actual sentences | 37.0408 | 4872462.827
500 | Random words | 1270328937 | 6201033178
500 | Random sentences | 27962670.27 | 2313469.597
Here, actual sentences refers to the perplexity of the model when, given an input sentence, it is forced to evaluate the real target sentence. Random words refers to the perplexity when, given an input sentence, the model is forced to evaluate a target sentence of the same length as the real target sentence but with each word replaced by a random word selected from the corpus. Lastly, random sentences refers to the perplexity when, given the input sentence, the model is forced to evaluate a target sentence of the same length as the real target sentence but randomly chosen from the data.
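In all three cases the perplexity can be computed the same way: the exponential of the average negative log-likelihood the decoder assigns to the forced target tokens. A minimal sketch, assuming the per-token log-probabilities (natural log) have already been collected from the model:

```python
import math

def sentence_perplexity(log_probs):
    """Perplexity of one forced target sentence from its per-token
    log-probabilities: exp of the mean negative log-likelihood."""
    nll = -sum(log_probs) / len(log_probs)
    return math.exp(nll)

# Example: a 4-token target scored by the decoder at these log-probs
print(sentence_perplexity([-1.2, -0.7, -3.1, -2.4]))  # ~6.36
```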
Out of the three categories of studies, actual sentences should perform the best (lowest perplexity), as they are the real sentences that follow the input sentences. They should be followed by random sentences, which are full real sentences and therefore have a real sentence structure, but may have nothing to do with the input. Finally, random words should perform the worst, as they will likely not have a real sentence structure.
The above table shows that this is the case: training data usually outperforms testing data, and actual sentences outperform random sentences, which in turn outperform random words. The 250th-epoch model had an actual-sentences value of 42.989, which was reduced to 37.041 by the 500th epoch of the same model. In fact, all training-data perplexity values decreased between the two models. However, the testing perplexity values all increased. Over the extra 250 epochs the model seemed to continue learning, but fit too closely to the training data and became more "perplexed" by data it had not seen before. Finally, the perplexity values themselves were extremely high for the testing data, further suggesting that the model has not learned enough to handle unseen data. One cause of this could be the lack of training data, with just over 5,000 sentence pairs.
Beyond correcting current drawbacks, such as the checkpoint loading issues, high testing perplexity values, and a minimum loss value stuck around 3, future work could include:
- Training and testing a working model on corpora of different types, such as news articles or song lyrics
- Training more custom embeddings, either training the current ones for much longer or using GloVe to train custom word embeddings rather than word2vec
Trained three word2vec embeddings on all Harry Potter texts: a Skip-Gram and a Continuous Bag of Words embedding trained for 15 epochs each, and a Continuous Bag of Words embedding trained for 300 epochs.
With these three word2vec embeddings, the previous GloVe embedding, and the default random embedding, trained five models for 500 epochs on the data.
The models still seem to be underfitting, with the word2vec embeddings outperforming the random embedding; the GloVe embedding still performs the best. See the results for loss values in the figure below:
Currently the model reaches a minimum loss value of 2.977 before the loss spikes around the 30th epoch, as can be seen in the figure below:
This seems to be due to the model beginning with the values from the above word embeddings, then breaking out and not being able to find the local optimum for the Harry Potter texts. An idea to correct this is to train our own word2vec embedding on the Harry Potter texts, e.g. along the lines of the sketch below.
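A minimal sketch of how such an embedding could be trained with gensim (the toy sentences and the output path are stand-ins; in practice the tokenized Harry Potter sentences would come from the corpus-parsing step):

```python
from gensim.models import Word2Vec

# Tokenized sentences from the corpus; a toy stand-in is shown here
sentences = [
    ['the', 'boy', 'who', 'lived'],
    ['mr', 'and', 'mrs', 'dursley', 'of', 'number', 'four'],
]

# sg=0 trains Continuous Bag of Words, sg=1 trains Skip-Gram
model = Word2Vec(sentences, sg=0, min_count=1)

# Save in word2vec text format for later loading into the seq2seq model
model.wv.save_word2vec_format('cbow.txt')
```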
When evaluated at the 100th epoch, the model is also underfitting, predicting sentences with repeated words.
Evaluating this model with beam search (k=1), the average perplexity is 100,562.7942.
Evaluating this model with beam search (k=5), the average perplexity is 93,277.5684.
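A minimal sketch of beam search decoding, assuming a hypothetical `step(seq)` function that wraps the trained decoder and yields (token, log-probability) pairs for the next token; with k=1 this reduces to greedy decoding:

```python
def beam_search(step, start_token, end_token, k=5, max_len=20):
    """Keep the k highest-scoring partial sequences at each step,
    where a sequence's score is its summed token log-probabilities."""
    beams = [([start_token], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:        # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            for token, log_p in step(seq):  # expand with each candidate next token
                candidates.append((seq + [token], score + log_p))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams[0]  # best (sequence, score) pair
```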
Retraining this model for 40 epochs, we get a minimum loss value of 2.959 at the 23rd epoch.
Evaluating this model at k=1 gives us a perplexity value of 675.8603.
Evaluating this model at k=5 gives us a perplexity value of 669.7439.
Viewing prediction results at this point in training, it is apparent that the model is not yet underfitting.
Refer to the figure below to see the chosen minimum loss value at epoch=22, before the loss value spikes at epoch=30.