prose2poetry

COMP 550 NLP Fall 2020 final project: Prose2poetry - Generating poetry from prose

Licenses

The code is licensed under the MIT license. Supplemental data (baseline corpora, etc.) have their own attributions in data/. Models are trained from scratch and stored in ./models. With a single novel as the input corpus, the trained models take about 2GB of disk space combined, so they are not included in the repo. Retraining them from scratch takes roughly 10 minutes.

Install dependencies

Most of the dependencies are listed in requirements.txt. Install them in a virtualenv (or the tool of your choice).

$ pip install -r ./requirements.txt

Code structure

There are 3 runnable scripts:

evaluate_metrics.py - load and score baselines and generated couplets
evaluate_doc2vec.py - evaluate semantic similarity in the generated couplets
prose2poetry.py - take seed words as an input and produce an output poem

The important code is in the embedded prose2poetry library:

generators.py contains the poetry generators, including an LSTM and a Markov chain model
rhyme_score.py contains our custom rhyme_score function using phoneme data from the CMUdict
couplet_score.py contains the couplet scorer which incorporates rhyme score on the end words, and a syllabic meter score
vector_models.py contains gensim FastText and doc2vec embedding models, plus training and loading code
corpora.py contains classes that facilitate loading and filtering couplets from nltk's Gutenberg corpus, the Gutenberg Poetry corpus, and the PoetryFoundation corpus (included in data)
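
To give a feel for how phoneme-based rhyme scoring and syllable counting can work, here is a minimal sketch. It is not the code in rhyme_score.py or couplet_score.py. The function names, the scoring rule (fraction of matching trailing phonemes after the last stressed vowel), and the small hard-coded pronunciation table are all assumptions made for illustration. The pronunciations follow CMUdict's ARPAbet convention, where vowels carry a stress digit.

```python
# Illustrative sketch only; NOT the project's rhyme_score implementation.
# A tiny hand-copied excerpt of CMUdict-style pronunciations (ARPAbet).
# Vowel phonemes end in a stress digit: 0 = unstressed, 1 = primary, 2 = secondary.
PRONUNCIATIONS = {
    "love": ["L", "AH1", "V"],
    "dove": ["D", "AH1", "V"],
    "move": ["M", "UW1", "V"],
    "heart": ["HH", "AA1", "R", "T"],
    "apart": ["AH0", "P", "AA1", "R", "T"],
}


def is_vowel(phoneme):
    """Vowels are the phonemes that end in a stress digit."""
    return phoneme[-1].isdigit()


def rhyming_part(phonemes):
    """Return the phonemes from the last stressed vowel to the end of the word."""
    for i in range(len(phonemes) - 1, -1, -1):
        if phonemes[i][-1] in "12":
            return phonemes[i:]
    return phonemes  # no stressed vowel found: fall back to the whole word


def rhyme_score(word_a, word_b):
    """Hypothetical score in [0, 1]: share of trailing phonemes that match."""
    a = rhyming_part(PRONUNCIATIONS[word_a.lower()])
    b = rhyming_part(PRONUNCIATIONS[word_b.lower()])
    matches = 0
    # Walk both rhyming parts from the end and stop at the first mismatch.
    for pa, pb in zip(reversed(a), reversed(b)):
        if pa != pb:
            break
        matches += 1
    return matches / max(len(a), len(b))


def count_syllables(word):
    """One syllable per vowel phoneme; usable as input to a meter score."""
    return sum(1 for p in PRONUNCIATIONS[word.lower()] if is_vowel(p))
```

With this table, rhyme_score("love", "dove") returns 1.0 and rhyme_score("love", "move") returns 0.5, since only the final consonant matches. A real scorer would look words up in the full CMUdict and would need a policy for words with several pronunciations or none.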

Usage

On a fresh clone, the data dir contains the baseline corpora, which are stored with Git-LFS. Confirm that the data directory is about 75M in size. If it isn't, you may need to run git-lfs pull.
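
The size check can also be done from Python. The sketch below is a hypothetical helper, not part of the repo: it sums file sizes under a directory and detects files that are still Git LFS pointers (small text files whose first line is the LFS spec version line) rather than the real corpora.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def looks_like_lfs_pointer(path):
    """True if the file starts with the Git LFS pointer version line."""
    with open(path, "rb") as f:
        head = f.read(64)
    return head.startswith(b"version https://git-lfs.github.com/spec/v1")
```

For example, directory_size_bytes("data") / 1e6 should come out at roughly 75 once the corpora are present. If it is far smaller and looks_like_lfs_pointer returns True for the corpus files, the LFS objects have not been pulled yet.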

The models directory on a fresh clone is empty. This is where FastText, doc2vec, and the LSTM models are stored after training. The first time you run ./evaluate_metrics.py or ./prose2poetry.py, the models will be trained and saved. To reset the training (e.g. if changing the input corpus), delete the contents of the models directory.
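
The train-once, load-afterwards behaviour described above follows a common caching pattern, sketched below. This is an illustration under assumptions, not the project's code: load_or_train and train_fn are hypothetical names, and pickle stands in for whatever save/load format the real FastText, doc2vec, and LSTM models use.

```python
import os
import pickle


def load_or_train(model_path, train_fn):
    """Load a cached model from model_path, or train it and cache the result.

    train_fn is a hypothetical zero-argument callable that returns a
    picklable model object.
    """
    if os.path.exists(model_path):
        # Cached model found: skip training entirely.
        with open(model_path, "rb") as f:
            return pickle.load(f)

    model = train_fn()
    # Create the models directory if needed, then save for the next run.
    os.makedirs(os.path.dirname(model_path) or ".", exist_ok=True)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return model
```

This pattern also explains why the cached files must be deleted after changing the input corpus: as long as a saved model exists at the expected path, it is loaded as-is and no retraining happens.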

Example

Generating couplets from the novel Emma by Jane Austen, using the seed word "love":