poetwriter: An OpenEdge ABL repository from ntindall

TO DOWNLOAD FINAL PDF WRITEUP:
https://github.com/ntindall/poetwriter/blob/master/tex/writeup.pdf?raw=true

---------------------------------------------------------------
- FINAL PROJECT: POETWRITER                                   -
---------------------------------------------------------------
- Mathieu Rolfo, Shalom Rottman-Yang, Nathan Tindall          -
- CS221 Autumn 2014                                           -
- Stanford University                                         -
- Prof Percy Liang                                            -
- TA Aparna Krishnan                                          -
- http://web.stanford.edu/class/cs221/                        -
- 12 December 2014                                            -
---------------------------------------------------------------

---------------------------------------------------------------
- CONTENTS                                                    -
---------------------------------------------------------------
* INTRODUCTION
* USAGE
* EXAMPLE
* FILES
* ALGORITHMIC APPROACH
* DATA

---------------------------------------------------------------
- INTRODUCTION                                                -
---------------------------------------------------------------
Our task for this project was to design a framework for poetry 
generation that allows a human to give a corpus as input and 
specify formal constraints, and have our intelligent creation 
generate novel, semantically meaningful poetry meeting those 
constraints. Our algorithm has special optimizations for the 
corpora of rap artists in an attempt to generate rap-styled 
poetry; however, the general case of the problem is one of 
poetic generation. In this project, our constraints are on the 
syllable counts within lines and the rhyming patterns between lines. 

---------------------------------------------------------------
- USAGE                                                       -
---------------------------------------------------------------
The algorithm can be invoked by running generator.py, with the 
following flags:
* (-n), default 1
  Order of the n gram model
* (-f), default "whitman.txt"
  Corpus file, used to train the language model.
* (-o), default 3
  The number of poems to output
* (-l), default 1
  The phrase length (number of lines that the n-gram wraps 
  around before reseeding).
* (-s)
  The type of corpus, in order to facilitate corpus cleanup
* (-p), default true
  Whether the model is probabalistic when selecting actions (if 
  not selected, chooses more frequent seeds first
* (-b), default 3
  The number of initial seeds to try before backtracking 
  (when the grammar is reseeded)
* (-r)
  The number of children seeds to try before backtracking, 
  selects the r most frequent children or the first r children
  randomly selected (if probabilistic).
* (-t), default quad
  The type of poetry to output (options: sonnet, haiku, eight, 
  octave, quad)
* (-v), default 0
  The verbosity of the program. A value of 1 or higher lets you
  watch the algorithm's search process.

---------------------------------------------------------------
- EXAMPLE                                                     -
---------------------------------------------------------------
# python generator.py -n 2 -f lyrics/eminem.txt -o 4
# python generator.py -f corpora/shakespeare.txt -n 2 -o 1 -s rap -l 2 -b 10 -t octave -r 3 -p 1
# python generator.py -f corpora/shakespeare.txt -n 2 -o 2 -l 2 -b 10 -t eight -r 3 -p 1 -v 1
# python generator.py -f corpora/chance.txt -n 2 -o 5 -l 2 -b 10 -t eight -r 3 -p 1 -v 1

---------------------------------------------------------------
- FILES                                                       -
---------------------------------------------------------------
baseline.py: Algorithmic baseline
* Trains language model with user specified corpus (-f), number
  of poems generated (-o), and model order (-n).
* Formed algorithmic backbone and framework on which to improve
* Archaic

generator.py: Main execution
* Run from command line to generate poetry, parses command line
  to orient hyperparameters of algorithm as specified by the
  user
* See "Usage" below for invocation
* Instantiates a corpus and runs n gram analysis
* Determines parameters of the poetry object
* Instantiates a grammar (language model) based upon the corpus
* Runs search algorithm
* Prints statistics 

grammar.py: Corpus and Grammar Classes
* The Corpus class performs n gram analysis, generating a
  frequency map, a word map, and a beginning word map.
* The Grammar class is instantiated from the data structures
  of the Corpus model, and is general enough to work in a non
  search oriented implementation.

new_word_data.p: Dictionary
* Generated using the Moby Pronunciation Dictionary and Moby Part of Speech Dictionary, the largest public domain dictionaries of their kinds with over 100,000 entries. 
* Our heuristic for syllable count is the number of unstressed vowels.
* Our heuristic for rhyming is the last vowel in the word and any trailing consonants
* Keys: words in plaintext
* Values: (IPA Transcription, IPA Suffix, Number of Syllables,
           Part of Speech, isNotPrepOrArticle)

poetry.py: Poetry and Line Classes
* Poetry class is comprised of Line classes. Evaluation returns
  true when all inter- and intra-line constraints have been 
  satisfied.
* Line class stores a list of the current words in the line, the current syllable count, and inter-line constraints it received or needs to propagate (which lines it must rhyme with)

rap_genius_scraper.rb: A scraper for rapgenius.com
* Used to aggregate rap corpora
* Takes two command line arguments: the song id of the desired artists' song, and the name of the output file.

searchutil.py: Search Algorithms
* Houses UCS and DFS implementations
* DFS code tweaked to return as soon as solution is found, 
  rather than searching all paths to find the best solution 
  based upon some cost.

util.py: Utility functions
* clean(string): Used to strip down corpus of a-zA-Z0-9 chars
* weightedRandomChoice(weightDict): Chooses item from a pool
* isNoun(word): Returns True if word is a noun (accessess 
  pickle dict)
* isNotPrepOrArticle(word): Returns True if word is not prep or
  article (acesses pickle dict)
* numSyllables(ipa_reading): Returns the number of syllables in
  an IPA Unicode String
* getSyllables(word): Returns the number of syllables in a word
  (acesses pickle dict)
* rhymeVowel(ipa_reading): Algorithm for determining IPA suffix
* rhyme(word1, word2): returns True if two words rhyme, acesses
  IPA dictionary
* Loads the pickle dictionary when imported

---------------------------------------------------------------
- ALGORITHMIC APPROACH                                        -
---------------------------------------------------------------
Our algorithm uses depth first search in order to solve the 
constraints of the poetry object. We have defined the search
problem in the following manner:

* startState: (current poem state, current seed)
* isGoal(state): state[0] is a completed poem
* succAndCost(state): performs search over the language model
  by iteratively re-seeding corpus in an attempt to satisfy
  the Poetry class. Words are added to lines, and paths dis-
  carded if they break the constraints of the model.

---------------------------------------------------------------
- Data		            	                              -
---------------------------------------------------------------
eliot.txt
* n=1: Mean time = 91.66s; Median time = 66.84s; Mean states = 4773.75; 
Median states = 3547.5; Completion rate = 100%; Semantic score = 1.35
* n=2: Mean time = 0.21s; Median time = .18s; Mean states = 163.05; 
Median states = 111.5; Completion rate = 0%; Semantic score = N/A
* n=3: Mean time = 0.09s; Median time = 0.09s; Mean states = 16.5; 
Median states = 16; Completion rate = 0%; Semantic score = N/A

whitman.txt
* n=1: Mean time = 1352.69s; Median time = 165.73s; Mean states = 4959; 
Median states = 532; Completion rate = 100%; Semantic score = 1.66
* n=2: Mean time = 18.203s; Median time = 12.101; Mean states = 1705.6; 
Median states = 1184; Completion rate = 100%; Semantic score = 2.45
* n=3: Mean time = 1.548s; Median time = 1.415s; Mean states = 80.2; 
Median states = 60.5; Completion rate = 0%; Semantic score = N/A

shakespeare.txt
* n=1: Mean time = 6480s; Median time = 6480s; Mean states = 9333.80; 
Median states = 397; Completion rate = 100%; Semantic score = 1.20
(Only five poems generated after 9 hour run time; only these data points reported)
* n=2: Mean time = 81.918s; Median time = 30.761s; Mean states = 3784; 
Median states = 1305.05; Completion rate = 100%; Semantic score = 2.18
* n=3: Mean time = 14.96s; Median time = 13.66s; Mean states = 1220.90; 
Median states = 1023; Completion rate = 45%; Semantic score = 3.22