This project demonstrates how a Recurrent Neural Network (RNN) model can be used to generate a synthetic TV/film script.
We build and train an RNN-based neural network to automatically generate our own fake Seinfeld TV script. It uses a small subset of the script from the Seinfeld TV sitcom's nine seasons to train the network. The trained network then generates a new, synthetic TV script based on patterns it learned from the training data. RNNs are ideally suited to sequence problems such as this because they take advantage of the underlying structure of the data, namely the order of the data points.
To build an RNN-based neural network that accepts a text corpus (TV/film script, book, etc.), learns from it, and then, given a seed prime word,
generates a fake text snippet that looks as if it came from the original script/book.
Note that even though the Seinfeld script is used here to train the model and generate text, any other text corpus (TV/film script, book, etc.) can be used to re-train the model and generate the corresponding text.
- The dataset used in this project is a subset of the original Seinfeld dataset on Kaggle.
- The dataset is provided as part of this repo, and it's kept in
data/Seinfeld_Scripts.txt
as a plain text file.
The Streamlit-based app is composed of two tabs, "ABOUT" and "SCRIPT WRITER".
Describes the objective and usage of the app.
Accepts a text prompt/seed (prime word) and the length (number of words) of the script to be generated. Pressing the "Generate" button kicks off the script generation process and produces a text snippet that looks as if it came from the original script/book.
Note that the prime word must be a word that is present in the vocabulary of the training text corpus. For reference, the vocabulary can be downloaded directly from the writer tab.
Note that even though the Seinfeld script is used to train the model and generate the text, any other text corpus (TV/film script, book, etc.) can be used to re-train the model and generate the corresponding text. The same text-generation technique can serve multiple use cases.
- The dataset used in this project is provided by Udacity
- The above dataset is a subset of the Seinfeld dataset on Kaggle
- GRU and RNN diagrams courtesy of Daniel V. Godoy
To run the app locally...
- Create a conda or virtual environment and activate it
- Install Python 3.11.0 or above
- Execute the commands below from a terminal/command prompt
git clone https://github.com/sssingh/tv-script-generation-rnn
cd tv-script-generation-rnn
pip install jupyter
pip install -r requirements.txt
streamlit run src/app.py
NOTE: The model building and training code is in
notebooks/tv_script_generation.ipynb
Jupyter notebook. If you wish to change the model architecture and re-train the network, use this notebook. To train the model, it's recommended to execute the notebook one cell at a time, as some steps might take a while to run. If a GPU is available (recommended), it'll be used automatically; otherwise, training will fall back to the CPU.
Fully trained model weights (rnn_model_weights.pt) and preprocessed data (preprocess.p), trained on a small subset of the Seinfeld script, can be downloaded from here and here respectively.
Once the model and data are downloaded (into the same directory where the notebook is kept), the TV script can be generated directly from the notebook; follow the steps below...
# Load the trained model
trained_rnn = load_model('rnn_model_weights.pt')
# You can choose a different length
gen_length = 400
# You can set the prime word to _any word_ in our dictionary, but it's best to start with a name for generating a TV script.
prime_word = 'elaine'
# Generate script and print
pad_word = SPECIAL_WORDS['PADDING']
generated_script = generate(trained_rnn, vocab_to_int[prime_word + ':'], int_to_vocab, token_dict, vocab_to_int[pad_word], gen_length)
print(generated_script)
- Load the Seinfeld TV script Seinfeld_Scripts.txt as a text blob.
- Once the data is loaded, we explore it to view different parts of the script. This gives us a sense of the structure of the data we'll be working with. For example, it is all lowercase text, and a newline character \n separates each line of dialogue. We can extract information such as the dataset stats shown below...
Dataset Stats
Roughly the number of unique words: 46367
Number of lines: 109233
The average number of words in each line: 5.544240293684143
The lines 0 to 10:jerry: do you know what this is all about? Do you know, why we are here? To be out, this is out...and out is one of the single most enjoyable experiences of life. people...did you ever hear people talking about we should go out? this is what theyre talking about...this whole thing, were all out now, no one is home. not one person here is home, were all out! there are people trying to find us, they dont know where we are. (on an imaginary phone) did you ring?, i cant find him. where did he go? he didnt tell me where he was going. he must have gone out. you wanna go out you get ready, you pick out the clothes, right? you take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...then youre standing around, what do you do? you go we gotta be getting back. once youre out, you wanna get back! you wanna go to sleep, you wanna get up, you wanna go out again tomorrow, right? where ever you are in life, its my feeling, youve gotta go.
jerry: (pointing at georges shirt) see, to me, that button is in the worst possible spot. the second button literally makes or breaks the shirt, look at it. its too high! its in no-mans-land. you look like you live with your mother.
george: are you through?
jerry: you do of course try on, when you buy?
george: yes, it was purple, i liked it, i dont actually recall considering the buttons.
We need to prepare the textual data for modeling, i.e., preprocess the data so it can be fed to a neural network as numbers.
- We define a lookup table (dictionary) for punctuation symbols, as shown below, and then replace each occurrence of punctuation in the text with its corresponding symbol.
punct = {'.': '||Period||', ',': '||Comma||', '"': '||Quotation_Mark||', ';': '||Semicolon||', '!': '||Exclamation_mark||', '?': '||Question_mark||', '(': '||Left_Parentheses||', ')': '||Right_Parentheses||', '-': '||Dash||', '\n': '||Return||' }
- The text corpus is then converted into lowercase letters.
- We then go through the whole text corpus and generate two mappings...
  - vocab_to_int: maps each unique word (key) in the text to a unique integer (value). This map is used to encode the text into numbers to feed to our network.
  - int_to_vocab: the reverse of the above map, where each unique integer (key) is mapped to a unique word (value). This map converts the model-generated output (numbers) back into text to be presented to the user.
- Using the vocab_to_int map, we encode our text corpus into a massive list of numbers.
- Since the data preprocessing done above can take quite a while to complete, as a final step we serialize (pickle) all three lookup tables created above, along with the encoded text, as Python objects in the preprocess.p file. This way, if we need to run the notebook from the start again, we do not need to re-run the preprocessing, which saves a lot of time. A sketch of these preprocessing steps is shown below.
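As a rough illustration of these preprocessing steps, here is a minimal sketch; the function names (token_lookup, create_lookup_tables, preprocess_and_save) are illustrative and may not match the notebook's exact helpers.

```python
import pickle
from collections import Counter

def token_lookup():
    """Punctuation-to-token lookup table (the same idea as the punct dict above)."""
    return {'.': '||Period||', ',': '||Comma||', '"': '||Quotation_Mark||',
            ';': '||Semicolon||', '!': '||Exclamation_mark||', '?': '||Question_mark||',
            '(': '||Left_Parentheses||', ')': '||Right_Parentheses||',
            '-': '||Dash||', '\n': '||Return||'}

def create_lookup_tables(words):
    """Build vocab_to_int and int_to_vocab from a list of tokenized words."""
    word_counts = Counter(words)
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    int_to_vocab = {i: word for i, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: i for i, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab

def preprocess_and_save(text, path='preprocess.p'):
    """Replace punctuation with tokens, lowercase, encode the corpus, and pickle everything."""
    token_dict = token_lookup()
    for symbol, token in token_dict.items():
        text = text.replace(symbol, f' {token} ')
    words = text.lower().split()
    vocab_to_int, int_to_vocab = create_lookup_tables(words)
    encoded_text = [vocab_to_int[word] for word in words]
    with open(path, 'wb') as f:
        pickle.dump((encoded_text, vocab_to_int, int_to_vocab, token_dict), f)
```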
The problem at hand requires our network to learn from sequential input data. Data is sequential if its elements appear in a particular order, such as the words of a sentence in a TV/film script. Our network is then expected to produce another data sequence (a fake TV/film script) that appears to have come from the original corpus. In other words, our objective is to generate meaningful text. Our network consumes a starting word (the prime word) and then tries to predict the next most likely word based on that prime word. It then tries to predict the following word based on the previously predicted words, and so on. Unlike a conventional dataset with a well-defined set of input features (independent variables) and a target feature (dependent variable) for models to learn from, textual data does not offer an obvious separation into inputs and targets. Sequential textual data needs to be broken down to produce input features (in our case, a sequence of words) and a target feature (the word that appears next in the text). The inputs and targets are then paired to create a single data point (data sample), and data points are batched together to make training batches of a specified size that can be fed to the neural network.
We have defined a function batch_data(words, sequence_length, batch_size) to perform the data batching operation. The function splits the encoded input text (words) into sequences of words, each with an equal number of words. The first sequence starts at the first word and spans sequence_length consecutive words of the input text, where each word in the sequence can be considered an independent variable. This sequence is then paired with the next word in the input text as the dependent variable (target). The paired sequence and its target form a single data point. The process is repeated, but this time the sequence starts from the second word of the input text, and so on. Finally, the data points are packed into batches of the given batch_size using TensorDataset and DataLoader objects.
For example, say we have these as input:
words = [1, 2, 3, 4, 5, 6, 7]
sequence_length = 4
The first sequence would contain the values:
[1, 2, 3, 4] # independent-variable (input features)
And the corresponding target should be the following word:
5 # dependent-variable (target)
Now the second sequence would contain the values (note that it's starting from the second word (i.e., 2) this time):
[2, 3, 4, 5] # independent-variable (input features)
6 # dependent-variable (target)
For example, let's say we have the encoded input text as the numbers [0...52]; a call to batch_data(test_text, sequence_length=5, batch_size=10)
will produce 2 batches. The first batch (at most ten sequences of at most five words each) is shown below...
# independent-variable (input features)
tensor([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29],
[30, 31, 32, 33, 34],
[35, 36, 37, 38, 39],
[40, 41, 42, 43, 44],
[45, 46, 47, 48, 49]])
# dependent-variable (target)
tensor([ 5, 10, 15, 20, 25, 30, 35, 40, 45, 50])
Note that the above target is the next word after each sequence (5 for the first sequence, 10 for the next sequence, and so on)
Since there are more than 50 words, the second batch will be prepared with leftover words, as shown below...
# independent-variable (input features)
tensor([[21387, 21387, 50, 51, 52]])
# dependent-variable (target)
tensor([0])
A couple of important things to note...
(1) Since there are only three words left and the last sequence consumes all of them, its target wraps around to the very first word of the input text (0 in this case)
(2) The first two numbers, 21387, are the special padding token used to pad the short sequence so that all sequences are of equal length (5 in this case)
A sketch of such a batching function is shown below.
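Below is a minimal sketch of a batching function that reproduces the example output above (stepping through the text sequence_length words at a time, left-padding the final short sequence, and wrapping its target around to the first word); the notebook's actual batch_data may differ in details such as the step size. The PAD_VALUE constant is illustrative.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Illustrative: the notebook uses the integer id of its special padding token (21387 above).
PAD_VALUE = 21387

def batch_data(words, sequence_length, batch_size, pad_value=PAD_VALUE):
    """Split encoded words into (sequence, next-word) pairs and batch them."""
    features, targets = [], []
    for start in range(0, len(words), sequence_length):
        seq = list(words[start:start + sequence_length])
        target_idx = start + sequence_length
        if target_idx < len(words):
            target = words[target_idx]
        else:
            target = words[0]                          # wrap around to the first word
        if len(seq) < sequence_length:                 # pad the last short sequence
            seq = [pad_value] * (sequence_length - len(seq)) + seq
        features.append(seq)
        targets.append(target)

    dataset = TensorDataset(torch.tensor(features), torch.tensor(targets))
    return DataLoader(dataset, batch_size=batch_size, shuffle=False)

# Example from the text: numbers 0..52, sequence_length=5, batch_size=10 -> 2 batches
loader = batch_data(list(range(53)), sequence_length=5, batch_size=10)
for inputs, target in loader:
    print(inputs)
    print(target)
```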
Since we are dealing with sequential data, we'll use an RNN-based neural network, which is best suited for such a task. An RNN is well suited to sequential data because the order in which elements appear in the sequence is critical. An RNN cell keeps track of the hidden state from the previous time step along with the input of the current time step to produce an output (which in turn is fed to the next time step as the previous hidden state). A typical high-level structure of an RNN cell is shown below...
A crucial point to note here is that even though two cells are shown one after the other in the right-hand diagram, it is the same cell, unrolled for clarity. The same cell consumes its own hidden state from the prior time step together with the current time-step input and produces a hidden state for the next time step. For example, each word in a sentence (sequence) can be considered a time step; the words are processed one at a time, and the output (the resulting hidden state) is fed into the next time step. In essence, it's a loop-back mechanism. It is this behavior that allows RNN cells to account for dependencies within the input and extract semantic meaning from it.
Mathematically, at time step t, the hidden output of the RNN is computed as follows...
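Written out in the standard vanilla-RNN parameterization (the weight matrices W_xh, W_hh and the bias b_h below are the usual notation, assumed here for reference):

```math
h_t = \tanh\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right)
```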
As shown above, the tanh (hyperbolic tangent) function is used as the activation function here. It's very typical of RNN networks to use tanh instead of the ReLU activation function used by other types of neural networks. This is because ReLU has no upper bound, whereas tanh maps the feature space to the interval (-1, 1). This ensures that, at every step of the sequence, the hidden state is always between -1 and 1. Given that we have only one linear layer to transform the hidden state, regardless of which step of the sequence it is used at, it is convenient to have its values within a predictable range.
An RNN cell structure at a lower level (neuron level) is shown below...
The number of hidden neurons is a hyperparameter of the network, chosen to be 2 in the above diagram (hence the two blue neurons in the hidden-state section). The number of input neurons is automatically determined to be exactly the same as the number of hidden neurons because the transformed input and the hidden state need to be added together; hence, their shapes must match. The dimension of the input features (words), however, could be anything; it is another hyperparameter of the network. The above diagram shows a two-dimensional (x0 and x1) input vector. In this project, we use an Embedding layer that transforms each word (which we have encoded as a single unique number) into a multi-dimensional vector representation.
- In practice, the plain vanilla RNN shown above, in which equal importance is given to both the previous hidden state and the current input, is hardly used nowadays. What if the previous hidden state contains more information than the new hidden state, or the current input adds more information than the previous hidden state? Vanilla RNNs have no mechanism to assign weights (how much to keep or ignore) to the previous hidden state, the new hidden state, and the current input. Improved RNN variants such as the LSTM (Long Short-Term Memory) and the GRU (Gated Recurrent Unit) tackle these issues, and in practice an LSTM or GRU cell is used to build RNN-based neural networks. In this project, we use the GRU cell because it is slightly lighter than the LSTM cell and trains a bit faster. An LSTM might give better results than the GRU, but it would likely take much longer to train. A typical GRU cell structure is shown below...
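For reference, one common set of GRU update equations (the parameterization used by PyTorch's nn.GRU; the W's and b's are learned parameters, sigma is the sigmoid function, and the circled dot denotes element-wise multiplication):

```math
\begin{aligned}
r_t &= \sigma(W_{xr} x_t + b_{xr} + W_{hr} h_{t-1} + b_{hr}) &&\text{(reset gate)}\\
z_t &= \sigma(W_{xz} x_t + b_{xz} + W_{hz} h_{t-1} + b_{hz}) &&\text{(update gate)}\\
n_t &= \tanh\bigl(W_{xn} x_t + b_{xn} + r_t \odot (W_{hn} h_{t-1} + b_{hn})\bigr) &&\text{(candidate state)}\\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1} &&\text{(new hidden state)}
\end{aligned}
```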
- We have defined the network as a subclass of the PyTorch nn.Module class.
- The network architecture is shown below...
- The very first layer of the network is an Embedding layer, which has an input size equal to our vocabulary size (the number of unique words in the training text corpus) and outputs a vector representation of each word in the vocabulary. The size of this vector is yet another hyperparameter, called the embedding dimension, that we define; we use 300 as the embedding vector size. Note that the vector representations of the words form another weight matrix that the embedding layer learns during training. So an embedding layer is nothing but a giant lookup table of weights where each row corresponds to a unique word in our vocabulary. For a given word, we look up this table and extract an embedding-dimension-long vector representation of the word, which is then fed to the next layer of the network.
- The next layer is a GRU RNN cell that gets word vectors (embedding_dim long) from the embedding layer as input and produces hidden_dim-long outputs. We use two layers of the GRU cell, one stacked on top of the other, hence the n_layers parameter is set to 2.
- We use 50% dropout within the GRU cell and 50% dropout between the GRU cell and the final fully-connected layer, as sketched in the code below.
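A minimal sketch of this architecture in code, assuming the hyperparameters described in the text; the class name ScriptRNN and details such as the exact dropout placement are illustrative rather than the notebook's exact implementation.

```python
import torch
import torch.nn as nn

class ScriptRNN(nn.Module):
    """Embedding -> 2-layer GRU -> dropout -> fully-connected, as described above."""

    def __init__(self, vocab_size, output_size, embedding_dim=300,
                 hidden_dim=512, n_layers=2, dropout=0.5):
        super().__init__()
        self.output_size = output_size
        self.hidden_dim = hidden_dim
        # One embedding_dim-long vector per vocabulary word (a learned lookup table)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Two stacked GRU layers; dropout is applied between the stacked layers.
        # batch_first=True so inputs/outputs have batch_size as the first dimension.
        self.gru = nn.GRU(embedding_dim, hidden_dim, n_layers,
                          dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        # Final fully-connected layer mapping the hidden state to vocabulary scores
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x, hidden=None):
        batch_size = x.size(0)
        embeds = self.embedding(x)                        # (batch, seq_len) -> (batch, seq_len, embedding_dim)
        out, hidden = self.gru(embeds, hidden)            # out: (batch, seq_len, hidden_dim)
        out = self.dropout(out)
        out = out.contiguous().view(-1, self.hidden_dim)  # flatten to (batch*seq_len, hidden_dim)
        out = self.fc(out)                                # (batch*seq_len, output_size)
        out = out.view(batch_size, -1, self.output_size)  # (batch, seq_len, output_size)
        return out[:, -1], hidden                         # keep only the last time step
```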
The shapes of the inputs and outputs produced by each layer of an RNN can sometimes get confusing, so it helps to look closely at the input shape that goes into a layer and the output shape that comes out of it. For reasons best known to its designers, PyTorch's RNN cells (RNN, GRU, LSTM) by default expect inputs and produce outputs with sequence_length as the first dimension and batch_size as the second. However, we generally get batch_size as the first dimension from DataLoaders. To use batch_size as the first dimension with an RNN, we need to pass the batch_first=True parameter; you can see this being used with our GRU cell. With batch_first=True, PyTorch RNN cells handle input and output tensors with batch_size as the first dimension, but the hidden state still has batch_size as the second dimension. Since we seldom deal directly with the hidden state, this is not a big issue, but it's good to be aware of this PyTorch RNN quirk.
For our network, we have selected the following hyperparameters...
batch_size = 64, sequence_length = 15, embedding_dim = 300, hidden_dim = 512, n_layers = 2
RNN cells take two input tensors...
- The input sequence, with shape (batch_size, sequence_length, input_dim) --> (64, 15, 300) (input_dim is the output of the embedding layer, i.e., embedding_dim)
- The hidden state, with shape (n_layers, batch_size, hidden_dim) --> (2, 64, 512)
And they produce two output tensors...
- The output (hidden state) for all time steps, with shape (batch_size, sequence_length, hidden_dim) --> (64, 15, 512)
- The final hidden-state representation of the full sequence, with shape (n_layers, batch_size, hidden_dim) --> (2, 64, 512); note that batch_size is the second dimension here. (This is fed back as the previous hidden state in the next cycle.)
In the forward function of the neural network, we have...
- First, the Embedding layer takes the word sequence as input in shape (64, 15) and produces an output of shape (64, 15, 300).
- Next, the GRU layer takes the output of the Embedding layer, of shape (64, 15, 300), as its first input and the hidden state (from the previous time step, or the initial hidden weights if it's the very first time step) as its second input, of shape (2, 64, 512). It then produces the cell output of shape (64, 15, 512) and the cell hidden state of shape (2, 64, 512).
- 50% dropout is then applied to the GRU cell outputs.
- The GRU cell output is then flattened to (64 * 15, 512) and fed to the fully-connected linear layer.
- The output of the fully-connected layer is reshaped into (batch_size, sequence_length, output_size), where output_size equals the vocabulary size; the last time-step output is then extracted for each sequence in the batch.
- The extracted output and the hidden state from the GRU are then returned to the training loop to be used in the next iteration of training.
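As a quick sanity check of these shapes, the snippet below runs a dummy batch through the ScriptRNN sketch above (the vocabulary size of 21388 is illustrative):

```python
vocab_size = 21388                                 # illustrative; equals len(vocab_to_int) in practice
model = ScriptRNN(vocab_size, output_size=vocab_size)

x = torch.randint(0, vocab_size, (64, 15))         # (batch_size, sequence_length) of word ids
h0 = torch.zeros(2, 64, 512)                       # (n_layers, batch_size, hidden_dim)

out, hidden = model(x, h0)
print(out.shape)     # torch.Size([64, 21388])  -> next-word scores for each sequence in the batch
print(hidden.shape)  # torch.Size([2, 64, 512]) -> final hidden state, batch_size as second dim
```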
We use the Adam optimizer with a learning rate of 0.0007 and CrossEntropyLoss as the loss function for network training. The final learning rate and the hyperparameters listed above were selected based on numerous experiments...
- At first, we kept lr = 0.001, hidden_dim at 256, and embedding_dim at 200 with seq_len = 5, but the loss did not seem to decrease.
- Keeping all other parameters the same, we slowly increased seq_len and found that seq_len = 15 made the loss go down, but it then stopped decreasing.
- We then gradually increased hidden_dim from 128 to 512 and embedding_dim from 100 to 300 and observed that the loss started to decrease, but toward the end it fluctuated a lot.
- Keeping all other parameters the same, we tried various learning rates and found that 0.0007 makes the loss go down consistently.
- We also experimented with n_layers: with n_layers = 3 the loss starts to increase within a couple of epochs, and with n_layers = 1 the loss does not decrease and saturates quickly.
Based on the above experimentation, we finally settled on the following hyperparameters...
sequence_length = 15
batch_size = 64
num_epochs = 10
learning_rate = 0.0007
embedding_dim = 300
hidden_dim = 512
n_layers = 2
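Below is a minimal sketch of a training loop using these hyperparameters, assuming the batch_data loader and ScriptRNN model sketched earlier; the notebook's actual loop may differ in details such as gradient clipping or hidden-state handling.

```python
import torch
import torch.nn as nn

def train(model, loader, num_epochs=10, learning_rate=0.0007):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(1, num_epochs + 1):
        losses = []
        hidden = None                                            # fresh hidden state each epoch
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            if hidden is not None:
                if hidden.size(1) != inputs.size(0):             # last (smaller) batch
                    hidden = None
                else:
                    hidden = hidden.detach()                     # truncate backprop through time
            optimizer.zero_grad()
            output, hidden = model(inputs, hidden)               # output: (batch, vocab_size)
            loss = criterion(output, targets)
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), 5)      # guard against exploding gradients
            optimizer.step()
            losses.append(loss.item())
        print(f'Epoch: {epoch}/{num_epochs}    Loss: {sum(losses) / len(losses)}')
    return model
```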
- Trained the model with the above hyperparameters for ten epochs, achieved a loss of ~2.5, and saved/serialized the model as
trained_rnn.pt
Training for 10 epoch(s)...
Epoch: 1/10 Loss: 5.568603610992431
Epoch: 2/10 Loss: 4.7756297567817185
Epoch: 3/10 Loss: 4.38765674001819
Epoch: 4/10 Loss: 4.071462697695352
Epoch: 5/10 Loss: 3.7681871848932764
Epoch: 6/10 Loss: 3.480992128825932
Epoch: 7/10 Loss: 3.196643516758158
Epoch: 8/10 Loss: 2.948043690559297
Epoch: 9/10 Loss: 2.709813594689795
Epoch: 10/10 Loss: 2.509170540952323
Model Trained and Saved
- Loaded the serialized model and the preprocessed data and then generated a fake script of 400 words, shown below...
We can easily see that the generated fake script snippet looks as if it came from the original TV script; however, it is far from perfect. Multiple characters speak in (somewhat) complete sentences. Not all sentences make sense, but it doesn't have to be perfect! It takes a lot of time, experimentation, and resources to get good results for such a complex task. We could use a smaller vocabulary (discarding uncommon words) or get more data to improve the outcome. Still, it is impressive to see how a fairly simple network can mimic human-generated text to some degree with little effort.
Experimenting with different prime words and different script lengths generates some engaging (and funny) text; give it a try!