SOTU Text Generation

We trained language models on State of the Union (SOTU) texts; the trained models can then generate SOTU-style text.

We have five language models, one per corpus:

  • Clinton
  • GWBush
  • Obama
  • Trump
  • All of the above combined (last4)

All trained models are stored in this repository.

See below for notes.

README-dev.md has more developer notes.

Workflow

Experimenting

Training takes a while: easily 20 minutes to 4-5 hours per model, depending on the architecture and corpus.

We have two notebooks:

  • one for training, used for experimenting
  • one for predicting (generating text)

Scripting

There are also Python scripts that generate multiple models in one go, as sketched below.
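
A minimal sketch of such a driver (the data layout and the make_sequences/build_model helpers are hypothetical stand-ins for the notebook code, not the repo's actual scripts):

# sketch of a driver that trains one model per corpus in a single run;
# the data paths and the make_sequences/build_model helpers are
# hypothetical stand-ins for the notebook code
from pathlib import Path

CORPORA = ["clinton", "gwbush", "obama", "trump", "all"]

for name in CORPORA:
    text = Path(f"data/{name}.txt").read_text()                # assumed layout
    xs, ys, tokenizer, max_sequence_len = make_sequences(text) # hypothetical helper
    model = build_model(len(tokenizer.word_index) + 1,         # vocabulary size
                        max_sequence_len)                      # hypothetical helper
    model.fit(xs, ys, validation_split=0.2, epochs=500)
    model.save(f"models/{name}")                               # one saved model per corpus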

Model Serving

A Flask application lives in the model-serving directory. See model-serving/README.md for more details.
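
For orientation, a next-word endpoint could look roughly like this; the route, artifact paths, and payload shape are assumptions, not the actual app in model-serving:

# minimal next-word endpoint; route, artifact paths, and payload shape
# are assumptions, not the actual model-serving app
import pickle
import numpy as np
import tensorflow as tf
from flask import Flask, request, jsonify
from tensorflow.keras.preprocessing.sequence import pad_sequences

app = Flask(__name__)
model = tf.keras.models.load_model("models/obama")       # assumed path
with open("models/obama.tokenizer.pkl", "rb") as f:      # assumed artifact
    tokenizer = pickle.load(f)
max_sequence_len = 132   # Obama max sequence length, from the table below

@app.route("/predict", methods=["POST"])
def predict():
    seed = request.json["text"]
    seq = tokenizer.texts_to_sequences([seed])[0]
    seq = pad_sequences([seq], maxlen=max_sequence_len - 1, padding='pre')
    next_id = int(np.argmax(model.predict(seq), axis=-1)[0])
    return jsonify({"next_word": tokenizer.index_word[next_id]})

if __name__ == "__main__":
    app.run()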

Experimenting with Models

I wanted to experiment with various model architectures.
Pretty much all of them overfit to a large degree: training accuracy climbs far above validation accuracy.
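
A quick way to see that gap is to plot training versus validation accuracy from the Keras fit history (standard Keras and matplotlib, nothing repo-specific):

# plot training vs. validation accuracy from a Keras History object;
# a large, persistent gap between the curves is the overfitting signal
import matplotlib.pyplot as plt

def plot_history(history):
    plt.plot(history.history['accuracy'], label='train')
    plt.plot(history.history['val_accuracy'], label='validation')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.legend()
    plt.show()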

Data characteristics

| Data    | Total words | Unique words | Max sequence length |
|---------|-------------|--------------|---------------------|
| Clinton | 51,977      | 4,526        | 284                 |
| GWBush  | 44,282      | 4,701        | 182                 |
| Obama   | 53,895      | 4,957        | 132                 |
| Trump   | 22,349      | 3,539        | 159                 |
| all     | 172,503     | 8,934        | 284                 |
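
The training code below consumes xs, ys, num_unique_words, and max_sequence_len. A sketch of the standard Keras next-word setup that would produce them (an assumption; the notebooks may differ in detail):

# standard next-word-prediction data prep (assumed; notebooks may differ):
# each line of text yields n-gram prefixes, with the last token as the label
import tensorflow as tf
from pathlib import Path
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

lines = Path("data/clinton.txt").read_text().lower().split("\n")  # assumed path

tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
num_unique_words = len(tokenizer.word_index) + 1   # +1 for the padding index 0

sequences = []
for line in lines:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        sequences.append(token_list[:i + 1])       # growing n-gram prefixes

max_sequence_len = max(len(s) for s in sequences)
sequences = pad_sequences(sequences, maxlen=max_sequence_len, padding='pre')

xs = sequences[:, :-1]                             # inputs: all but the last token
ys = tf.keras.utils.to_categorical(sequences[:, -1],
                                   num_classes=num_unique_words)  # one-hot labels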

Model 1 - Smaller Model

Surprisingly, this small model works well, achieving pretty good training accuracy (around 90%) on most of the individual texts.

# model 1: single bidirectional LSTM
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

model_version = "1"
model = Sequential([
    Embedding(input_dim=num_unique_words, output_dim=100,
              input_length=max_sequence_len - 1),
    Bidirectional(LSTM(64)),
    Dense(num_unique_words, activation='softmax'),
])

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# stop once training accuracy stops improving by at least 0.01 for 10 epochs
cb_early_stop = tf.keras.callbacks.EarlyStopping(monitor='accuracy',
                                                 min_delta=0.01, patience=10,
                                                 verbose=1)

# tensorboard_callback is assumed to be defined earlier,
# e.g. tf.keras.callbacks.TensorBoard(log_dir=...)
history = model.fit(xs, ys, validation_split=0.2, epochs=500,
                    verbose=1, callbacks=[tensorboard_callback, cb_early_stop])

| Data    | Parameters | Model size | Epochs | Accuracy (%) | Training time                      |
|---------|------------|------------|--------|--------------|------------------------------------|
| Clinton | 1,120,934  | 13.5 MB    | 130    | 92.12        | 1 hour, 57 minutes and 48 seconds  |
| GWBush  | 1,161,009  | 14.0 MB    | 109    | 96.96        | 1 hour, 2 minutes and 22 seconds   |
| Obama   | 1,219,633  | 14.7 MB    | 146    | 89.15        | 1 hour, 19 minutes and 48 seconds  |
| Trump   | 894,911    | 10.8 MB    | 90     | 94.41        | 22 minutes and 11.32 seconds       |
| all     | 2,130,366  | 25.6 MB    | 99     | 63.70        | 4 hours, 48 minutes and 30 seconds |
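
Once a model is trained, generation is a greedy next-word loop. A minimal sketch, assuming the tokenizer and max_sequence_len from training are in scope:

# greedy next-word generation (sketch; assumes the training-time
# tokenizer and max_sequence_len are in scope)
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate(model, seed_text, n_words=30):
    for _ in range(n_words):
        seq = tokenizer.texts_to_sequences([seed_text])[0]
        seq = pad_sequences([seq], maxlen=max_sequence_len - 1, padding='pre')
        next_id = int(np.argmax(model.predict(seq, verbose=0), axis=-1)[0])
        seed_text += " " + tokenizer.index_word[next_id]
    return seed_text

print(generate(model, "my fellow americans"))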

Model 2 - 2 biLSTMs

# model 2: two stacked bidirectional LSTMs
# (total_words is the vocabulary size, same as num_unique_words above)
model = Sequential([
    Embedding(input_dim=total_words, output_dim=100,
              input_length=max_sequence_len - 1),
    Bidirectional(LSTM(64, return_sequences=True)),  # feed full sequence to next layer
    Bidirectional(LSTM(64)),
    Dense(total_words, activation='softmax'),
])

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

cb_early_stop = tf.keras.callbacks.EarlyStopping(monitor='accuracy',
                                                 min_delta=0.05, patience=20,
                                                 verbose=2)

history = model.fit(xs, ys, validation_split=0.2, epochs=500,
                    verbose=1, callbacks=[tensorboard_callback, cb_early_stop])

| Data    | Parameters | Model size | Epochs | Accuracy (%) | Training time                         |
|---------|------------|------------|--------|--------------|---------------------------------------|
| Clinton | 1,219,750  | 14.7 MB    | 148    | 91.35        | 3 hours, 21 minutes and 25.89 seconds |
| GWBush  | 1,259,825  | 12.2 MB    | 129    | 94.72        | 1 hour, 38 minutes and 45.70 seconds  |
| Obama   | 1,318,449  | 15.9 MB    | 165    | 87.41        | 2 hours, 19 minutes and 3.25 seconds  |
| Trump   | 993,727    | 12.0 MB    | 136    | 93.24        | 51 minutes and 54.15 seconds          |
| all     | 2,229,182  | 26.8 MB    | 100    | 59.06        | 7 hours, 34 minutes and 17.98 seconds |

Model 3 - 2 biLSTMs + Dropout

# model 3: two stacked bidirectional LSTMs with dropout in between
from tensorflow.keras.layers import Dropout

model = Sequential([
    Embedding(input_dim=total_words, output_dim=100,
              input_length=max_sequence_len - 1),
    Bidirectional(LSTM(64, return_sequences=True)),
    Dropout(0.3),                                    # regularization between the LSTMs
    Bidirectional(LSTM(64)),
    Dense(total_words, activation='softmax'),
])

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

cb_early_stop = tf.keras.callbacks.EarlyStopping(monitor='accuracy',
                                                 min_delta=0.05, patience=20,
                                                 verbose=2)

history = model.fit(xs, ys, validation_split=0.2, epochs=500,
                    verbose=1, callbacks=[tensorboard_callback, cb_early_stop])

| Data    | Parameters | Model size | Epochs | Accuracy (%) | Training time                         |
|---------|------------|------------|--------|--------------|---------------------------------------|
| Clinton | 1,219,750  | 14.7 MB    | 176    | 77.44        | 4 hours, 25 minutes and 0.92 seconds  |
| GWBush  | 1,259,825  | 15.2 MB    | 174    | 81.99        | 2 hours, 18 minutes and 40.00 seconds |
| Obama   | 1,318,449  | 15.9 MB    | 152    | 73.32        | 2 hours, 9 minutes and 36.04 seconds  |
| Trump   | 993,727    | 12.0 MB    | 158    | 86.68        | 56 minutes and 41.30 seconds          |
| all     | 2,229,182  | 26.4 MB    | 117    | 41.79        | 6 hours, 25 minutes and 27.42 seconds |

Model 4 - biLSTM + Dropout + LSTM + Dense

This architecture is from Laurence Moroney's Shakespeare notebook.

# model 4: biLSTM + Dropout + LSTM + Dense with L2 regularization, from
# https://github.com/lmoroney/dlaicourse/blob/master/TensorFlow%20In%20Practice/Course%203%20-%20NLP/NLP_Week4_Exercise_Shakespeare_Answer.ipynb
from tensorflow.keras import regularizers

model_version = "4"
model = Sequential([
    Embedding(num_unique_words, 100, input_length=max_sequence_len - 1),
    Bidirectional(LSTM(150, return_sequences=True)),
    Dropout(0.2),
    LSTM(100),
    # Dense units must be an integer, hence the floor division
    Dense(num_unique_words // 2, activation='relu',
          kernel_regularizer=regularizers.l2(0.01)),
    Dense(num_unique_words, activation='softmax'),
])

model.compile(loss='categorical_crossentropy',
              optimizer='adam',   # alternative: Adam(learning_rate=0.01)
              metrics=['accuracy'])

cb_early_stop = tf.keras.callbacks.EarlyStopping(monitor='accuracy',
                                                 min_delta=0.05, patience=20,
                                                 verbose=2)

history = model.fit(xs, ys, validation_split=0.2, epochs=500,
                    verbose=1, callbacks=[tensorboard_callback, cb_early_stop])

| Data    | Parameters | Model size | Epochs | Accuracy (%) | Training time                         |
|---------|------------|------------|--------|--------------|---------------------------------------|
| Clinton | 11,389,627 | 136.7 MB   | 163    | 65.56        | 2 hours, 48 minutes and 30.75 seconds |
| GWBush  | 12,221,101 | 146.7 MB   | 187    | 70.90        | 2 hours, 24 minutes and 30.99 seconds |
| Obama   | 13,495,981 | 162.0 MB   | 182    | 64.46        | 2 hours, 27 minutes and 30.20 seconds |
| Trump   | 7,258,199  | 87.2 MB    | 165    | 80.65        | 59 minutes and 38.71 seconds          |
| all     | 41,723,279 | 500.7 MB   | 60     | 27.60        | 3 hours, 57 minutes and 24.50 seconds |