Attempt to use GloVe character embeddings causes strange errors.
Hellisotherpeople opened this issue · 0 comments
Hellisotherpeople commented
Hello! I've been working from the Keras text-generation example and came across your code. I tried to reverse-engineer what I already had so that it works with your character embeddings. Unfortunately, this introduces a very strange error that I can't find any documentation on, and I'm unsure of the cause. My first thought was that I failed to wire the embedding layer into the model properly, yet it trains successfully (?). If you could offer me some assistance with fixing this, I'd be greatly in your debt!
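For reference, here is a minimal sketch of how I understood the embedding layer is supposed to be wired in (hypothetical; it reuses the variable names from my full script further down). The part I'm least sure about is switching the inputs from one-hot tensors to integer index sequences once the Embedding layer sits in front of the GRU.

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, GRU, Dense, Activation

# Hypothetical sketch, reusing chars, maxlen, embedding_dim and embedding_matrix
# from the full script below. With an Embedding layer the input X is integer-encoded,
# shape (n_sequences, maxlen), rather than the one-hot (n_sequences, maxlen, len(chars))
# tensor from the original example.
model = Sequential()
model.add(Embedding(len(chars), embedding_dim,
                    input_length=maxlen,
                    weights=[embedding_matrix]))      # seed with the GloVe char vectors
model.add(Bidirectional(GRU(32)))                     # recurrent layer consumes the embeddings
model.add(Dense(len(chars), activation='softmax'))    # distribution over the next character
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')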
Here's an example stack trace / output:
275456/283158 [============================>.] - ETA: 4s - loss: 1.4438e-05
275584/283158 [============================>.] - ETA: 4s - loss: 1.4431e-05
275712/283158 [============================>.] - ETA: 4s - loss: 1.4425e-05
275840/283158 [============================>.] - ETA: 4s - loss: 1.4418e-05
275968/283158 [============================>.] - ETA: 3s - loss: 1.4411e-05
276096/283158 [============================>.] - ETA: 3s - loss: 1.4405e-05
276224/283158 [============================>.] - ETA: 3s - loss: 1.4398e-05
276352/283158 [============================>.] - ETA: 3s - loss: 1.4391e-05
276480/283158 [============================>.] - ETA: 3s - loss: 1.4385e-05
276608/283158 [============================>.] - ETA: 3s - loss: 1.4378e-05
276736/283158 [============================>.] - ETA: 3s - loss: 1.4371e-05
276864/283158 [============================>.] - ETA: 3s - loss: 1.4365e-05
276992/283158 [============================>.] - ETA: 3s - loss: 1.4358e-05
277120/283158 [============================>.] - ETA: 3s - loss: 1.4351e-05
277248/283158 [============================>.] - ETA: 3s - loss: 1.4345e-05
277376/283158 [============================>.] - ETA: 3s - loss: 1.4338e-05
277504/283158 [============================>.] - ETA: 3s - loss: 1.4332e-05
277632/283158 [============================>.] - ETA: 3s - loss: 1.4325e-05
277760/283158 [============================>.] - ETA: 2s - loss: 1.4318e-05
277888/283158 [============================>.] - ETA: 2s - loss: 1.4312e-05
278016/283158 [============================>.] - ETA: 2s - loss: 1.4305e-05
278144/283158 [============================>.] - ETA: 2s - loss: 1.4299e-05
278272/283158 [============================>.] - ETA: 2s - loss: 1.4292e-05
278400/283158 [============================>.] - ETA: 2s - loss: 1.4285e-05
278528/283158 [============================>.] - ETA: 2s - loss: 1.4279e-05
278656/283158 [============================>.] - ETA: 2s - loss: 1.4272e-05
278784/283158 [============================>.] - ETA: 2s - loss: 1.4266e-05
278912/283158 [============================>.] - ETA: 2s - loss: 1.4259e-05
279040/283158 [============================>.] - ETA: 2s - loss: 1.4253e-05
279168/283158 [============================>.] - ETA: 2s - loss: 1.4246e-05
279296/283158 [============================>.] - ETA: 2s - loss: 1.4240e-05
279424/283158 [============================>.] - ETA: 2s - loss: 1.4233e-05
279552/283158 [============================>.] - ETA: 1s - loss: 1.4227e-05
279680/283158 [============================>.] - ETA: 1s - loss: 1.4220e-05
279808/283158 [============================>.] - ETA: 1s - loss: 1.4214e-05
279936/283158 [============================>.] - ETA: 1s - loss: 1.4207e-05
280064/283158 [============================>.] - ETA: 1s - loss: 1.4201e-05
280192/283158 [============================>.] - ETA: 1s - loss: 1.4194e-05
280320/283158 [============================>.] - ETA: 1s - loss: 1.4188e-05
280448/283158 [============================>.] - ETA: 1s - loss: 1.4181e-05
280576/283158 [============================>.] - ETA: 1s - loss: 1.4175e-05
280704/283158 [============================>.] - ETA: 1s - loss: 1.4168e-05
280832/283158 [============================>.] - ETA: 1s - loss: 1.4162e-05
280960/283158 [============================>.] - ETA: 1s - loss: 1.4155e-05
281088/283158 [============================>.] - ETA: 1s - loss: 1.4149e-05
281216/283158 [============================>.] - ETA: 1s - loss: 1.4142e-05
281344/283158 [============================>.] - ETA: 0s - loss: 1.4136e-05
281472/283158 [============================>.] - ETA: 0s - loss: 1.4129e-05
281600/283158 [============================>.] - ETA: 0s - loss: 1.4123e-05
281728/283158 [============================>.] - ETA: 0s - loss: 1.4117e-05
281856/283158 [============================>.] - ETA: 0s - loss: 1.4110e-05
281984/283158 [============================>.] - ETA: 0s - loss: 1.4104e-05
282112/283158 [============================>.] - ETA: 0s - loss: 1.4097e-05
282240/283158 [============================>.] - ETA: 0s - loss: 1.4091e-05
282368/283158 [============================>.] - ETA: 0s - loss: 1.4085e-05
282496/283158 [============================>.] - ETA: 0s - loss: 1.4078e-05
282624/283158 [============================>.] - ETA: 0s - loss: 1.4072e-05
282752/283158 [============================>.] - ETA: 0s - loss: 1.4066e-05
282880/283158 [============================>.] - ETA: 0s - loss: 1.4059e-05
283008/283158 [============================>.] - ETA: 0s - loss: 1.4053e-05
283136/283158 [============================>.] - ETA: 0s - loss: 1.4046e-05
Epoch 00000: loss improved from inf to 0.00001, saving model to models/weights-improvement-00-0.0000-embeddings.hdf5
283158/283158 [==============================] - 154s - loss: 1.4045e-05
----- diversity: 0.2
----- Generating with seed: "riminal and responsible for not having h"
riminal and responsible for not having h
Traceback (most recent call last):
  File "/root/PycharmProjects/Keras/sandbox.py", line 292, in <module>
    train()
  File "/root/PycharmProjects/Keras/sandbox.py", line 246, in train
    next_index = sample(preds, diversity)
  File "/root/PycharmProjects/Keras/sandbox.py", line 202, in sample
    probas = np.random.multinomial(1, preds, 1)
  File "mtrand.pyx", line 4612, in mtrand.RandomState.multinomial
TypeError: object of type 'numpy.float64' has no len()
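The TypeError itself seems reproducible in isolation when np.random.multinomial is given a scalar instead of a probability vector, so my guess is that preds is no longer the array I expect by the time it reaches sample(), though I don't see why. The snippet below is just my attempt to narrow it down, not code from the script:

import numpy as np

# What I believe is happening inside sample():
bad_preds = np.float64(0.0316)            # a single float instead of a probability vector
np.random.multinomial(1, bad_preds, 1)    # TypeError: object of type 'numpy.float64' has no len()

# A proper 1-D probability vector samples fine:
good_preds = np.array([0.1, 0.2, 0.7])
np.random.multinomial(1, good_preds, 1)   # e.g. array([[0, 0, 1]])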
Here's my code:
'''Example script to generate text from Nietzsche's writings.
At least 20 epochs are required before the generated text
starts sounding coherent.
It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.
If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.
'''
from keras.models import Sequential, load_model
from keras.layers import Dense, Activation
from keras.layers import LSTM, SimpleRNN, Dropout, Bidirectional, Embedding, GRU
from keras.callbacks import ModelCheckpoint
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
import pprint
embeddings_path = "glove.840B.300d-char.txt"
embedding_dim = 300
text_file_kapital = open("Das_Kapital.txt", 'rb')
def preprocess(text_file):
    ## get rid of line breaks and non-ASCII
    lines = []
    for line in text_file:
        line = line.strip().lower()
        line = line.decode("ascii", "ignore")
        if len(line) == 0:
            continue
        lines.append(line)
    text_file.close()
    text = " ".join(lines)
    return text
text = preprocess(text_file_kapital)
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 10
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))
print('Vectorization...')
# working
# X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
# y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
# for i, sentence in enumerate(sentences):
#     for t, char in enumerate(sentence):
#         X[i, t, char_indices[char]] = 1
#     y[i, char_indices[next_chars[i]]] = 1
X = np.zeros((len(sentences), maxlen), dtype=np.int)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t] = char_indices[char]
    y[i, char_indices[next_chars[i]]] = 1
print('Processing pretrained character embeds...')
embedding_vectors = {}
with open(embeddings_path, 'r') as f:
    for line in f:
        line_split = line.strip().split(" ")
        vec = np.array(line_split[1:], dtype=float)
        char = line_split[0]
        embedding_vectors[char] = vec
embedding_matrix = np.zeros((len(chars), 300))
#embedding_matrix = np.random.uniform(-1, 1, (len(chars), 300))
for char, i in char_indices.items():
    # print("{}, {}".format(char, i))
    embedding_vector = embedding_vectors.get(char)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
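# (Characters without a matching row in glove.840B.300d-char.txt keep the all-zero
# vector that embedding_matrix was initialised with above.)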
# build the model: a bidirectional GRU
def build_model():
    ## working
    # print('Build model...')
    # model = Sequential()
    # #model.add(Embedding(len(chars), embedding_dim, input_length=maxlen, weights=[embedding_matrix]))
    # model.add(Bidirectional(GRU(128, unroll=True, return_sequences=True), input_shape=(maxlen, len(chars))))
    # model.add(Dropout(0.2))
    # model.add(Bidirectional(GRU(128, unroll=True)))
    # model.add(Dropout(0.2))
    # model.add(Dense(len(chars)))
    # model.add(Activation('softmax'))
    #
    # model.compile(loss='categorical_crossentropy', optimizer="RMSprop")
    # return model

    print('Build model...')
    model = Sequential()
    model.add(Embedding(len(chars), embedding_dim, input_length=maxlen, weights=[embedding_matrix]))
    model.add(Bidirectional(GRU(32, unroll=True), input_shape=(maxlen, len(chars))))
    model.add(Dropout(0.2))
    # model.add(Bidirectional(GRU(128, unroll=True)))
    # model.add(Dropout(0.2))
    model.add(Dense(len(chars)))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer="RMSprop")
    return model
#model = load_model("models/weights-improvement-00-1.2807-biggggger.hdf5")
model = build_model()
model.summary()
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-6) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
# train the model, output generated text after each iteration
def train():
    for iteration in range(1, 60):
        filepath = "models/weights-improvement-{epoch:02d}-{loss:.4f}-embeddings.hdf5"
        checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
        callbacks_list = [checkpoint]
        print()
        print('-' * 50)
        print('Iteration', iteration)
        model.fit(X, y,
                  batch_size=128,
                  epochs=1, callbacks=callbacks_list)
        start_index = random.randint(0, len(text) - maxlen - 1)
        # model.save("models/testmodel.h5")
        for diversity in [0.2, 0.5, 1.0, 1.2]:
            print()
            print('----- diversity:', diversity)
            generated = ''
            sentence = text[start_index: start_index + maxlen]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            sys.stdout.write(generated)
            for i in range(400):
                # x = np.zeros((1, maxlen, len(chars)))
                # for t, char in enumerate(sentence):
                #     x[0, t, char_indices[char]] = 1.
                #
                # preds = model.predict(x, verbose=0)[0]
                # next_index = sample(preds, diversity)
                # next_char = indices_char[next_index]
                x = np.zeros((1, maxlen), dtype=np.int)
                for t, char in enumerate(sentence):
                    x[0, t] = char_indices[char]
                preds = model.predict(x, verbose=0)[0][0]
                next_index = sample(preds, diversity)
                next_char = indices_char[next_index]
                generated += next_char
                sentence = sentence[1:] + next_char
                sys.stdout.write(next_char)
                sys.stdout.flush()
            print()
def predict(num_to_predict, temperature, seed):
    start_index = random.randint(0, len(text) - maxlen - 1)
    if len(seed) > 40:
        print("Type fewer characters; you typed this many characters:")
        print(len(seed))
        return 0
    newstring = ""
    space = 40 - len(seed)
    for i in range(space):
        newstring += " "
    seed = newstring + seed
    sentence = seed
    generated = ''
    # sentence = text[start_index: start_index + maxlen]
    generated += sentence
    print('----- Hey Marx, Tell me about: "' + sentence.strip() + '"')
    sys.stdout.write(generated)
    for i in range(num_to_predict):
        x = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x[0, t, char_indices[char]] = 1.
        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds, temperature)
        next_char = indices_char[next_index]
        generated += next_char
        sentence = sentence[1:] + next_char
        # sys.stdout.write(next_char)
        # sys.stdout.flush()
    print()
    pprint.pprint(generated)
train()
# ################# PREDICT
# cont = 0
#
# while cont == 0:
#     newtext = input("What do you want me to tell you about?")
#
#     if newtext == "1":
#         cont = 1
#     predict(500, 0.5, newtext.lower())