minimaxir/char-embeddings

Attempt to use GloVe character embeddings causes strange errors.

Hellisotherpeople opened this issue · 0 comments

Hello! I've been working on the text-generation example from Keras, and I saw your code. I tried to rework what I already had to use your character embeddings. Unfortunately, it's introducing a very strange error that I can't find any documentation on, and I'm unsure of the cause. My first thought is that I've failed to wire the embedding layer into the model properly, but it trains successfully (?). If you could offer me some assistance with fixing this, I'd be greatly in your debt!

Here's an example stack trace / output:

275456/283158 [============================>.] - ETA: 4s - loss: 1.4438e-05
...
283136/283158 [============================>.] - ETA: 0s - loss: 1.4046e-05
Epoch 00000: loss improved from inf to 0.00001, saving model to models/weights-improvement-00-0.0000-embeddings.hdf5

283158/283158 [==============================] - 154s - loss: 1.4045e-05   

----- diversity: 0.2
----- Generating with seed: "riminal and responsible for not having h"
riminal and responsible for not having h
Traceback (most recent call last):
  File "/root/PycharmProjects/Keras/sandbox.py", line 292, in <module>
    train()
  File "/root/PycharmProjects/Keras/sandbox.py", line 246, in train
    next_index = sample(preds, diversity)
  File "/root/PycharmProjects/Keras/sandbox.py", line 202, in sample
    probas = np.random.multinomial(1, preds, 1)
  File "mtrand.pyx", line 4612, in mtrand.RandomState.multinomial
TypeError: object of type 'numpy.float64' has no len()
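
If it helps narrow things down, this looks like the error np.random.multinomial throws when it gets a single float instead of a 1-D probability array. Here's my guess at a minimal reproduction (not verified against the full script):

import numpy as np

# multinomial calls len() on its pvals argument, so a scalar float64
# raises "TypeError: object of type 'numpy.float64' has no len()"
preds = np.float64(0.5)
np.random.multinomial(1, preds, 1)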


Here's my code:

'''Example script to generate text from Nietzsche's writings.

At least 20 epochs are required before the generated text
starts sounding coherent.

It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.

If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.
'''

from keras.models import Sequential, load_model
from keras.layers import Dense, Activation
from keras.layers import LSTM, SimpleRNN, Dropout, Bidirectional, Embedding, GRU
from keras.callbacks import ModelCheckpoint
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
import pprint

embeddings_path = "glove.840B.300d-char.txt"
embedding_dim = 300


text_file_kapital = open("Das_Kapital.txt", 'rb')

def preprocess(text_file):

    ##get rid of line breaks and non-ASCII
    lines = []
    for line in text_file:
        line = line.strip().lower()
        line = line.decode("ascii", "ignore")
        if len(line) == 0:
            continue
        lines.append(line)
    text_file.close()
    text = " ".join(lines)
    return text

text = preprocess(text_file_kapital)


chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 10
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')

#working
# X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
# y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
# for i, sentence in enumerate(sentences):
#     for t, char in enumerate(sentence):
#         X[i, t, char_indices[char]] = 1
#     y[i, char_indices[next_chars[i]]] = 1

X = np.zeros((len(sentences), maxlen), dtype=np.int)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
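# X holds integer character indices (what the Embedding layer expects),
# while y stays a one-hot vector over the character vocabulary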
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t] = char_indices[char]
y[i, char_indices[next_chars[i]]] = 1

print('Processing pretrained character embeds...')
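# each line of the GloVe char file is expected to be "<char> v1 v2 ... v300", space-separated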
embedding_vectors = {}
with open(embeddings_path, 'r') as f:
    for line in f:
        line_split = line.strip().split(" ")
        vec = np.array(line_split[1:], dtype=float)
        char = line_split[0]
        embedding_vectors[char] = vec

embedding_matrix = np.zeros((len(chars), 300))
#embedding_matrix = np.random.uniform(-1, 1, (len(chars), 300))
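# corpus characters without a vector in the GloVe file keep an all-zero row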
for char, i in char_indices.items():
    #print ("{}, {}".format(char, i))
    embedding_vector = embedding_vectors.get(char)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector



# build the model: a bidirectional GRU

def build_model():

    ## working
    # print('Build model...')
    # model = Sequential()
    # #model.add(Embedding(len(chars), embedding_dim, input_length=maxlen, weights=[embedding_matrix]))
    # model.add(Bidirectional(GRU(128, unroll=True, return_sequences=True), input_shape=(maxlen, len(chars))))
    # model.add(Dropout(0.2))
    # model.add(Bidirectional(GRU(128, unroll=True)))
    # model.add(Dropout(0.2))
    # model.add(Dense(len(chars)))
    # model.add(Activation('softmax'))
    #
    # model.compile(loss='categorical_crossentropy', optimizer="RMSprop")
    # return model

    print('Build model...')
    model = Sequential()
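    # weights=[embedding_matrix] initializes the Embedding layer with the pretrained GloVe character vectors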
    model.add(Embedding(len(chars), embedding_dim, input_length=maxlen, weights=[embedding_matrix]))
    model.add(Bidirectional(GRU(32, unroll=True), input_shape=(maxlen, len(chars))))
    model.add(Dropout(0.2))
    # model.add(Bidirectional(GRU(128, unroll=True)))
    # model.add(Dropout(0.2))
    model.add(Dense(len(chars)))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', optimizer="RMSprop")
    return model


#model = load_model("models/weights-improvement-00-1.2807-biggggger.hdf5")

model = build_model()
model.summary()

def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-6) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
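    # np.random.multinomial expects preds to be a 1-D array of probabilities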
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# train the model, output generated text after each iteration

def train():
    for iteration in range(1, 60):
        filepath = "models/weights-improvement-{epoch:02d}-{loss:.4f}-embeddings.hdf5"
        checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
        callbacks_list = [checkpoint]
        print()
        print('-' * 50)
        print('Iteration', iteration)
        model.fit(X, y,
                  batch_size=128,
                  epochs=1, callbacks = callbacks_list)

        start_index = random.randint(0, len(text) - maxlen - 1)
        #model.save("models/testmodel.h5")

        for diversity in [0.2, 0.5, 1.0, 1.2]:
            print()
            print('----- diversity:', diversity)

            generated = ''
            sentence = text[start_index: start_index + maxlen]
            generated += sentence
            print('----- Generating with seed: "' + sentence + '"')
            sys.stdout.write(generated)

            for i in range(400):
                # x = np.zeros((1, maxlen, len(chars)))
                # for t, char in enumerate(sentence):
                #     x[0, t, char_indices[char]] = 1.
                #
                # preds = model.predict(x, verbose=0)[0]
                # next_index = sample(preds, diversity)
                # next_char = indices_char[next_index]

                x = np.zeros((1, maxlen), dtype=np.int)
                for t, char in enumerate(sentence):
                    x[0, t] = char_indices[char]

                preds = model.predict(x, verbose=0)[0][0]
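                # note: predict() on one sequence should return shape (1, len(chars)), so [0][0] is a single float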
                next_index = sample(preds, diversity)
                next_char = indices_char[next_index]

                generated += next_char
                sentence = sentence[1:] + next_char

                sys.stdout.write(next_char)
                sys.stdout.flush()
            print()

def predict(num_to_predict, temperature, seed):
    start_index = random.randint(0, len(text) - maxlen - 1)

    if len(seed) > 40:
        print("Type fewer characters, you typed this man characters")
        print(len(seed))
        return 0
    newstring = ""
    space = 40 - len(seed)
    for i in range(space):
        newstring += " "
    seed = newstring + seed
    sentence = seed
    generated = ''
    #sentence = text[start_index: start_index + maxlen]
    generated += sentence
    print('----- Hey Marx, Tell me about:  "' + sentence.strip() + '"')
    sys.stdout.write(generated)

    for i in range(num_to_predict):
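        # note: this still builds the old one-hot (1, maxlen, len(chars)) input rather than integer indices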
        x = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x[0, t, char_indices[char]] = 1.

        preds = model.predict(x, verbose=0)[0]
        next_index = sample(preds, temperature)
        next_char = indices_char[next_index]

        generated += next_char
        sentence = sentence[1:] + next_char

        #sys.stdout.write(next_char)
        #sys.stdout.flush()
    print()
    pprint.pprint(generated)

train()

# ################# PREDICT
# cont = 0
#
# while cont == 0:
#     newtext = input("What do you want me to tell you about?")
#
#     if newtext == "1":
#         cont = 1
#     predict(500, 0.5, newtext.lower())