alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras

About embedding_weights

chunjoe opened this issue · 8 comments

First, thank you for code sharing.

In w2v.py, I saw your code as follows:

 embedding_weights = [np.array([embedding_model[w] if w in embedding_model
                  else np.random.uniform(-0.25, 0.25, embedding_model.vector_size)
                  for w in vocabulary_inv])]

To obtain weights from embedding_model, the parameter w must be a word, e.g. "happy".
But in w2v.py, in "for w in vocabulary_inv", w is the index of a word.

Is that a mistake here?
The branch "else np.random.uniform(-0.25, 0.25, embedding_model.vector_size)" therefore seems to be executed on every iteration.
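
(A minimal sketch of that concern, assuming vocabulary_inv really is a dict {int: str} as built in sentiment_cnn.py: iterating a dict yields its keys, so w would be an int and the random fallback would fire.)

    # Illustrative sketch (not from the repo): iterating a dict yields its keys.
    vocabulary_inv = {0: "<PAD/>", 1: "the", 2: "happy"}  # {int: str}, as in sentiment_cnn.py

    for w in vocabulary_inv:
        print(w)  # prints 0, 1, 2 -- integer indexes, not words
        # an int is presumably never "in" embedding_model, so the
        # np.random.uniform(...) fallback would run on every iteration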

Hi,

In for w in vocabulary_inv, vocabulary_inv is a list of words, not indexes.

Hi,

I appreciate your prompt reply.

Here, you mentioned that it is a dict {int: str}.

So in for w in vocabulary_inv, is vocabulary_inv a list of words?

Sorry, vocabulary_inv is a list of strings, not a dict. And w is a string (i.e. a word).

Sorry to disturb you again, but I still find it strange...

In sentiment_cnn.py, vocabulary_inv is a dictionary {int: str}. vocabulary_inv is then passed to train_word2vec as one of its parameters.

vocabulary = imdb.get_word_index()
vocabulary_inv = dict((v, k) for k, v in vocabulary.items())
vocabulary_inv[0] = "<PAD/>"

In w2v.py, I don't see where vocabulary_inv is converted to a list.
I also added print(type(vocabulary_inv)) in w2v.py, and the program printed <class 'dict'>.

This discrepancy arose after I switched to the new [keras] data source. In the previous major version the data source was data_helpers.load_data(), which returned vocabulary_inv as a list. I will fix it when I have more time. It should be a dict everywhere.
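
For illustration only, a minimal sketch of what a dict-based version of that loop could look like, assuming vocabulary_inv maps consecutive indexes 0..len-1 to words (not the repository's actual fix):

    import numpy as np

    # Sketch: one weight row per word index, iterating the dict in index order
    # so that row i of the matrix corresponds to word index i.
    embedding_weights = [np.array([
        embedding_model[vocabulary_inv[i]] if vocabulary_inv[i] in embedding_model
        else np.random.uniform(-0.25, 0.25, embedding_model.vector_size)
        for i in range(len(vocabulary_inv))])]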

Thank you very much!!!

I wrote the following code. I know it wastes a little memory...
For the purpose of solving the problem, is this code right?

vocabulary_inv_list = [vocabulary_inv[i] for i in range(0, len(vocabulary_inv))]
embedding_weights = [np.array([embedding_model[w] if w in embedding_model
		else np.random.uniform(-0.25, 0.25, embedding_model.vector_size)
		for w in vocabulary_inv_list])]

Looks okay. embedding_weights must be a list of length 1 containing an ndarray of shape (len(vocabulary_inv), num_features). It is wrapped in a list for compatibility with Keras layer.set_weights().
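
(A rough sanity check of that structure, with the Embedding layer usage sketched under the assumption of a standard Keras setup; the import and layer wiring may differ from the repo:)

    from keras.layers import Embedding  # or tensorflow.keras.layers, depending on setup

    # Expected structure for layer.set_weights(): a list of length 1 holding
    # an array of shape (vocab_size, num_features).
    assert len(embedding_weights) == 1
    assert embedding_weights[0].shape == (len(vocabulary_inv), embedding_model.vector_size)

    embedding_layer = Embedding(input_dim=len(vocabulary_inv),
                                output_dim=embedding_model.vector_size)
    embedding_layer.build((None,))                  # create the weights before setting them
    embedding_layer.set_weights(embedding_weights)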

Please see the updated version.