alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras

About embedding_weights

chunjoe opened this issue · 8 comments

First, thank you for code sharing.

In w2v.py, I saw your code as follows:

 embedding_weights = [np.array([embedding_model[w] if w in embedding_model
                  else np.random.uniform(-0.25, 0.25, embedding_model.vector_size)
                  for w in vocabulary_inv])]

To obtain weights from embedding_model, the parameter w must be a word, e.g. "happy".
But in w2v.py, in "for w in vocabulary_inv", w is the index of a word.

Is that a mistake here?
The branch "else np.random.uniform(-0.25, 0.25, embedding_model.vector_size)" therefore seems to be executed on every iteration.
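
(A minimal sketch of that concern, assuming vocabulary_inv really is a dict {int: str} as built in sentiment_cnn.py: iterating a dict yields its keys, so w would be an int and the random fallback would fire.)

    # Illustrative sketch (not from the repo): iterating a dict yields its keys.
    vocabulary_inv = {0: "<PAD/>", 1: "the", 2: "happy"}  # {int: str}, as in sentiment_cnn.py

    for w in vocabulary_inv:
        print(w)  # prints 0, 1, 2 -- integer indexes, not words
        # an int is presumably never "in" embedding_model, so the
        # np.random.uniform(...) fallback would run on every iteration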

Hi,

In for w in vocabulary_inv, vocabulary_inv is a list of words, not indexes.

Hi,

I appreciate your prompt reply.

Here, you mentioned that it is a dict {int: str}.

So in for w in vocabulary_inv, is vocabulary_inv a list of words?

Sorry, vocabulary_inv is a list of strings, not a dict. And w is a string (i.e. a word).

Sorry to disturb you again, but I still find it strange...

In sentiment_cnn.py, vocabulary_inv is a dictionary {int: str}. vocabulary_inv is then passed to train_word2vec as one of its parameters.

vocabulary = imdb.get_word_index()
vocabulary_inv = dict((v, k) for k, v in vocabulary.items())
vocabulary_inv[0] = "<PAD/>"

In w2v.py, I don't see where vocabulary_inv is converted to a list.
I also added print(type(vocabulary_inv)) in w2v.py, and the program printed <class 'dict'>.

This discrepancy arose after I switched to the new [keras] data source. In the previous major version the data source was data_helpers.load_data(), which returned vocabulary_inv as a list. I will fix it when I have more time. It should be a dict everywhere.
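
For illustration only, a minimal sketch of what a dict-based version of that loop could look like, assuming vocabulary_inv maps consecutive indexes 0..len-1 to words (not the repository's actual fix):

    import numpy as np

    # Sketch: one weight row per word index, iterating the dict in index order
    # so that row i of the matrix corresponds to word index i.
    embedding_weights = [np.array([
        embedding_model[vocabulary_inv[i]] if vocabulary_inv[i] in embedding_model
        else np.random.uniform(-0.25, 0.25, embedding_model.vector_size)
        for i in range(len(vocabulary_inv))])]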

Thank you very much!!!

I wrote the following code. I know it wastes a little memory...
For the purpose of solving the problem, is this code right?

vocabulary_inv_list = [vocabulary_inv[i] for i in range(0, len(vocabulary_inv))]
embedding_weights = [np.array([embedding_model[w] if w in embedding_model
		else np.random.uniform(-0.25, 0.25, embedding_model.vector_size)
		for w in vocabulary_inv_list])]

Looks okay. embedding_weights must be a list of length 1 containing an ndarray of shape (len(vocabulary_inv), num_features). It is wrapped in a list for compatibility with Keras layer.set_weights().
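
(A rough sanity check of that structure, with the Embedding layer usage sketched under the assumption of a standard Keras setup; the import and layer wiring may differ from the repo:)

    from keras.layers import Embedding  # or tensorflow.keras.layers, depending on setup

    # Expected structure for layer.set_weights(): a list of length 1 holding
    # an array of shape (vocab_size, num_features).
    assert len(embedding_weights) == 1
    assert embedding_weights[0].shape == (len(vocabulary_inv), embedding_model.vector_size)

    embedding_layer = Embedding(input_dim=len(vocabulary_inv),
                                output_dim=embedding_model.vector_size)
    embedding_layer.build((None,))                  # create the weights before setting them
    embedding_layer.set_weights(embedding_weights)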

Please see the updated version.