queries about the nuswide_wordvec text file
zhangzeng97 opened this issue · 10 comments
Hi Cao Yue,
Great thanks for your great job done on the DVSQ project. I am currently working on my project at school. It has helped me a lot.
I have successfully deployed the whole project. However, when I tried to run it with my own dataset, some confusion arose. May I ask where the wordvec file in the data folder comes from? I have read your paper about transforming the image representations into label embeddings, but it does not seem relevant to this file. May I ask how I can generate the word vectors, and which dataset should be converted to word vectors?
Thank you.
Best,
Zhang Zeng
Hi Cao Yue,
Great thanks for your fast reply!
I have looked into that and it helps a lot.
Best,
Zhang Zeng
Hi,
May I ask something about the paper itself here?
I have read through it several times, but there are some points that I cannot understand. For example, why do we need the word embeddings for the labels? I tried to print the output of validation, but it is the 81-dimensional label instead of the 300-dimensional word embeddings.
Thanks a lot:)
Best,
Zhang Zeng
Hi Zeng,
You are right: the label itself is 81-dimensional because NUS-WIDE is an 81-class dataset, and the word embedding of a single label is 300-dimensional.
Actually, because NUS-WIDE is a multi-label dataset, the label representation of an image is an 81 x 300 matrix (not just an 81- or 300-dimensional vector). Specifically, the ith row is the word embedding of label i if the image has label i; otherwise, the ith row is all zeros. (You can verify this at line 322 of "net.py".)
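That construction can be sketched in a few lines of numpy (the random wordvec matrix below is just a stand-in for the real 81 x 300 embedding table):

```python
import numpy as np

n_class, embed_dim = 81, 300                   # NUS-WIDE labels, word2vec dimension
wordvec = np.random.randn(n_class, embed_dim)  # stand-in for the real embedding table

label = np.zeros(n_class)                      # multi-hot label of one image
label[[3, 17]] = 1                             # say the image carries labels 3 and 17

# Row i is the word embedding of label i if the image has label i, else all zeros
label_embedding = label[:, None] * wordvec
print(label_embedding.shape)                   # (81, 300)
```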
Hi Bin,
Thank you so much for your fast reply!
I have gone through it again. May I ask what the codebook C mentioned in section 3.2 of the paper is? My understanding is that for 81 classes, each class contains K centers. And is the C here the same as the C at line 68 of the net_val.py file?
I tried to print out self.C from the model, and it is a 1024 x 300 tensor. I think the 300 corresponds to the 300-dimensional word vectors, but I am not sure where the 1024 comes from.
Best,
Zeng
1024 = n_subcenter(256) * n_subspace(4).
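In other words, the 1024 rows come from stacking 4 subspace codebooks of 256 subcenters each. A rough shape check (how DVSQ actually distributes the 300 dimensions among the subspaces is not shown here):

```python
import numpy as np

n_subspace, n_subcenter, dim = 4, 256, 300
# One codebook of 256 subcenters per subspace, stacked row-wise
C = np.concatenate([np.random.randn(n_subcenter, dim) for _ in range(n_subspace)],
                   axis=0)
print(C.shape)  # (1024, 300)
```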
Sorry for my late reply.
I have got GoogleNews-vectors-negative300.bin, and I wonder how to generate the word2vec.txt for the CIFAR-10 dataset.
You can use gensim to load the model and extract word vectors. Here is a tutorial.
import gensim
# Load the pre-trained GoogleNews word2vec model (binary format)
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# Look up the 300-dimensional vector of a single word
print(model['car'])
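To build a text file for the CIFAR-10 classes, you can look each class name up individually and write one vector per line. A sketch, with toy random vectors standing in for model[name]; the one-vector-per-line format is an assumption about what DVSQ expects:

```python
import numpy as np

classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']
# Stand-in vectors; with the real model, use: vecs = {c: model[c] for c in classes}
vecs = {c: np.random.randn(300) for c in classes}

with open('word2vec.txt', 'w') as f:
    for c in classes:
        # One class per line, 300 space-separated values
        f.write(' '.join('%.6f' % x for x in vecs[c]) + '\n')
```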
Thanks for your help. However, I just tried model['airplane', ...] (including the 10 classes of CIFAR-10), and the .txt I get is wrong. I hope to know how to get the correct word vectors.