Datasets in .pkl format?
jazzsaxmafia opened this issue · 40 comments
Hello, thank you for sharing this great project.
I would like to run the code, but it seems like the project does not contain the datasets used. I can get the Flickr or COCO datasets themselves, but I do not know how the data was preprocessed into those .pkl files.
Can I possibly get the data as it is used in the project?
Thank you.
Hey, thanks for your question. Unfortunately, the preprocessed datasets are still quite large, so we have no resources to host all of them at the moment. What we can do, however, is add some preprocessing instructions so that you can extract the same features using an open source tool. We will try to do so in the next few days.
@jazzsaxmafia @kelvinxu I've also encountered the same problem.
@kelvinxu Could you provide some basic information about the pkl files? For example, what is in each pkl file and how it is structured. Thank you very much. Preprocessing instructions would be even better, if they won't take too long.
@leo-zhou Just for reference, here is my tentative guess:

data.pkl -> first dump: cap, second dump: feat
- cap -> [[sentence, feature #], ...]
- feat -> scipy.sparse.csr_matrix(vgg_conv4), shape (N, L x D) (earlier guess: numpy.array(vgg_conv4), shape (N, 1, L x D))

dictionary.pkl -> {word: word #, ...} (earlier guess: [[word, word #], ...])
(word # starts from 2 (= frequency rank + 1); 0 and 1 are reserved by the program.)

Updated based upon other comments.
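Putting the pieces above together, a minimal loading sketch (file names here are examples, not necessarily the exact ones the repo uses):

```python
import cPickle as pkl

# data.pkl holds two sequential dumps.
with open('coco_align.train.pkl', 'rb') as f:
    cap = pkl.load(f)    # [[sentence, feature #], ...]
    feat = pkl.load(f)   # scipy.sparse.csr_matrix, shape (N, L x D)

with open('dictionary.pkl', 'rb') as f:
    worddict = pkl.load(f)  # {word: word #}, ids start at 2

sentence, feat_idx = cap[0]
row = feat[feat_idx]  # one sparse (1, L x D) feature row
```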
@kyunghyuncho Oh, I got it. So that is why you used ff.todense() in coco.py, line 38. Thanks!
@jnhwkim @kyunghyuncho Thanks a lot !
Thank you very much. I think that was enough for me to set up the data myself.
@jnhwkim, a very minor addition to prevent confusion: dictionary.pkl doesn't load a list but a Python dictionary of the form {word: word #}. This is probably what you meant. Thanks!
Any news on the preprocessing instructions, or even an upload of the preprocessed datasets? Great library, though a bit more documentation would be welcome.
Hey samim23,
The feature extraction procedure was described in the paper (you should extract conv5_4), but I agree that it should be explained and reproduced somewhere here in the repo.
Has anyone gotten the dataset conversion working? If so, it would be great if you could share the code. Will be trying this myself as well.
@asampat3090 I saw you have implemented the code of dataset conversion. Can you reproduce the results in Kelvin's paper? Thanks.
Hey guys, anyone succeeded in generating the pkl file? Any link would be very helpful! Thank you.
@cxj273 I haven't actually tried. I'll try this weekend. @ffmpbgrnn check out my code: I have a generator for the flickr_30k, but I haven't documented it much.
@asampat3090 I will have a look. Many thanks! :-)
@asampat3090 Would your code actually work though? The image ids refer to the whole image collection, whereas you point to an image feature in a subset using the index that is meant for the whole image collection. Or am I missing something?
I'm trying to port your code to the COCO dataset.
@asampat3090 From my understanding, line 54 is wrong. You can't get all the training captions using the training image idx. Correct me if I am wrong.
Hi, can I ask how large those .pkl files are? I tried to make them for the MSCOCO dataset, and the features from VGG for the training set alone take around 75GB. I stored them in scipy.sparse.csr_matrix. According to coco.py, it seems they all get loaded into memory together, so I was wondering if there is anything I was missing...
@xlhdh It should be something around 15 GB. They are all loaded into memory at once, but we unsparsify them one batch at a time. Are you unsparsifying them all at once?
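A sketch of what that per-batch unsparsification can look like (the 196 x 512 layout is the VGG conv-feature assumption from the paper, i.e. 14 x 14 x 512 = 100352; this is not code copied from the repo):

```python
import numpy as np

def iter_dense_batches(feat, batch_size=64):
    # feat is the full scipy.sparse.csr_matrix kept in memory;
    # only one batch at a time is ever densified.
    for start in range(0, feat.shape[0], batch_size):
        ff = feat[start:start + batch_size]
        dense = np.array(ff.todense(), dtype='float32')
        # reshape to (batch, L, D): 14 x 14 = 196 locations, 512 dims each
        yield dense.reshape((dense.shape[0], 196, 512))
```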
@kelvinxu The original weights were around 15GB, but once I pickle them, they got to like 75... And they were csr_matrix from top to toe. I guess I'll look at it again to see if there's any bug!
It's likely because you didn't use "protocol=cPickle.HIGHEST_PROTOCOL" as an argument to cPickle.dump.
- K
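For concreteness, a sketch of the dump, assuming cap and feat as described earlier in the thread:

```python
import cPickle

with open('coco_align.train.pkl', 'wb') as f:
    # Protocol 0 (the default) serializes to ASCII and inflates the file
    # several-fold; the binary protocol keeps the csr_matrix close to its
    # in-memory size.
    cPickle.dump(cap, f, protocol=cPickle.HIGHEST_PROTOCOL)
    cPickle.dump(feat, f, protocol=cPickle.HIGHEST_PROTOCOL)
```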
@kyunghyuncho Thank you, I used the highest protocol (I thought that was the default) and it worked! The only thing I wasn't able to do was dump the image features to disk all at once, so I had to read several files in and assemble them in memory.
@cxj273 @gamer13 sorry for the delay, I'm not sure I quite understood the issue. So I suppose there might be a mismatch between the "features" and "caps" variables in "prepare_data" here, but if I understand correctly you're saying we would need to re-index all of the image ids? If so, did you guys have any success doing that? I'm still trying to figure that out.
UPDATE: I believe I have reindexed it such that features are referenced properly. Does anyone else have working code?
@asampat3090 Thank you for sharing your script. I had trouble running this model and your code was very helpful. I am still struggling, but here are my suggestions for your code.
Suggestions
- It seems like the capgen.py train function requires a dictionary containing 'A', which means your vectorizer should be configured as follows (see the sketch after this list):
- vectorizer = CountVectorizer(analyzer=str.split, lowercase=False).fit(captions)
- Alternatively, you could lowercase the sentences in the caps file; that would be the better approach for the data-sparseness problem.
- You used conv5_3 for feature extraction. However, according to the paper, conv5_4 gives better features.
Thanks.
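A sketch of that vectorizer suggestion in context, assuming captions is a list of raw caption strings:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Whitespace tokenization with case preserved, so the token 'A'
# survives exactly as capgen.py expects.
vectorizer = CountVectorizer(analyzer=str.split, lowercase=False).fit(captions)
vocab = vectorizer.vocabulary_  # {token: id}; note these ids are NOT frequency-ranked
```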
@kyunghyuncho , @kelvinxu
Hello,
By using the default parameters for Coco, soft attention, I get much lower results on the test set than what was published: BLEU-1=0.545, METEOR=0.164, CIDEr=0.274.
The only difference I see is that early stopping is done on NLL. Can this cause such a big gap?
Also, my coco_align.train.pkl is about 6 GB, and not 15 GB.
Thanks!
I observe that in the function prepare_data() (line 40 of flickr30k.py), the code sets all words with an id larger than n_words to 1 (UNK). Therefore, when we create the dictionary, we should assign ids in descending order of word frequency, giving smaller ids to more frequent words.
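A sketch of a dictionary built that way, assuming captions is a list of caption strings (ids 0 and 1 stay reserved, with 1 being UNK, as noted earlier):

```python
from collections import Counter

counts = Counter(w for sent in captions for w in sent.split())
# most_common() yields words in descending frequency, so the most
# frequent word gets id 2, the next id 3, and so on.
worddict = dict((w, i + 2) for i, (w, _) in enumerate(counts.most_common()))
```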
@asampat3090 In your make_flickr_data.py, you used CountVectorizer from scikit-learn, which does not assign word ids by descending frequency. This might be why there are too many UNKs in the training data.
Yes, the dictionary has IDs in descending frequency order.
Any idea about why I'm getting so much lower metrics on Coco (see my comment above)?
Thanks.
Hey all, I've created a script that appears to work for preprocessing. The source is here. It does everything besides create the word-ID dictionary.
Thanks @rowanz
What metric values do you get, for example for Coco with soft attention?
Hi @intuinno, I'm trying to run your prepare_caffe_and_dictionary_coco.ipynb. Could you please explain what the file dataset_coco.json is?
I forked @intuinno's work and added some code and a simple doc in the README.md (no need for dataset_coco.json):
https://github.com/Lorne0/arctic-captions
Hope it's helpful.
Just run this one-line script to generate the file dictionary.pkl:
cat flickr8k/Flickr8k_text/Flickr8k.token.txt | awk -F '\t' '{print $2}' | awk '{for(i=1;i<=NF;i++) print $i}' | sort | uniq -c | sort -nr | awk '{print $2,NR+1}' | python -c "import sys; import cPickle as pkl; pkl.dump(dict((w, int(i)) for w, i in (line.split() for line in sys.stdin)), open('features/dictionary.pkl', 'wb'))"
Hello @Lorne0, thank you so much for your code. It helped me a lot in reproducing the project.
I am looking into the code and have a question about preprocess.sh: why crop to 224*224 instead of staying at 256*256?
Thanks
Hi, @athenspeterlong. Because the pretrained CNN requires 224*224 input, we have to crop the images first before feeding them to the CNN.
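For reference, a typical center crop from 256x256 down to the 224x224 the VGG input layer expects (a sketch; preprocess.sh may do it differently):

```python
def center_crop(img, size=224):
    # img is an H x W x C numpy array, e.g. the resized 256x256 image.
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]
```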
Hi @intuinno, thank you for sharing the preprocessing code. I am using the Flickr8k dataset and was able to build the necessary .pkl files and dictionary using prepare_flickr8k.py.
Now I am trying to run the train function using evaluate_flickr8k.py, but I am getting "coo_matrix object does not support indexing" at flickr8k.py, line 16.
Any idea why this is happening?
Thanks
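One likely cause, for what it's worth: scipy's coo_matrix does not support indexing at all, so converting it to CSR once after loading should fix this (a sketch, assuming feat is the loaded feature matrix):

```python
import scipy.sparse

if scipy.sparse.isspmatrix_coo(feat):
    feat = feat.tocsr()  # CSR supports the row indexing flickr8k.py does
```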
@Lorne0, I have tried to reproduce your results using your code. When I run prepare_model_coco.py, an error happens:
val (5000, 100352)
train (5000, 100352)
Traceback (most recent call last):
  File "prepare_model_coco.py", line 70, in <module>
    result = np.empty((numImage, 100352))
MemoryError
I don't know why this happens. Thanks!
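One possible workaround (an assumption about the script, not a fix confirmed by its author) is to avoid allocating one contiguous dense float64 block, e.g. with a float32 disk-backed memmap:

```python
import numpy as np

# 5000 x 100352 in float64 needs roughly 4 GB in one block; a float32
# memmap halves that and lives on disk (the file name is an example,
# numImage as in prepare_model_coco.py).
result = np.memmap('coco_train_feats.dat', dtype='float32',
                   mode='w+', shape=(numImage, 100352))
```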