Datasets in .pkl format?
jazzsaxmafia opened this issue · 40 comments
Hello, thank you for sharing this great project.
I would like to run the code, but it seems like the project does not contain the datasets used. I can get the Flickr or COCO datasets themselves, but I do not know how the data was preprocessed into those .pkl files.
Can I possibly get the data as it is used in the project?
Thank you.
Hey, thanks for your question. Unfortunately, the preprocessed datasets are still quite large, so we have no resources to host all of them at the moment. What we can do, however, is add some preprocessing instructions so that you can extract the same features using an open source tool. We will try to do so in the next few days.
@jazzsaxmafia @kelvinxu I've also encountered the same problem.
@kelvinxu Could you provide some basic information about the pkl files? For example, what is in each pkl file and how it is structured. Thank you very much. Preprocessing instructions would be even better, if they won't take too long.
@leo-zhou Just for reference, here is my tentative guess:

data.pkl -> first dump: cap, second dump: feat
- cap -> [[sentence, feature #], ...]
- feat -> scipy.sparse.csr_matrix(vgg_conv4), shape (N, L x D) (earlier guess: numpy.array(vgg_conv4), shape (N, 1, L x D))

dictionary.pkl -> {word: word #, ...} (earlier guess: [[word, word #], ...])
(word # starts from 2 (= frequency rank + 1); 0 and 1 are reserved by the program.)

Updated based upon other comments.
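Putting the pieces above together, a minimal loading sketch (file names here are examples, not necessarily the exact ones the repo uses):

```python
import cPickle as pkl

# data.pkl holds two sequential dumps.
with open('coco_align.train.pkl', 'rb') as f:
    cap = pkl.load(f)    # [[sentence, feature #], ...]
    feat = pkl.load(f)   # scipy.sparse.csr_matrix, shape (N, L x D)

with open('dictionary.pkl', 'rb') as f:
    worddict = pkl.load(f)  # {word: word #}, ids start at 2

sentence, feat_idx = cap[0]
row = feat[feat_idx]  # one sparse (1, L x D) feature row
```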
@kyunghyuncho Oh, I got it. So that is why you used ff.todense() in coco.py, line 38. Thanks!
@jnhwkim @kyunghyuncho Thanks a lot !
Thank you very much. I think that was enough for me to set up the data myself.
@jnhwkim, a very minor addition to prevent confusion: dictionary.pkl doesn't load a list but a Python dictionary of the form {word: word #}. This is probably what you meant. Thanks!
Any news on the preprocessing instructions, or even an upload of the preprocessed datasets? Great library, though a bit more documentation would be welcome.
Hey samim23,
The feature extraction procedure was described in the paper (you should extract conv5_4), but I agree that it should be explained and reproduced somewhere here in the repo.
Has anyone gotten the dataset conversion working? If so, it would be great if you could share the code. Will be trying this myself as well.
@asampat3090 I saw you have implemented the code of dataset conversion. Can you reproduce the results in Kelvin's paper? Thanks.
Hey guys, anyone succeeded in generating the pkl file? Any link would be very helpful! Thank you.
@cxj273 I haven't actually tried. I'll try this weekend. @ffmpbgrnn check out my code: I have a generator for the flickr_30k, but I haven't documented it much.
@asampat3090 I will have a look. Many thanks! :-)
@asampat3090 Would your code actually work though? The image ids refer to the whole image collection, whereas you point to an image feature in a subset using the index that is meant for the whole image collection. Or am I missing something?
I'm trying to port your code to the COCO dataset.
@asampat3090 From my understanding, line 54 is wrong. You can't get all the training captions using the training image idx. Correct me if I am wrong.
Hi, can I ask how large those .pkl files are? I tried to make them for the MSCOCO dataset, and the features from VGG for the training set alone take around 75GB. I stored them in scipy.sparse.csr_matrix. According to coco.py, it seems they all get loaded into memory together, so I was wondering if there is anything I was missing...
@xlhdh It should be something around 15 GB. They are all loaded into memory at once, but we unsparsify them one batch at a time. Are you unsparsifying them all at once?
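A sketch of what that per-batch unsparsification can look like (the 196 x 512 layout is the VGG conv-feature assumption from the paper, i.e. 14 x 14 x 512 = 100352; this is not code copied from the repo):

```python
import numpy as np

def iter_dense_batches(feat, batch_size=64):
    # feat is the full scipy.sparse.csr_matrix kept in memory;
    # only one batch at a time is ever densified.
    for start in range(0, feat.shape[0], batch_size):
        ff = feat[start:start + batch_size]
        dense = np.array(ff.todense(), dtype='float32')
        # reshape to (batch, L, D): 14 x 14 = 196 locations, 512 dims each
        yield dense.reshape((dense.shape[0], 196, 512))
```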
@kelvinxu The original weights were around 15GB, but once I pickle them, they got to like 75... And they were csr_matrix from top to toe. I guess I'll look at it again to see if there's any bug!
It's likely because you didn't use "protocol=cPickle.HIGHEST_PROTOCOL" as an argument to cPickle.dump.
- K
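For concreteness, a sketch of the dump, assuming cap and feat as described earlier in the thread:

```python
import cPickle

with open('coco_align.train.pkl', 'wb') as f:
    # Protocol 0 (the default) serializes to ASCII and inflates the file
    # several-fold; the binary protocol keeps the csr_matrix close to its
    # in-memory size.
    cPickle.dump(cap, f, protocol=cPickle.HIGHEST_PROTOCOL)
    cPickle.dump(feat, f, protocol=cPickle.HIGHEST_PROTOCOL)
```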
@kyunghyuncho Thank you, I used the highest protocol (I thought that was the default) and it worked! The only thing I wasn't able to do was dump the image features to disk all at once, so I had to read several files in and assemble them in memory.
@cxj273 @gamer13 sorry for the delay, I'm not sure I quite understood the issue. So I suppose there might be a mismatch between the "features" and "caps" variables in "prepare_data" here, but if I understand correctly you're saying we would need to re-index all of the image ids? If so, did you guys have any success doing that? I'm still trying to figure that out.
UPDATE: I believe I have reindexed it such that features are referenced properly. Does anyone else have working code?
@asampat3090 Thank you for sharing your script. I had trouble running this model and your code was very helpful. I am still struggling, but here are my suggestions for your code.
Suggestions
- It seems like the capgen.py train function requires a dictionary containing 'A', which means your vectorizer should be configured as follows (see the sketch after this list):
- vectorizer = CountVectorizer(analyzer=str.split, lowercase=False).fit(captions)
- Alternatively, you could lowercase the sentences in the caps file; that would be the better approach for the data-sparseness problem.
- You used conv5_3 for feature extraction. However, according to the paper, conv5_4 gives better features.
Thanks.
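A sketch of that vectorizer suggestion in context, assuming captions is a list of raw caption strings:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Whitespace tokenization with case preserved, so the token 'A'
# survives exactly as capgen.py expects.
vectorizer = CountVectorizer(analyzer=str.split, lowercase=False).fit(captions)
vocab = vectorizer.vocabulary_  # {token: id}; note these ids are NOT frequency-ranked
```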
@kyunghyuncho , @kelvinxu
Hello,
By using the default parameters for Coco, soft attention, I get much lower results on the test set than what was published: BLEU-1=0.545, METEOR=0.164, CIDEr=0.274.
The only difference I see is that early stopping is done on NLL. Can this cause such a big gap?
Also, my coco_align.train.pkl is about 6 GB, and not 15 GB.
Thanks!
I observe that in the function prepare_data() (line 40 of flickr30k.py), the code sets all words with an id larger than n_words to 1 (UNK). Therefore, when we create the dictionary, we should assign ids in descending order of word frequency, giving smaller ids to more frequent words.
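A sketch of a dictionary built that way, assuming captions is a list of caption strings (ids 0 and 1 stay reserved, with 1 being UNK, as noted earlier):

```python
from collections import Counter

counts = Counter(w for sent in captions for w in sent.split())
# most_common() yields words in descending frequency, so the most
# frequent word gets id 2, the next id 3, and so on.
worddict = dict((w, i + 2) for i, (w, _) in enumerate(counts.most_common()))
```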
@asampat3090 In your make_flickr_data.py, you used CountVectorizer from scikit-learn, which does not assign word ids by descending frequency. This might be why there are too many UNKs in the training data.
Yes, the dictionary has IDs in descending frequency order.
Any idea about why I'm getting so much lower metrics on Coco (see my comment above)?
Thanks.
Hey all, I've created a script that appears to work for preprocessing. The source is here. It does everything besides create the word-ID dictionary.
Thanks @rowanz
What metric values do you get, for example for Coco with soft attention?
Hi @intuinno, I'm trying to run your prepare_caffe_and_dictionary_coco.ipynb. Could you please explain what the file dataset_coco.json is?
I forked @intuinno's work and added some code and a simple doc in the README.md (no need for dataset_coco.json):
https://github.com/Lorne0/arctic-captions
Hope it's helpful.
Just run this one-line script to generate the file dictionary.pkl:
cat flickr8k/Flickr8k_text/Flickr8k.token.txt | awk -F '\t' '{print $2}' | awk '{for(i=1;i<=NF;i++) print $i}' | sort | uniq -c | sort -nr | awk '{print $2,NR+1}' | python -c "import sys; import cPickle as pkl; pkl.dump(dict((w, int(i)) for w, i in (line.split() for line in sys.stdin)), open('features/dictionary.pkl', 'wb'))"
Hello @Lorne0, thank you so much for your code. It helped me a lot in reproducing the project.
I am looking into the code and have a question about preprocess.sh: why crop to 224*224 instead of staying at 256*256?
Thanks
Hi, @athenspeterlong. Because the pretrained CNN requires 224*224 input, we have to crop the images first before feeding them to the CNN.
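For reference, a typical center crop from 256x256 down to the 224x224 the VGG input layer expects (a sketch; preprocess.sh may do it differently):

```python
def center_crop(img, size=224):
    # img is an H x W x C numpy array, e.g. the resized 256x256 image.
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]
```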
Hi @intuinno, thank you for sharing the preprocessing code. I am using the Flickr8k dataset and was able to build the necessary .pkl files and dictionary using prepare_flickr8k.py.
Now I am trying to run the train function using evaluate_flickr8k.py, but I am getting "coo_matrix object does not support indexing" at flickr8k.py, line 16.
Any idea why this is happening?
Thanks
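One likely cause, for what it's worth: scipy's coo_matrix does not support indexing at all, so converting it to CSR once after loading should fix this (a sketch, assuming feat is the loaded feature matrix):

```python
import scipy.sparse

if scipy.sparse.isspmatrix_coo(feat):
    feat = feat.tocsr()  # CSR supports the row indexing flickr8k.py does
```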
@Lorne0, I have tried to reproduce your results using your code. When I run prepare_model_coco.py, an error happens:
val (5000, 100352)
train (5000, 100352)
Traceback (most recent call last):
  File "prepare_model_coco.py", line 70, in <module>
    result = np.empty((numImage, 100352))
MemoryError
I don't know why this happens. Thanks!
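One possible workaround (an assumption about the script, not a fix confirmed by its author) is to avoid allocating one contiguous dense float64 block, e.g. with a float32 disk-backed memmap:

```python
import numpy as np

# 5000 x 100352 in float64 needs roughly 4 GB in one block; a float32
# memmap halves that and lives on disk (the file name is an example,
# numImage as in prepare_model_coco.py).
result = np.memmap('coco_train_feats.dat', dtype='float32',
                   mode='w+', shape=(numImage, 100352))
```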