rhgao/Deep-MIML-Network

how to get the video level "weak" label

Opened this issue · 5 comments

Dear Mr. Gao
Thank you so much for the great work. However, I ran into some problems while implementing this code.
As described in your article, "For the visual frames, we use an ImageNet pre-trained ResNet-152 network [34] to make object category predictions, and we max-pool over predictions of all frames to obtain a video-level prediction. The top labels (with class probability larger than a threshold = 0.3) are used as weak "labels" for the unlabeled video."
However, when I use the pre-trained ResNet-152 network, I get only one category prediction larger than the threshold. How can I get multiple labels through the pre-trained ResNet-152 network?
Should I train an object detection network, a multi-class multi-label network, or is there some other solution? Thank you for your assistance.
Best regards!

rhgao commented

Hi,

We didn't use all 1000 ImageNet classes, but ~20 selected audio-related classes. We then normalize the class probabilities over these classes, so you can get multiple labels with class probability larger than the threshold. Also, 0.3 is just empirical.

Thanks for your interest!
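The recipe described above (max-pool per-frame predictions, restrict to the audio-related subset, renormalize, then threshold at 0.3) can be sketched as follows. This is an illustrative reconstruction, not the released code; the function name and the class-index list are hypothetical, and the actual ~20-class subset comes from the paper.

```python
import numpy as np

def weak_labels(frame_logits, audio_class_idx, threshold=0.3):
    """Derive video-level weak labels from per-frame classifier outputs.

    frame_logits: (num_frames, 1000) raw ResNet-152 logits.
    audio_class_idx: indices of the selected audio-related ImageNet
                     classes (hypothetical subset for illustration).
    """
    # Softmax per frame to get class probabilities.
    e = np.exp(frame_logits - frame_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    video_probs = probs.max(axis=0)            # max-pool over all frames
    sub = video_probs[audio_class_idx]         # keep only audio-related classes
    sub = sub / sub.sum()                      # renormalize over the subset
    # Every renormalized class above the threshold becomes a weak label.
    return [c for c, p in zip(audio_class_idx, sub) if p > threshold]
```

Because the probabilities are renormalized over only ~20 classes rather than all 1000, several classes can clear the 0.3 threshold at once, which is how a video ends up with multiple weak labels.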

@rhgao
Thanks for your reply! I will try it.

Dear Mr. Gao
Thank you so much for the great work. However, I ran into another problem while implementing this code.
As described in your paper, "we collect a maximum of 3,000 basis vectors for each object category" and "In other words, we concatenate the basis vectors learnt for each detected object to construct the basis dictionary W(q). Next, in the NMF algorithm, we hold W(q) fixed, and only estimate the activation H(q) with multiplicative update rules."
However, what is the shape of the selected W(q)(j)? Is it also M×K (K=25)? And how do you select K basis vectors from the 3,000 stored basis vectors?

rhgao commented

Hi, we use all the collected basis vectors to initialize W, namely M × K with M = 3000, K = 25. 3,000 is just a hyperparameter, and a larger number of basis vectors could potentially lead to better results.
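A minimal sketch of the H-only estimation described above, assuming the standard Euclidean-objective multiplicative update (the paper's exact NMF objective may differ); the function name and dimensions are illustrative:

```python
import numpy as np

def nmf_fixed_W(V, W, n_iter=100, eps=1e-9):
    """Estimate activations H with the basis dictionary W held fixed.

    V: (F, T) nonnegative magnitude spectrogram.
    W: (F, K) concatenated basis vectors from the detected objects.
    Only H is updated; W is never modified, matching the fixed-W
    NMF step described in the thread.
    """
    K = W.shape[1]
    T = V.shape[1]
    rng = np.random.default_rng(0)
    H = rng.random((K, T)) + eps               # nonnegative random init
    WtW = W.T @ W                              # precompute, since W is fixed
    for _ in range(n_iter):
        # Lee-Seung multiplicative update for H under the Euclidean objective:
        # H <- H * (W^T V) / (W^T W H); eps guards against division by zero.
        H *= (W.T @ V) / (WtW @ H + eps)
    return H
```

With W fixed, each column of H solves an independent nonnegative least-squares problem, so the multiplicative update converges reliably; only the activations need to be estimated at test time.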

Thanks! Could you please share your train loss/mAP and val loss/mAP? My train loss is about 0.0001 and train mAP is about 0.72; my val loss is about 0.1 and val mAP is 0.65 after 300 iterations, with the same batchSize and valSize as yours. Is that normal?