gsig/charades-algorithms

A question about multiple labels for some frames

ecr23xx opened this issue · 4 comments

In the annotation file, more than one kind of action happens in one frame. For example, in sequence 46GP8, c092 and c147 are overlapping between 11.9s and 12.6s. So frames between this time period seem to have multiple labels.

46GP8,HR43,Kitchen,6,7,Yes,A person cooking on a stove while watching something out a window.,food;stove;window,A person cooks food on a stove before looking out of a window.,c092 11.90 21.20;c147 0.00 12.60,24.83

But the data loader you implemented seems to ignore this. Does that mean some frames are appended to image_paths multiple times?

Won't CrossEntropyLoss be confused, since some frames then have two or more different labels?

for x in label:
    for ii in range(0, n-1, GAP):
        if x['start'] < ii/float(FPS) < x['end']:
            impath = '{}/{}-{:06d}.jpg'.format(
                iddir, vid, ii+1)
            image_paths.append(impath)
            targets.append(cls2int(x['class']))
            ids.append(vid)

gsig commented

You are correct. The initial Charades models in fact trained with a Softmax loss, assuming a single label for each frame, which is often a bad approximation! It was just easier to train than a sigmoid loss. Therefore, if a frame has two labels at the same time, two separate training pairs are created (frame, label_1) and (frame, label_2). Newer Charades models use a proper sigmoid loss.
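To make the difference concrete, here is a small sketch (not code from the repo; class ids, file name, and helper names are illustrative) of how a frame with two overlapping actions is represented under each loss: the softmax formulation duplicates the frame into separate (frame, label) pairs, while the sigmoid formulation keeps one frame with a multi-hot target vector.

```python
# Sketch only: contrasts the two labeling schemes described above.
NUM_CLASSES = 157  # Charades has 157 action classes

def softmax_pairs(frame_path, labels):
    """Softmax loss: duplicate the frame, one (frame, label) pair per label."""
    return [(frame_path, c) for c in labels]

def sigmoid_target(labels):
    """Sigmoid (multi-label) loss: a single multi-hot target for the frame."""
    target = [0.0] * NUM_CLASSES
    for c in labels:
        target[c] = 1.0
    return target

# A frame where actions with class ids 92 and 147 (c092, c147) overlap:
pairs = softmax_pairs('46GP8-000300.jpg', [92, 147])   # two training pairs
target = sigmoid_target([92, 147])                     # one multi-hot vector
```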

You can check out the temporal-fields pytorch repo for an algorithm that uses sigmoid loss.

Hope that helps!

Thank you for your reply :)

Another question is, if I want to calculate mAP for each frame, should I use Charades_v1_classify.m or Charades_v1_localization.m? What's the difference between them?

gsig commented

Charades_v1_classify.m evaluates a single prediction for each video.

Charades_v1_localization.m evaluates a prediction for each frame in the video (that is, 25 frames sampled uniformly across the video).

So you should use Charades_v1_localization.m for calculating mAP for each frame.
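For intuition, here is a minimal pure-Python sketch of what a frame-level mAP computation looks like: per-class average precision over all evaluated frames, then a mean over classes. This is an illustrative re-implementation, not the actual MATLAB script; for real numbers, use Charades_v1_localization.m.

```python
# Sketch of frame-level mAP: rows are frames, columns are classes.
def average_precision(scores, labels):
    """AP for one class, given scores and binary labels over all frames."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)  # precision at each true positive
    return sum(precisions) / max(hits, 1)

def frame_map(score_matrix, label_matrix):
    """Mean of the per-class APs."""
    n_classes = len(score_matrix[0])
    aps = [average_precision([row[c] for row in score_matrix],
                             [row[c] for row in label_matrix])
           for c in range(n_classes)]
    return sum(aps) / n_classes

# Toy example: 3 frames, 2 classes, multi-hot ground truth.
scores = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
labels = [[1, 0], [0, 1], [1, 0]]
result = frame_map(scores, labels)
```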

Hope that helps!

Thank you 😊