Box shifting: some boxes may appear as background after tracking (when using dataloader_vidor.py)

Tips from @Dawn-LX :

This problem originates from

VidSGG-BIG/dataloaders/dataloader_vidor.py

Lines 488 to 508 in eaf7578

    
    for idx,box_info in enumerate(track_res):
        if not isinstance(box_info,list):
            box_info = box_info.tolist()
        assert len(box_info) == 6 or len(box_info) == 12 + self.dim_boxfeature,"len(box_info)=={}".format(len(box_info))

        frame_id = box_info[0]
        tid = box_info[1]
        tracklet_xywh = box_info[2:6]
        xmin_t,ymin_t,w_t,h_t = tracklet_xywh
        xmax_t = xmin_t + w_t
        ymax_t = ymin_t + h_t
        bbox_t = [xmin_t,ymin_t,xmax_t,ymax_t]
        confidence = float(0)
        if len(box_info) == 12 + self.dim_boxfeature:
            confidence = box_info[6]
            cat_id = box_info[7]
            xywh = box_info[8:12]
            xmin,ymin,w,h = xywh
            xmax = xmin+w
            ymax = ymin+h
            bbox = [(xmin+xmin_t)/2, (ymin+ymin_t)/2, (xmax+xmax_t)/2,(ymax+ymax_t)/2]

Here, the tracking result for each box in a given frame is either a 6-dim vector or a (12+dim_boxfeature)-dim vector.

If the vector is 6-dim, the corresponding box is treated as background.
Otherwise, the first 12 dims of box_info (frame_id, tracklet_id, 4-dim bbox coordinates, confidence, category_id, 4-dim bbox coordinates) are used to determine the final location of the bbox.
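
The layout described above can be sketched as a small standalone parser. This is only an illustration, not code from the repository: `ParsedBox`, `xywh_to_xyxy`, and `parse_box_info` are hypothetical names, and `dim_boxfeature` is passed in explicitly instead of being read from `self`.

```python
from typing import List, NamedTuple, Optional, Sequence


class ParsedBox(NamedTuple):
    """Hypothetical container for one row of track_res (not in the repo)."""
    frame_id: int
    tid: int
    tracker_xyxy: List[float]             # from box_info[2:6]
    confidence: float                     # 0.0 for 6-dim rows
    cat_id: Optional[int]                 # None for 6-dim rows
    detector_xyxy: Optional[List[float]]  # from box_info[8:12], None for 6-dim rows


def xywh_to_xyxy(xywh: Sequence[float]) -> List[float]:
    """Convert [xmin, ymin, w, h] to [xmin, ymin, xmax, ymax]."""
    xmin, ymin, w, h = xywh
    return [xmin, ymin, xmin + w, ymin + h]


def parse_box_info(box_info: Sequence[float], dim_boxfeature: int) -> ParsedBox:
    """Split one row into its fields, following the layout described above."""
    n = len(box_info)
    if n not in (6, 12 + dim_boxfeature):
        raise ValueError("len(box_info)=={}".format(n))

    frame_id, tid = int(box_info[0]), int(box_info[1])
    tracker_xyxy = xywh_to_xyxy(box_info[2:6])

    if n == 6:
        # Short row: only the tracker box is available.
        return ParsedBox(frame_id, tid, tracker_xyxy, 0.0, None, None)

    # Long row: confidence, category and the detector box follow.
    return ParsedBox(
        frame_id,
        tid,
        tracker_xyxy,
        float(box_info[6]),
        int(box_info[7]),
        xywh_to_xyxy(box_info[8:12]),
    )
```

For example, `parse_box_info([3, 1, 10, 20, 30, 40], dim_boxfeature=4)` yields a row whose `detector_xyxy` is `None`; the value 4 is chosen only to keep the example short.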

The first set of 4-dim bbox coordinates (box_info[2:6]) is generated by the tracker, and the second (box_info[8:12]) is generated by our video object detector. Boxes shift because we average these two sets of coordinates. The detected object's location may be inconsistent with the current tracklet, whereas the tracker-generated box is more precise, so averaging can merge the two boxes into one that covers background.
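
A small numeric sketch of this effect, using made-up coordinates: when the detector box linked to a tracklet lies far from the tracker box, their average lands between the two and overlaps neither. The helper names `average_boxes` and `iou` are illustrative and not taken from the repository.

```python
from typing import List, Sequence


def average_boxes(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Coordinate-wise mean of two [xmin, ymin, xmax, ymax] boxes."""
    return [(p + q) / 2 for p, q in zip(a, b)]


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Made-up example: the detector box was linked to the wrong tracklet.
tracker_box = [100.0, 100.0, 200.0, 200.0]
detector_box = [400.0, 100.0, 500.0, 200.0]

merged = average_boxes(tracker_box, detector_box)
print(merged)                     # [250.0, 100.0, 350.0, 200.0]
print(iou(merged, tracker_box))   # 0.0
print(iou(merged, detector_box))  # 0.0
```

The merged box sits in the gap between the two objects, which is how a box can end up on background after averaging.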

Specifically, the tracker-generated box is much more precise because it takes into account the boxes in previous frames, the currently detected box, and visual similarity. The box from the video object detector, however, may be wrongly linked to the current tracklet (which does not mean that it is itself a background box). Averaging is therefore not strictly correct in these cases, which is why we use only the tracker-generated box (box_info[2:6]) in

VidSGG-BIG/dataloaders/dataloader_vidor_v3.py

Lines 414 to 421 in eaf7578

    
    tracklet_xywh = box_info[2:6]
    xmin_t,ymin_t,w_t,h_t = tracklet_xywh
    xmax_t = xmin_t + w_t
    ymax_t = ymin_t + h_t
    confidence = box_info[6]
    bbox_t = [xmin_t,ymin_t,xmax_t,ymax_t,confidence]
    cat_id = box_info[7]
    # xywh = box_info[8:12]


However, tracklet_mAP does not improve when switching from averaging the two boxes to using only the tracker-generated box. Possible reasons:

Box shifting is rare, so the fix has little effect on final performance.
In most cases, averaging may be a more precise way to combine these two kinds of boxes, so using only the tracker-generated box may lose some accuracy.
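
One way to check the first explanation would be to count how often the tracker box and the detector box of the same row disagree strongly. The sketch below is a hypothetical diagnostic, not part of the repository: `shift_rate` and the 0.5 IoU threshold are assumptions chosen for illustration, and the input is a list of (tracker box, detector box) pairs in [xmin, ymin, xmax, ymax] form.

```python
from typing import Iterable, Sequence, Tuple

Box = Sequence[float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two [xmin, ymin, xmax, ymax] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shift_rate(pairs: Iterable[Tuple[Box, Box]], iou_threshold: float = 0.5) -> float:
    """Fraction of (tracker, detector) pairs whose IoU is below the threshold.

    The threshold is an assumed value; a low IoU is used here as a rough
    proxy for a detector box that was linked to the wrong tracklet.
    """
    total = 0
    shifted = 0
    for tracker_box, detector_box in pairs:
        total += 1
        if iou(tracker_box, detector_box) < iou_threshold:
            shifted += 1
    return shifted / total if total else 0.0
```

A low rate on real tracking results would be consistent with the first explanation; it would not by itself say anything about the second.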
