Dawn-LX/VidSGG-BIG

Box shifting: some boxes may appear as background after tracking (when using dataloader_vidor.py)

Closed this issue · 0 comments

Tips from @Dawn-LX :

This problem originates from the following code in dataloader_vidor.py:

for idx,box_info in enumerate(track_res):
    if not isinstance(box_info,list):
        box_info = box_info.tolist()
    assert len(box_info) == 6 or len(box_info) == 12 + self.dim_boxfeature,"len(box_info)=={}".format(len(box_info))
    frame_id = box_info[0]
    tid = box_info[1]
    tracklet_xywh = box_info[2:6]
    xmin_t,ymin_t,w_t,h_t = tracklet_xywh
    xmax_t = xmin_t + w_t
    ymax_t = ymin_t + h_t
    bbox_t = [xmin_t,ymin_t,xmax_t,ymax_t]
    confidence = float(0)
    if len(box_info) == 12 + self.dim_boxfeature:
        confidence = box_info[6]
        cat_id = box_info[7]
        xywh = box_info[8:12]
        xmin,ymin,w,h = xywh
        xmax = xmin+w
        ymax = ymin+h
        bbox = [(xmin+xmin_t)/2, (ymin+ymin_t)/2, (xmax+xmax_t)/2,(ymax+ymax_t)/2]

Here, the tracking result for each box in a given frame is either a 6-dim vector or a (12+dim_boxfeature)-dim vector.

  1. If the vector is 6-dim, the corresponding box is treated as background.
  2. Otherwise, the first 12 dims of box_info (frame_id, tracklet_id, 4-dim bbox coordinates, confidence, category_id, and a second set of 4-dim bbox coordinates) are used to determine the final location of the bbox.
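The two layouts described above can be sketched as a small standalone parser. This is only an illustration, not code from the repository: the names ParsedBox, xywh_to_xyxy and parse_box_info are hypothetical, dim_boxfeature is passed as an argument instead of being read from self, and the value 4 used in the example rows is an arbitrary toy value.

```python
from typing import List, NamedTuple, Optional, Sequence


class ParsedBox(NamedTuple):
    """Hypothetical container for one row of track_res."""
    frame_id: int
    tid: int
    tracker_xyxy: List[float]             # from box_info[2:6]
    confidence: float                     # 0.0 for 6-dim (background) rows
    cat_id: Optional[int]                 # None for 6-dim rows
    detector_xyxy: Optional[List[float]]  # from box_info[8:12], None for 6-dim rows


def xywh_to_xyxy(xywh: Sequence[float]) -> List[float]:
    """Convert [xmin, ymin, w, h] to [xmin, ymin, xmax, ymax]."""
    xmin, ymin, w, h = xywh
    return [xmin, ymin, xmin + w, ymin + h]


def parse_box_info(box_info: Sequence[float], dim_boxfeature: int) -> ParsedBox:
    """Split one row into its fields, following the layout described above."""
    n = len(box_info)
    if n not in (6, 12 + dim_boxfeature):
        raise ValueError("len(box_info)=={}".format(n))
    frame_id, tid = int(box_info[0]), int(box_info[1])
    tracker_xyxy = xywh_to_xyxy(box_info[2:6])
    if n == 6:
        # Tracker-only row: no detection attached, treated as background.
        return ParsedBox(frame_id, tid, tracker_xyxy, 0.0, None, None)
    return ParsedBox(
        frame_id,
        tid,
        tracker_xyxy,
        float(box_info[6]),
        int(box_info[7]),
        xywh_to_xyxy(box_info[8:12]),
    )


# Toy rows (made-up numbers): a 6-dim row and a (12 + 4)-dim row.
short_row = [3, 1, 10, 20, 30, 40]
full_row = [3, 1, 10, 20, 30, 40, 0.9, 5, 12, 22, 30, 40] + [0.0] * 4
parsed_short = parse_box_info(short_row, dim_boxfeature=4)
parsed_full = parse_box_info(full_row, dim_boxfeature=4)
```

Keeping the tracker box and the detector box as separate fields makes it explicit which of the two is used downstream.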

The first set of bbox coordinates (box_info[2:6]) is generated by the tracker, and the second (box_info[8:12]) is generated by our video object detector. Boxes shift because we average these two sets of coordinates. The detected object location may be inconsistent with the current tracklet, while the tracker-generated one is more precise, so averaging may merge the two boxes into one that covers only background.
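A toy calculation may make the shift concrete. The sketch below uses made-up coordinates and hypothetical helper names (xywh_to_xyxy, average_boxes, iou); it is not taken from the repository. It averages a tracker box with a detection that was wrongly linked to the tracklet, then with a detection that agrees with it.

```python
def xywh_to_xyxy(xywh):
    """Convert [xmin, ymin, w, h] to [xmin, ymin, xmax, ymax]."""
    xmin, ymin, w, h = xywh
    return [xmin, ymin, xmin + w, ymin + h]


def average_boxes(box_a, box_b):
    """Coordinate-wise mean of two xyxy boxes (the averaging manner)."""
    return [(a + b) / 2 for a, b in zip(box_a, box_b)]


def iou(box_a, box_b):
    """Intersection over union of two xyxy boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


tracker_box = xywh_to_xyxy([100, 100, 100, 100])

# Case 1: the detection is wrongly linked and lies far from the tracklet.
mislinked_det = xywh_to_xyxy([300, 100, 100, 100])
shifted = average_boxes(tracker_box, mislinked_det)
# The averaged box sits between the two objects and overlaps neither of them.
print(shifted, iou(shifted, tracker_box), iou(shifted, mislinked_det))

# Case 2: the detection agrees with the tracklet, so averaging changes little.
consistent_det = xywh_to_xyxy([104, 98, 100, 100])
merged = average_boxes(tracker_box, consistent_det)
print(merged, iou(merged, tracker_box))
```

In the first case the averaged box has zero overlap with both inputs, which is the situation where a box ends up on background. In the second case the averaged box stays close to the tracker box.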

Specifically, the box generated by the tracker is much more precise because it takes into account the boxes in previous frames, the currently detected box, and visual similarity. The box from the video object detector, however, may be wrongly linked to the current tracklet (which does not mean that it is a background box itself). Averaging is therefore not strictly correct in these cases, which is why we use only the tracker-generated box (box_info[2:6]) in

tracklet_xywh = box_info[2:6]
xmin_t,ymin_t,w_t,h_t = tracklet_xywh
xmax_t = xmin_t + w_t
ymax_t = ymin_t + h_t
confidence = box_info[6]
bbox_t = [xmin_t,ymin_t,xmax_t,ymax_t,confidence]
cat_id = box_info[7]
# xywh = box_info[8:12]
.

However, tracklet_mAP does not improve when switching from the averaging manner to the unique (tracker-only) manner. Possible reasons are:

  1. Cases of box shifting are rare, so the final performance benefits little from this fix.
  2. Averaging may be a more precise way to combine or choose between these two kinds of boxes in most cases, so the unique manner may lose some accuracy.