antoyang/TubeDETR

About m_sIoU

zanglam opened this issue · 1 comments

Hi, thank you for your excellent work! I have a question about the m_sIoU reported in your paper.
The spatial grounding accuracy inside the predicted time span (t_s, t_e) can be estimated as m_vIoU / m_tIoU. However, I observed that for your model m_sIoU << m_vIoU / m_tIoU (e.g., for HC-STVG2.0 with resolution 352 and temporal stride 4, m_sIoU = 0.649 while m_vIoU / m_tIoU = 0.467 / 0.539 = 0.866). This suggests that for frames outside the predicted time span (t_s, t_e), the IoU between the predicted and ground-truth bounding boxes is very low. I find this quite interesting. Could you provide some analysis or explanation?

As mentioned in issue #3, the calculated viou is higher than the correct value, whereas the tiou (which the code computes without the incorrect max_end) and the siou are calculated correctly. This is why m_sIoU << m_vIoU / m_tIoU.
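
For readers following the thread, here is a minimal sketch of how the three per-video quantities relate. It assumes the usual spatio-temporal grounding definitions (tIoU as the frame-level intersection over union of the predicted and ground-truth segments, vIoU as the box IoU summed over the intersection frames and divided by the union size, sIoU as the box IoU averaged over the ground-truth frames). It is not the repository's evaluation code; `box_iou`, `tube_metrics`, and the toy boxes are hypothetical names and values chosen for illustration.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def tube_metrics(
    pred_boxes: Dict[int, Box],   # assumed: one predicted box per frame index
    pred_span: Tuple[int, int],   # predicted (t_s, t_e), inclusive frame indices
    gt_boxes: Dict[int, Box],     # ground-truth boxes, ground-truth frames only
) -> Dict[str, float]:
    pred_frames = set(range(pred_span[0], pred_span[1] + 1))
    gt_frames = set(gt_boxes)
    inter = pred_frames & gt_frames
    union = pred_frames | gt_frames

    def iou_at(t: int) -> float:
        # A frame without a predicted box contributes an IoU of 0.
        return box_iou(pred_boxes[t], gt_boxes[t]) if t in pred_boxes else 0.0

    tiou = len(inter) / len(union)
    viou = sum(iou_at(t) for t in inter) / len(union)
    siou = sum(iou_at(t) for t in gt_frames) / len(gt_frames)
    return {"tiou": tiou, "viou": viou, "siou": siou}


# Toy video: ground truth on frames 2..5, predicted span 4..7.
gt = {t: (0.0, 0.0, 10.0, 10.0) for t in range(2, 6)}
pred = {t: (0.0, 0.0, 5.0, 10.0) for t in (2, 3)}          # IoU 0.5 outside the span
pred.update({t: (0.0, 0.0, 10.0, 10.0) for t in range(4, 8)})  # perfect inside the span
m = tube_metrics(pred, (4, 7), gt)
ratio = m["viou"] / m["tiou"]  # mean box IoU over the intersection frames
```

For a single video, vIoU / tIoU is exactly the mean box IoU over the frames shared by the predicted and ground-truth segments, while sIoU averages over all ground-truth frames. In the toy case the ratio is 1.0 and sIoU is 0.75. At dataset level, m_vIoU / m_tIoU is a ratio of means rather than a mean of per-video ratios, so it is only an estimate, and an overestimated vIoU inflates it further.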