ACM MM 2023
In Video-Language (VL) learning tasks, a large share of text annotations describe geometrical relationships among instances (e.g., 19.6% to 45.0% in MSVD, MSR-VTT, MSVD-QA, and MSRVTT-QA), and these annotations often become the bottleneck of current VL tasks (e.g., 60.8% vs. 98.2% CIDEr on MSVD for geometrical vs. non-geometrical annotations). Given the rich spatial information in depth maps, an intuitive solution is to enrich conventional 2D visual representations with depth information through current SOTA models, i.e., transformers. However, computing self-attention over a long-range sequence of heterogeneous video-level representations is cumbersome in terms of both computation cost and flexibility across frame scales. To tackle this, we propose a hierarchical transformer, termed Depth-Aware Sparse Transformer (DAST).
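To make the complexity argument concrete, the toy sketch below (not the paper's DAST; all shapes, module choices, and the per-frame pooling are assumptions for illustration) contrasts full self-attention over a flattened video token sequence with a simple two-level, frame-then-video attention, showing why a hierarchical scheme shrinks the attention cost.

```python
# Illustrative sketch only: NOT the paper's DAST. It compares full self-attention
# over all frame-patch tokens with a two-level (hierarchical) attention to show
# the complexity gap motivating a hierarchical design. Shapes are assumed values.
import torch
import torch.nn as nn

B, T, P, D = 2, 32, 49, 256            # batch, frames, patches per frame, feature dim (assumed)
tokens = torch.randn(B, T, P, D)       # e.g., 2D patch features, optionally fused with depth cues
attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)

# (a) Full self-attention over the flattened video: sequence length T*P = 1568,
#     so each head builds a (T*P) x (T*P) attention map (~2.5M entries).
flat = tokens.reshape(B, T * P, D)
full_out, _ = attn(flat, flat, flat)

# (b) Hierarchical alternative: attend within each frame (length P), then attend
#     across frames using one pooled token per frame (length T).
intra = tokens.reshape(B * T, P, D)
intra_out, _ = attn(intra, intra, intra)                       # cost ~ T * P^2
frame_tokens = intra_out.mean(dim=1).reshape(B, T, D)          # one token per frame (assumed pooling)
inter_out, _ = attn(frame_tokens, frame_tokens, frame_tokens)  # cost ~ T^2

print(full_out.shape, inter_out.shape)  # torch.Size([2, 1568, 256]) torch.Size([2, 32, 256])
```

Under these assumed sizes, the hierarchical variant replaces a single 1568-token attention with 32 attentions of length 49 plus one of length 32, which is the kind of saving the abstract alludes to when it calls full self-attention on video-level representations cumbersome.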