CVPR20PaperReading

Here is a summary of the Vision and Language papers at CVPR 2020, organized by task. Enjoy!

Image/Video Captioning

  1. Say As You Wish: Fine-Grained Control of Image Caption Generation With Abstract Scene Graphs (Oral, Shizhe Chen)
  2. Context-Aware Group Captioning via Self-Attention and Contrastive Features
  3. More Grounded Image Captioning by Distilling Image-Text Matching Model
  4. Show, Edit and Tell: A Framework for Editing Image Captions
  5. Normalized and Geometry-Aware Self-Attention Network for Image Captioning
  6. Meshed-Memory Transformer for Image Captioning
  7. Better Captioning With Sequence-Level Exploration (Qin Jin)
  8. X-Linear Attention Networks for Image Captioning (JD AI)
  9. Transform and Tell: Entity-Aware News Image Captioning
  10. Syntax-Aware Action Targeting for Video Captioning (Dacheng Tao)
  11. Spatio-Temporal Graph for Video Captioning With Knowledge Distillation
  12. Object Relational Graph With Teacher-Recommended Learning for Video Captioning

Image/Video-Text

  1. ActBERT: Learning Global-Local Video-Text Representations (Oral)
  2. Context-Aware Attention Network for Image-Text Retrieval
  3. Graph Structured Network for Image-Text Matching
  4. IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval
  5. Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning (Shizhe Chen)
  6. VIOLIN: A Large-Scale Dataset for Video-and-Language Inference
  7. 12-in-1: Multi-Task Vision and Language Representation Learning (Jiasen Lu)

VQA

  1. Counterfactual Vision and Language Learning (Oral)
  2. TA-Student VQA: Multi-Agents Training by Self-Questioning (Oral)
  3. SQuINTing at VQA Models: Introspecting VQA Models With Sub-Questions (Oral)
  4. Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA (Oral)
  5. Hierarchical Conditional Relation Networks for Video Question Answering (Oral)
  6. In Defense of Grid Features for Visual Question Answering
  7. VQA with No Questions-Answers Training
  8. Counterfactual Samples Synthesizing for Robust Visual Question Answering

REC/RES (Referring Expression Comprehension/Segmentation)

  1. Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation (Oral)
  2. Graph-Structured Referring Expression Reasoning in the Wild (Oral)
  3. Visual-textual Capsule Routing for Text-based Video Segmentation (Oral)
  4. Bi-Directional Relationship Inferring Network for Referring Image Segmentation (Huchuan Lu)
  5. Cops-Ref: A New Dataset and Task on Compositional Referring Expression Comprehension (Qi Wu)
  6. Referring Image Segmentation via Cross-Modal Progressive Comprehension (Si Liu)
  7. A Real-Time Cross-Modality Correlation Filtering Method for Referring Expression Comprehension (Si Liu)
  8. PhraseCut: Language-Based Image Segmentation in the Wild (Adobe)

Video Grounding

  1. Dense Regression Network for Video Grounding
  2. Video Object Grounding using Semantic Roles in Language Description (Arka Sadhu)
  3. Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences (Alibaba)
  4. Local-Global Video-Text Interactions for Temporal Grounding
  5. Visual Grounding in Video for Unsupervised Word Translation

Vision-Language Navigation

  1. Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks (Oral)
  2. REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments (Oral; Peter Anderson, Qi Wu, William Yang Wang)
  3. Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training

Visual Dialog

  1. Iterative Context-Aware Graph Inference for Visual Dialog (Oral, Zheng-Jun Zha)
  2. Vision-Dialog Navigation by Exploring Cross-modal Memory
  3. Two Causal Principles for Improving Visual Dialog (Hanwang Zhang)

Scene Graph Generation

  1. Unbiased Scene Graph Generation From Biased Training (Oral, Hanwang Zhang)
  2. GPS-Net: Graph Property Sensing Network for Scene Graph Generation (Oral, Dacheng Tao)
  3. Action Genome: Actions as Compositions of Spatio-temporal Scene Graphs (Fei-Fei Li)

Video-based Action Recognition

  1. SmallBigNet: Integrating Core and Contextual Views for Video Classification (Yu Qiao)
  2. 3DV: 3D Dynamic Voxel for Action Recognition in Depth Video
  3. Video Modeling with Correlation Networks (Facebook AI)
  4. X3D: Expanding Architectures for Efficient Video Recognition (Facebook AI)
  5. Regularization on Spatio-Temporally Smoothed Feature for Action Recognition
  6. Listen to Look: Action Recognition by Previewing Audio
  7. Speech2Action: Cross-modal Supervision for Action Recognition (VGG)
  8. Uncertainty-aware Score Distribution Learning for Action Quality Assessment
  9. FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding (Dahua Lin)
  10. Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks
  11. TEA: Temporal Excitation and Aggregation for Action Recognition
  12. Intra- and Inter-Action Understanding via Temporal Action Parsing (Dahua Lin)
  13. Temporal Pyramid Network for Action Recognition
  14. Multi-Modal Domain Adaptation for Fine-Grained Action Recognition

Skeleton-based Action Recognition

  1. Context Aware Graph Convolution for Skeleton-Based Action Recognition (Dacheng Tao)
  2. PREDICT & CLUSTER: Unsupervised Skeleton Based Action Recognition
  3. Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition (MSRA)
  4. Skeleton-Based Action Recognition with Shift Graph Convolutional Network
  5. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition (Wanli Ouyang)

Action Detection

  1. G-TAD: Sub-Graph Localization for Temporal Action Detection
  2. Learning Temporal Co-Attention Models for Unsupervised Video Action Localization
  3. Weakly-Supervised Action Localization by Generative Attention Modeling
  4. Learning to Discriminate Information for Online Action Detection

Action Segmentation

  1. Action Segmentation with Joint Self-Supervised Temporal Domain Adaptation
  2. SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation
  3. Improving Action Segmentation via Graph Based Temporal Reasoning
  4. Set-Constrained Viterbi for Set-Supervised Action Segmentation

Video Representation

  1. Large Scale Video Representation Learning via Relational Graph Clustering
  2. Screencast Tutorial Video Understanding
  3. Evolving Losses for Unsupervised Video Representation Learning
  4. A Multigrid Method for Efficiently Training Video Models (Kaiming He)

Others

  1. Visual Commonsense R-CNN (Hanwang Zhang)
  2. Straight to the Point: Fast-forwarding Videos via Reinforcement Learning Using Textual Data