
ICCV 2021 papers and open-source projects collection (papers with code)!

1617 papers accepted - 25.9% acceptance rate

ICCV 2021 accepted paper IDs: https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRfaTmsNweuaA0Gjyu58H_Cx56pGwFhcTYII0u1pg0U7MbhlgY0R6Y-BbK3xFhAiwGZ26u3TAtN5MnS/pubhtml

Note 1: You are welcome to submit an issue to share ICCV 2021 papers and open-source projects!

Note 2: For papers from previous years' top CV conferences, other high-quality CV papers, and roundups, see: https://github.com/amusi/daily-paper-computer-vision

[ICCV 2021 Papers and Open-Source Projects Index]


Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

AutoFormer: Searching Transformers for Visual Recognition

Bias Loss for Mobile Neural Networks

Vision Transformer with Progressive Sampling

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Rethinking Spatial Dimensions of Vision Transformers

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Visual Transformer

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

An Empirical Study of Training Self-Supervised Vision Transformers

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Group-Free 3D Object Detection via Transformers

Spatial-Temporal Transformer for Dynamic Scene Graph Generation

Rethinking and Improving Relative Position Encoding for Vision Transformer

Emerging Properties in Self-Supervised Vision Transformers

Learning Spatio-Temporal Transformer for Visual Tracking

Fast Convergence of DETR with Spatially Modulated Co-Attention

Vision Transformer with Progressive Sampling

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Rethinking Spatial Dimensions of Vision Transformers


Labels4Free: Unsupervised Segmentation using StyleGAN

GNeRF: GAN-based Neural Radiance Field without Posed Camera

EigenGAN: Layer-Wise Eigen-Learning for GANs

From Continuity to Editability: Inverting GANs with Consecutive Images

Sketch Your Own GAN


AutoFormer: Searching Transformers for Visual Recognition


GNeRF: GAN-based Neural Radiance Field without Posed Camera

KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs

In-Place Scene Labelling and Understanding with Implicit Scene Representation

Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis


Rank & Sort Loss for Object Detection and Instance Segmentation

Bias Loss for Mobile Neural Networks

Zero-Shot Learning

FREE: Feature Refinement for Generalized Zero-Shot Learning

Few-Shot Learning

Few-Shot and Continual Learning with Attentive Independent Mechanisms


Parametric Contrastive Learning

Vision and Language

VLGrammar: Grounded Grammar Induction of Vision and Language


An Empirical Study of Training Self-Supervised Vision Transformers

DetCo: Unsupervised Contrastive Learning for Object Detection

Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization

Multi-Label Image Recognition

Residual Attention: A Simple but Effective Method for Multi-Label Recognition

2D Object Detection

DetCo: Unsupervised Contrastive Learning for Object Detection

Detecting Invisible People

Active Learning for Deep Object Detection via Probabilistic Modeling

Conditional Variational Capsule Network for Open Set Recognition

MDETR: Modulated Detection for End-to-End Multi-Modal Understanding

Rank & Sort Loss for Object Detection and Instance Segmentation

SimROD: A Simple Adaptation Method for Robust Object Detection

GraphFPN: Graph Feature Pyramid Network for Object Detection

Fast Convergence of DETR with Spatially Modulated Co-Attention


End-to-End Semi-Supervised Object Detection with Soft Teacher

Semantic Segmentation

Personalized Image Semantic Segmentation

Standardized Max Logits: A Simple yet Effective Approach for Identifying Unexpected Road Obstacles in Urban-Scene Segmentation

Semi-supervised Semantic Segmentation

Leveraging Auxiliary Tasks with Affinity Learning for Weakly Supervised Semantic Segmentation

Re-distributing Biased Pseudo Labels for Semi-supervised Semantic Segmentation: A Baseline Investigation

Unsupervised Segmentation

Labels4Free: Unsupervised Segmentation using StyleGAN

Instance Segmentation

Instances as Queries

Crossover Learning for Fast Online Video Instance Segmentation

Rank & Sort Loss for Object Detection and Instance Segmentation

Medical Image Segmentation

Recurrent Mask Refinement for Few-Shot Medical Image Segmentation

Few-shot Segmentation

Mining Latent Classes for Few-shot Segmentation

Human Motion Segmentation

Graph Constrained Data Representation Learning for Human Motion Segmentation

Object Tracking

Learning Spatio-Temporal Transformer for Visual Tracking

Learning to Adversarially Blur Visual Object Tracking

HiFT: Hierarchical Feature Transformer for Aerial Tracking

Learn to Match: Automatic Matching Network Design for Visual Tracking

3D Point Cloud

Unsupervised Point Cloud Pre-Training via View-Point Occlusion, Completion

Point Cloud Object Detection

Group-Free 3D Object Detection via Transformers

Point Cloud Semantic Segmentation

ReDAL: Region-based and Diversity-aware Active Learning for Point Cloud Semantic Segmentation

Learning with Noisy Labels for Robust Point Cloud Segmentation

VMNet: Voxel-Mesh Network for Geodesic-Aware 3D Semantic Segmentation

Sparse-to-dense Feature Matching: Intra and Inter domain Cross-modal Learning in Domain Adaptation for 3D Semantic Segmentation

Point Cloud Instance Segmentation

Hierarchical Aggregation for 3D Instance Segmentation

Point Cloud Denoising

Score-Based Point Cloud Denoising

Point Cloud Registration

HRegNet: A Hierarchical Network for Large-scale Outdoor LiDAR Point Cloud Registration


Learning for Scale-Arbitrary Super-Resolution from Scale-Specific Networks

Video Frame Interpolation

XVFI: eXtreme Video Frame Interpolation

Person Re-identification

TransReID: Transformer-based Object Re-Identification

IDM: An Intermediate Domain Module for Domain Adaptive Person Re-ID

2D/3D Human Pose Estimation

2D Human Pose Estimation

Human Pose Regression with Residual Log-likelihood Estimation

Online Knowledge Distillation for Efficient Pose Estimation

3D Human Pose Estimation

Probabilistic Monocular 3D Human Pose Estimation with Normalizing Flows

3D Head Reconstruction

H3D-Net: Few-Shot High-Fidelity 3D Head Reconstruction

Action Recognition

MGSampler: An Explainable Sampling Strategy for Video Action Recognition

Channel-wise Topology Refinement Graph Convolution for Skeleton-Based Action Recognition

Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization

Temporal Action Localization

Enriching Local and Global Contexts for Temporal Action Localization

Text Detection

Adaptive Boundary Proposal Network for Arbitrary Shape Text Detection

Text Recognition

Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition

Visual Question Answering (VQA)

Greedy Gradient Ensemble for Robust Visual Question Answering

Adversarial Attack

Feature Importance-aware Transferable Adversarial Attacks

Depth Estimation


MonoIndoor: Towards Good Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments

Gaze Estimation

Generalizing Gaze Estimation with Outlier-guided Collaborative Adaptation

Crowd Counting

Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework

Uniformity in Heterogeneity: Diving Deep into Count Interval Partition for Crowd Counting

Trajectory Prediction

Human Trajectory Prediction via Counterfactual Analysis

Personalized Trajectory Prediction via Distribution Discrimination

Anomaly Detection

Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning

Scene Graph Generation

Spatial-Temporal Transformer for Dynamic Scene Graph Generation

Image Editing

Sketch Your Own GAN

Unsupervised Domain Adaptation

Recursively Conditional Gaussian for Ordinal Unsupervised Domain Adaptation

Video Rescaling

Self-Conditioned Probabilistic Learning of Video Rescaling

Hand-Object Interaction

Learning a Contact Potential Field to Model the Hand-Object Interaction


XVFI: eXtreme Video Frame Interpolation

Personalized Image Semantic Segmentation

H3D-Net: Few-Shot High-Fidelity 3D Head Reconstruction


Out-of-Core Surface Reconstruction via Global TGV Minimization

Progressive Correspondence Pruning by Consensus Learning


Energy-Based Open-World Uncertainty Modeling for Confidence Calibration

Generalized Shuffled Linear Regression

Discovering 3D Parts from Image Collections

Semi-Supervised Active Learning with Temporal Output Discrepancy

Why Approximate Matrix Square Root Outperforms Accurate SVD in Global Covariance Pooling?

Paper: https://arxiv.org/abs/2105.02498

Code: https://github.com/KingJamesSong/DifferentiableSVD

Hand-Object Contact Consistency Reasoning for Human Grasps Generation

Equivariant Imaging: Learning Beyond the Range Space

Just Ask: Learning to Answer Questions from Millions of Narrated Videos