CVPR 2023 Papers and Open-Source Projects Collection (Papers with Code)

A collection of CVPR 2023 papers and open-source projects (papers with code)!

25.78% = 2360 / 9155

CVPR 2023 decisions are now available on OpenReview! This year, we received a record 9155 submissions (a 12% increase over CVPR 2022) and accepted 2360 papers, for a 25.78% acceptance rate.
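
As a quick sanity check on the numbers quoted above, the acceptance rate can be recomputed from the two counts. This is only a minimal sketch; the variable names are illustrative and not part of the original announcement.

```python
# Counts quoted in the CVPR 2023 decision announcement above.
accepted = 2360
submitted = 9155

# Acceptance rate = accepted papers / total submissions.
acceptance_rate = accepted / submitted

# Format as a percentage with two decimal places.
rate_text = f"{acceptance_rate:.2%}"
print(rate_text)
```

The printed value should match the 25.78% figure stated in the announcement.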

Note 1: Everyone is welcome to submit an issue to share CVPR 2023 papers and open-source projects!



[CVPR 2023 Open-Source Papers: Table of Contents]


Integrally Pre-Trained Transformer Pyramid Networks

Stitchable Neural Networks

Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks

BiFormer: Vision Transformer with Bi-Level Routing Attention

DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network

Vision Transformer with Super Token Sampling

Hard Patches Mining for Masked Image Modeling

SMPConv: Self-moving Point Representations for Continuous Convolution


GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

DeltaEdit: Exploring Text-free Training for Text-driven Image Manipulation


Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders

Generic-to-Specific Distillation of Masked Autoencoders


NoPe-NeRF: Optimising Neural Radiance Field with No Pose Prior

Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures

NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis

Panoptic Lifting for 3D Scene Understanding with Neural Fields

NeRFLiX: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-viewpoint MiXer

HNeRV: A Hybrid Neural Representation for Videos


DETRs with Hybrid Matching


Diversity-Aware Meta Visual Prompting


PA&DA: Jointly Sampling PAth and DAta for Consistent NAS


Structured 3D Features for Reconstructing Relightable and Animatable Avatars

Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos


Clothing-Change Feature Augmentation for Person Re-Identification

MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID

Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification

Diffusion Models

Video Probabilistic Diffusion Models in Projected Latent Space

Solving 3D Inverse Problems using Pre-trained 2D Diffusion Models

Imagic: Text-Based Real Image Editing with Diffusion Models

Parallel Diffusion Models of Operator and Image for Blind Inverse Problems

DiffRF: Rendering-guided 3D Radiance Field Diffusion

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

HouseDiffusion: Vector Floorplan Generation via a Diffusion Model with Discrete and Continuous Denoising

TrojDiff: Trojan Attacks on Diffusion Models with Diverse Targets

Back to the Source: Diffusion-Driven Adaptation to Test-Time Corruption

DR2: Diffusion-based Robust Degradation Remover for Blind Face Restoration

Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion

Generative Diffusion Prior for Unified Image Restoration and Enhancement

Conditional Image-to-Video Generation with Latent Flow Diffusion Models


Long-Tailed Visual Recognition via Self-Heterogeneous Integration with Knowledge Excavation

Vision Transformer

Integrally Pre-Trained Transformer Pyramid Networks

Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors

Learning Trajectory-Aware Transformer for Video Super-Resolution

Vision Transformers are Parameter-Efficient Audio-Visual Learners

Where We Are and What We're Looking At: Query Based Worldwide Image Geo-localization Using Hierarchies and Scenes

DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets

DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting

BiFormer: Vision Transformer with Bi-Level Routing Attention

Vision Transformer with Super Token Sampling

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

BAEFormer: Bi-directional and Early Interaction Transformers for Bird’s Eye View Semantic Segmentation

Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention


GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods

Teaching Structured Vision&Language Concepts to Vision&Language Models

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

CapDet: Unifying Dense Captioning and Open-World Detection Pretraining

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding

All in One: Exploring Unified Video-Language Pre-training

Position-guided Text Prompt for Vision Language Pre-training

EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding

Align and Attend: Multimodal Summarization with Dual Contrastive Losses

Multi-Modal Representation Learning with Text-Driven Soft Masks

Learning to Name Classes for Vision and Language Models

Object Detection

YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

DETRs with Hybrid Matching

Enhanced Training of Query-Based Object Detection via Selective Query Recollection

Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection

Object Tracking

Simple Cues Lead to a Strong Multi-Object Tracker

Semantic Segmentation

Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos

FREDOM: Fairness Domain Adaptation Approach to Semantic Scene Understanding

Medical Image Segmentation

Label-Free Liver Tumor Segmentation

Directional Connectivity-based Segmentation of Medical Images

Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation

Devil is in the Queries: Advancing Mask Transformers for Real-world Medical Image Segmentation and Out-of-Distribution Localization

Fair Federated Medical Image Segmentation via Client Contribution Estimation

Ambiguous Medical Image Segmentation using Diffusion Models

Orthogonal Annotation Benefits Barely-supervised Medical Image Segmentation

MagicNet: Semi-Supervised Multi-Organ Segmentation via Magic-Cube Partition and Recovery

MCF: Mutual Correction Framework for Semi-Supervised Medical Image Segmentation

Rethinking Few-Shot Medical Segmentation: A Vector Quantization View

Pseudo-label Guided Contrastive Learning for Semi-supervised Medical Image Segmentation

SDC-UDA: Volumetric Unsupervised Domain Adaptation Framework for Slice-Direction Continuous Cross-Modality Medical Image Segmentation

Video Object Segmentation

Two-shot Video Object Segmentation

Referring Image Segmentation

PolyFormer: Referring Image Segmentation as Sequential Polygon Generation


Physical-World Optical Adversarial Attacks on 3D Face Recognition

IterativePFN: True Iterative Point Cloud Filtering

3D Object Detection

DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets

FrustumFormer: Adaptive Instance-aware Resampling for Multi-view 3D Detection

3D Video Object Detection with Learnable Object-Centric Global Optimization

Hierarchical Supervision and Shuffle Data Augmentation for 3D Semi-Supervised Object Detection

3D Semantic Segmentation

Less is More: Reducing Task and Model Complexity for 3D Point Cloud Semantic Segmentation

3D Semantic Scene Completion

3D Registration

Robust Outlier Rejection for 3D Registration with Variational Bayes

Low-level Vision

Causal-IR: Learning Distortion Invariant Representation for Image Restoration from A Causality Perspective

Burstormer: Burst Image Restoration and Enhancement Transformer

Super-Resolution

Super-Resolution Neural Operator


Learning Trajectory-Aware Transformer for Video Super-Resolution

Image Generation

GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation

Few-shot Semantic Image Synthesis with Class Affinity Transfer

TopNet: Transformer-based Object Placement Network for Image Compositing

Video Generation

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

Conditional Image-to-Video Generation with Latent Flow Diffusion Models

Video Understanding

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

Frame Flexible Network

Masked Motion Encoding for Self-Supervised Video Representation Learning

Action Detection

TriDet: Temporal Action Detection with Relative Boundary Modeling

Text Detection

DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting

Knowledge Distillation

Learning to Retain while Acquiring: Combating Distribution-Shift in Adversarial Data-Free Knowledge Distillation

Generic-to-Specific Distillation of Masked Autoencoders

Model Pruning

DepGraph: Towards Any Structural Pruning

Image Compression

Context-Based Trit-Plane Coding for Progressive Image Compression

Anomaly Detection

Deep Feature In-painting for Unsupervised Anomaly Detection in X-ray Images

3D Reconstruction

OReX: Object Reconstruction from Planar Cross-sections Using Neural Fields

SparsePose: Sparse-View Camera Pose Regression and Refinement

NeuDA: Neural Deformable Anchor for High-Fidelity Implicit Surface Reconstruction

Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition

To fit or not to fit: Model-based Face Reconstruction and Occlusion Segmentation from Weak Supervision

Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction

3D Cinemagraphy from a Single Image

Revisiting Rotation Averaging: Uncertainties and Robust Losses

FFHQ-UV: Normalized Facial UV-Texture Dataset for 3D Face Reconstruction

A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction from In-The-Wild Images

Depth Estimation

Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation

Trajectory Prediction

IPCC-TP: Utilizing Incremental Pearson Correlation Coefficient for Joint Multi-Agent Trajectory Prediction

EqMotion: Equivariant Multi-agent Motion Prediction with Invariant Interaction Reasoning

Lane Detection

Anchor3DLane: Learning to Regress 3D Anchors for Monocular 3D Lane Detection

BEV-LaneDet: An Efficient 3D Lane Detection Based on Virtual Camera via Key-Points

Image Captioning

ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing

Cross-Domain Image Captioning with Discriminative Finetuning

Model-Agnostic Gender Debiased Image Captioning

Visual Question Answering

MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering

Sign Language Recognition

Continuous Sign Language Recognition with Correlation Network



Video Prediction

MOSO: Decomposing MOtion, Scene and Object for Video Prediction

Novel View Synthesis

3D Video Loops from Asynchronous Input

Zero-Shot Learning

Bi-directional Distribution Alignment for Transductive Zero-Shot Learning

Semantic Prompt for Few-Shot Learning

Stereo Matching

Iterative Geometry Encoding Volume for Stereo Matching

Learning the Distribution of Errors in Stereo Matching for Joint Disparity and Uncertainty Estimation

Scene Graph Generation

Prototype-based Embedding Network for Scene Graph Generation

Implicit Neural Representations

Polynomial Implicit Neural Representations For Large Diverse Datasets

Image Quality Assessment

Re-IQA: Unsupervised Learning for Image Quality Assessment in the Wild


Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes

Align and Attend: Multimodal Summarization with Dual Contrastive Losses

GeoNet: Benchmarking Unsupervised Adaptation across Geographies

CelebV-Text: A Large-Scale Facial Text-Video Dataset


Interactive Segmentation as Gaussian Process Classification

Backdoor Attacks Against Deep Image Compression via Adaptive Frequency Trigger

SplineCam: Exact Visualization and Characterization of Deep Network Geometry and Decision Boundaries

SCOTCH and SODA: A Transformer Video Shadow Detection Framework

DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization

RelightableHands: Efficient Neural Relighting of Articulated Hand Models

Token Turing Machines

Single Image Backdoor Inversion via Robust Smoothed Classifiers

To fit or not to fit: Model-based Face Reconstruction and Occlusion Segmentation from Weak Supervision

HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics

A Whac-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others

Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation

Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression

UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy

Disentangling Orthogonal Planes for Indoor Panoramic Room Layout Estimation with Cross-Scale Distortion Awareness

Learning Neural Parametric Head Models

A Meta-Learning Approach to Predicting Performance and Data Requirements

MACARONS: Mapping And Coverage Anticipation with RGB Online Self-Supervision

Masked Images Are Counterfactual Samples for Robust Fine-tuning

HairStep: Transfer Synthetic to Real Using Strand and Depth Maps for Single-View 3D Hair Modeling

Decompose, Adjust, Compose: Effective Normalization by Playing with Frequency for Domain Generalization

Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization

Unlearnable Clusters: Towards Label-agnostic Unlearnable Examples

Where We Are and What We're Looking At: Query Based Worldwide Image Geo-localization Using Hierarchies and Scenes

UniHCP: A Unified Model for Human-Centric Perceptions

CUDA: Convolution-based Unlearnable Datasets

AdaptiveMix: Robust Feature Representation via Shrinking Feature Space

Physical-World Optical Adversarial Attacks on 3D Face Recognition

DPE: Disentanglement of Pose and Expression for General Video Portrait Editing

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

Intrinsic Physical Concepts Discovery with Object-Centric Predictive Models

Sharpness-Aware Gradient Matching for Domain Generalization

Mind the Label-shift for Augmentation-based Graph Out-of-distribution Generalization

Blind Video Deflickering by Neural Filtering with a Flawed Atlas

RiDDLE: Reversible and Diversified De-identification with Latent Encryptor

PoseExaminer: Automated Testing of Out-of-Distribution Robustness in Human Pose and Shape Estimation

Upcycling Models under Domain and Category Shift

Modality-Agnostic Debiasing for Single Domain Generalization

Progressive Open Space Expansion for Open-Set Model Attribution

Dynamic Neural Network for Multi-Task Learning Searching across Diverse Network Topologies

GFPose: Learning 3D Human Pose Prior with Gradient Fields

PRISE: Demystifying Deep Lucas-Kanade with Strongly Star-Convex Constraints for Multimodel Image Alignment

Sketch2Saliency: Learning to Detect Salient Objects from Human Drawings

Boundary Unlearning

ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing

Zero-shot Model Diagnosis

GeoNet: Benchmarking Unsupervised Adaptation across Geographies

Quantum Multi-Model Fitting

DivClust: Controlling Diversity in Deep Clustering

Neural Volumetric Memory for Visual Locomotion Control

MonoHuman: Animatable Human Neural Field from Monocular Video

Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion

Bridging the Gap between Model Explanations in Partially Annotated Multi-label Classification

HyperCUT: Video Sequence from a Single Blurry Image using Unsupervised Ordering

On the Stability-Plasticity Dilemma of Class-Incremental Learning

Defending Against Patch-based Backdoor Attacks on Self-Supervised Learning

VNE: An Effective Method for Improving Deep Representation by Manipulating Eigenvalue Distribution

Detecting and Grounding Multi-Modal Media Manipulation

Meta-causal Learning for Single Domain Generalization

Disentangling Writer and Character Styles for Handwriting Generation