CVPR2024-Papers-with-Code

CVPR 2024 Papers and Open-Source Projects Collection (Chinese-translated titles edition)

CVPR 2024 Papers and Open-Source Projects Collection (Papers with Code)

CVPR 2024 decisions are now available on OpenReview!

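Note 0: This project is derived from https://github.com/amusi/CVPR2024-Papers-with-Code. The titles from the original were converted to Chinese with a translation tool and have not been revised; they are for reference only.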

Note 1: Issues sharing CVPR 2024 papers and open-source projects are welcome!

Note 2: For papers from top CV conferences in previous years, other high-quality CV papers, and roundups, see: https://github.com/amusi/daily-paper-computer-vision

[CVPR 2024 Open-Source Paper Index]

3DGS (Gaussian Splatting)

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering Scaffold-GS:结构化3D高斯函数,用于视图自适应渲染

GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis GPS-Gaussian:可泛化的像素级3D高斯分层技术,用于实时生成人类新颖视角合成

GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians 高斯头像:通过可动3D高斯实现从单个视频中生成逼真的人类头像建模

GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting 高斯编辑器:利用高斯喷溅技术实现快速可控的3D编辑

Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction 可变形3D高斯函数用于高保真单目动态场景重建

SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes SC-GS:用于可编辑动态场景的稀疏控制高斯喷溅

Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis 时空高斯特征喷溅技术用于实时动态视图合成

DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization DNGaussian:通过全局-局部深度归一化优化稀疏视图3D高斯辐射场

4D Gaussian Splatting for Real-Time Dynamic Scene Rendering 实时动态场景渲染的4D高斯散斑技术

GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models 高斯梦者:通过连接二维和三维扩散模型实现从文本到3D高斯的快速生成

Avatars

GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians 高斯头像:通过可动画的3D高斯实现从单个视频到逼真的人像建模

Real-Time Simulated Avatar from Head-Mounted Sensors 实时模拟头部佩戴传感器生成的虚拟形象

Backbone

RepViT: Revisiting Mobile CNN From ViT Perspective RepViT:从ViT视角重新审视移动CNN

TransNeXt: Robust Foveal Visual Perception for Vision Transformers TransNeXt:针对视觉Transformer的鲁棒性黄斑视觉感知

CLIP

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want Alpha-CLIP:一个聚焦于您所想之处的CLIP模型

FairCLIP: Harnessing Fairness in Vision-Language Learning 公平CLIP:在视觉-语言学习中利用公平性

MAE

Embodied AI

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI 具身扫描:面向具身人工智能的全方位多模态3D感知套件

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception MP5:通过主动感知在Minecraft中的多模态开放式具身系统

LEMON: Learning 3D Human-Object Interaction Relation from 2D Images 柠檬:从二维图像中学习3D人-物交互关系

GAN

OCR

An Empirical Study of Scaling Law for OCR OCR缩放定律的实证研究

ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting ODM:一种用于场景文本检测和定位的文本-图像进一步对齐预训练方法

NeRF

PIE-NeRF🍕: Physics-based Interactive Elastodynamics with NeRF PIE-NeRF🍕:基于物理的交互式弹性动力学与NeRF

DETR

DETRs Beat YOLOs on Real-time Object Detection DETR在实时目标检测上击败了YOLOs

Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement 显著性DETR:通过层次显著性过滤精炼增强检测Transformer

Prompt

Multimodal Large Language Models (MLLM)

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration mPLUG-Owl2:通过模态协作革新多模态大型语言模型

Link-Context Learning for Multimodal LLMs 多模态LLM的链接上下文学习

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation OPERA:通过过度信任惩罚和反思-分配缓解多模态大型语言模型中的幻觉

Making Large Multimodal Models Understand Arbitrary Visual Prompts 制作能够理解任意视觉提示的大型多模态模型

Pink: Unveiling the power of referential comprehension for multi-modal llms 粉红色:揭示多模态LLMs中参照理解的力量

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding Chat-UniVi:统一视觉表示通过图像和视频理解赋能大型语言模型

OneLLM: One Framework to Align All Modalities with Language OneLLM:一个框架,将所有模态与语言对齐

Large Language Models (LLM)

VTimeLLM: Empower LLM to Grasp Video Moments VTimeLLM:赋予LLM把握视频瞬间的能力

NAS

ReID (Re-Identification)

Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification 魔法令牌:为多模态物体重识别选择多样化的令牌

Noisy-Correspondence Learning for Text-to-Image Person Re-identification 文本到图像人物重识别的噪声对应学习

Diffusion Models

InstanceDiffusion: Instance-level Control for Image Generation 实例扩散:图像生成中的实例级控制

Residual Denoising Diffusion Models 残差去噪扩散模型

DeepCache: Accelerating Diffusion Models for Free DeepCache:免费加速扩散模型

DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations DEADiff:一种具有解耦表示的高效风格扩散模型

SVGDreamer: Text Guided SVG Generation with Diffusion Model SVGDreamer:基于扩散模型的文本引导SVG生成

InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model 交互式扩散:文本到图像扩散模型的交互控制

MMA-Diffusion: MultiModal Attack on Diffusion Models MMA-Diffusion:对扩散模型的跨模态攻击

VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models 视频运动定制:利用时间注意力自适应的文本到视频扩散模型

Vision Transformer

TransNeXt: Robust Foveal Visual Perception for Vision Transformers TransNeXt:为视觉Transformer提供鲁棒的黄斑视觉感知

RepViT: Revisiting Mobile CNN From ViT Perspective RepViT:从ViT视角重新审视移动CNN

A General and Efficient Training for Transformer via Token Expansion 通过词元扩展进行通用且高效的Transformer训练

Vision-Language

PromptKD: Unsupervised Prompt Distillation for Vision-Language Models 提示KD:用于视觉-语言模型的无监督提示蒸馏

FairCLIP: Harnessing Fairness in Vision-Language Learning 公平CLIP:在视觉语言学习中利用公平性

Object Detection

DETRs Beat YOLOs on Real-time Object Detection DETRs在实时目标检测方面击败了YOLOs

Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation 利用零样本日夜间域适应增强目标检测

YOLO-World: Real-Time Open-Vocabulary Object Detection YOLO-World:实时开放词汇物体检测

Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement 显著性DETR:通过分层显著性滤波优化提升检测Transformer

Anomaly Detection

Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection 开放集监督异常检测中的异常异质性学习

Object Tracking

Delving into the Trajectory Long-tail Distribution for Muti-object Tracking 深入探究多目标跟踪中的轨迹长尾分布

Semantic Segmentation

Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation 更强、更少、更优越:利用视觉基础模型实现领域泛化语义分割

SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation 开放词汇语义分割的简单编码器-解码器:SED

Medical Image

Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology 特征再嵌入:迈向计算病理学基础模型级别的性能

VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis VoCo:一种简单而有效的3D医学图像分析体积对比学习框架

ChAda-ViT: Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images ChAda-ViT:异构显微镜图像联合表示学习的通道自适应注意力

Medical Image Segmentation

Autonomous Driving

UniPAD: A Universal Pre-training Paradigm for Autonomous Driving UniPAD:自动驾驶的通用预训练范式

Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications Cam4DOcc:自动驾驶应用中仅使用摄像头进行4D占用预测的基准测试

Memory-based Adapters for Online 3D Scene Perception 基于内存的在线3D场景感知适配器

Symphonize 3D Semantic Scene Completion with Contextual Instance Queries 将3D语义场景补全与上下文实例查询同步化

A Real-world Large-scale Dataset for Roadside Cooperative Perception 真实世界大规模道路侧协同感知数据集

Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving 单视和多视深度自适应融合用于自动驾驶

Traffic Scene Parsing through the TSP6K Dataset 通过TSP6K数据集进行交通场景解析

3D Point Cloud

3D Object Detection

PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection PTT:高效时序3D目标检测的点-轨迹变换器

UniMODE: Unified Monocular 3D Object Detection UniMODE:统一单目3D目标检测

3D Semantic Segmentation

Image Editing

Edit One for All: Interactive Batch Image Editing 一键编辑:交互式批量图像编辑

Video Editing

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers MaskINT:通过插值非自回归掩码变换器进行视频编辑

Low-level Vision

Residual Denoising Diffusion Models 残差去噪扩散模型

Boosting Image Restoration via Priors from Pre-trained Models 通过预训练模型先验信息增强图像恢复

Super-Resolution

SeD: Semantic-Aware Discriminator for Image Super-Resolution SeD:图像超分辨率中的语义感知判别器

APISR: Anime Production Inspired Real-World Anime Super-Resolution APISR:受动画制作启发的现实世界动画超分辨率

Denoising

Image Denoising

3D Human Pose Estimation

Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation 沙漏分词器用于高效基于Transformer的3D人体姿态估计

Image Generation

InstanceDiffusion: Instance-level Control for Image Generation 实例扩散:图像生成中的实例级控制

ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations ECLIPSE:一种高效利用资源的文本到图像生成先验

Instruct-Imagen: Image Generation with Multi-modal Instruction 指令-图像:多模态指令下的图像生成

Residual Denoising Diffusion Models 残差去噪扩散模型

UniGS: Unified Representation for Image Generation and Segmentation UniGS:图像生成与分割的统一表示

Multi-Instance Generation Controller for Text-to-Image Synthesis 多实例生成控制器,用于文本到图像合成

SVGDreamer: Text Guided SVG Generation with Diffusion Model SVGDreamer:基于扩散模型的文本引导SVG生成

InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model 交互扩散:文本到图像扩散模型的交互控制

Ranni: Taming Text-to-Image Diffusion for Accurate Prompt Following Ranni:驯服文本到图像扩散,实现准确提示跟随

Video Generation

Vlogger: Make Your Dream A Vlog 视频博主:让你的梦想成为一档视频博客

VBench: Comprehensive Benchmark Suite for Video Generative Models VBench:视频生成模型的全面基准测试套件

VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models 视频运动定制:利用时间注意力自适应的文本到视频扩散模型

3D Generation

CityDreamer: Compositional Generative Model of Unbounded 3D Cities 城市梦想家:无限3D城市的构图生成模型

LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching 清醒梦境者:通过区间得分匹配实现高保真文本到3D生成

Video Understanding

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark MVBench:一个全面的跨模态视频理解基准

Knowledge Distillation

Logit Standardization in Knowledge Distillation 知识蒸馏中的Logit标准化

Efficient Dataset Distillation via Minimax Diffusion 通过最小-最大扩散进行高效数据集蒸馏

Stereo Matching

Neural Markov Random Field for Stereo Matching 神经马尔可夫随机场用于立体匹配

Scene Graph Generation

HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation HiKER-SGG:层次知识增强鲁棒场景图生成

Video Quality Assessment

KVQ: Kaleidoscope Video Quality Assessment for Short-form Videos KVQ:短视频的万花筒视频质量评估

Datasets

A Real-world Large-scale Dataset for Roadside Cooperative Perception 现实世界大规模道路侧协同感知数据集

Traffic Scene Parsing through the TSP6K Dataset 通过TSP6K数据集进行交通场景解析

Others

Object Recognition as Next Token Prediction 对象识别作为下一个标记预测

ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks ParameterNet:参数即是所有,用于移动网络大规模视觉预训练

Seamless Human Motion Composition with Blended Positional Encodings 无缝的人体运动合成与混合位置编码

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning LL3DA:用于全3D理解、推理和规划的视觉交互式指令调优

CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update

MoMask: Generative Masked Modeling of 3D Human Motions MoMask:3D人体动作的生成式掩码建模

Amodal Ground Truth and Completion in the Wild

Improved Visual Grounding through Self-Consistent Explanations 通过自洽解释提升视觉定位

ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object ImageNet-D:在扩散合成物体上基准测试神经网络鲁棒性

Learning from Synthetic Human Group Activities 从合成人类群体活动中学习

A Cross-Subject Brain Decoding Framework 跨学科大脑解码框架

Multi-Task Dense Prediction via Mixture of Low-Rank Experts 通过低秩专家混合的多任务密集预测

Contrastive Mean-Shift Learning for Generalized Category Discovery 对比均值漂移学习用于广义类别发现