CVPR 2024 decisions are now available on OpenReview!
注0:项目来自于 https://github.com/amusi/CVPR2024-Papers-with-Code, 当前项目将原文里的标题用翻译工具转为中文,未做修订,仅作参考
注1:欢迎各位大佬提交issue,分享CVPR 2024论文和开源项目!
注2:关于往年CV顶会论文以及其他优质CV论文和大盘点,详见: https://github.com/amusi/daily-paper-computer-vision
欢迎扫码加入【CVer学术交流群】,这是最大的计算机视觉AI知识星球!每日更新,第一时间分享最新最前沿的计算机视觉、AI绘画、图像处理、深度学习、自动驾驶、医疗影像和AIGC等方向的学习资料,学起来!
- 3DGS(Gaussian Splatting)
- Avatars
- Backbone
- CLIP
- MAE
- Embodied AI
- GAN
- GNN
- 多模态大语言模型(MLLM)
- 大语言模型(LLM)
- NAS
- OCR
- NeRF
- DETR
- Prompt
- 扩散模型(Diffusion Models)
- ReID(重识别)
- 长尾分布(Long-Tail)
- Vision Transformer
- 视觉和语言(Vision-Language)
- 自监督学习(Self-supervised Learning)
- 数据增强(Data Augmentation)
- 目标检测(Object Detection)
- 异常检测(Anomaly Detection)
- 目标跟踪(Visual Tracking)
- 语义分割(Semantic Segmentation)
- 实例分割(Instance Segmentation)
- 全景分割(Panoptic Segmentation)
- 医学图像(Medical Image)
- 医学图像分割(Medical Image Segmentation)
- 视频目标分割(Video Object Segmentation)
- 视频实例分割(Video Instance Segmentation)
- 参考图像分割(Referring Image Segmentation)
- 图像抠图(Image Matting)
- 图像编辑(Image Editing)
- Low-level Vision
- 超分辨率(Super-Resolution)
- 去噪(Denoising)
- 去模糊(Deblur)
- 自动驾驶(Autonomous Driving)
- 3D点云(3D Point Cloud)
- 3D目标检测(3D Object Detection)
- 3D语义分割(3D Semantic Segmentation)
- 3D目标跟踪(3D Object Tracking)
- 3D语义场景补全(3D Semantic Scene Completion)
- 3D配准(3D Registration)
- 3D人体姿态估计(3D Human Pose Estimation)
- 3D人体Mesh估计(3D Human Mesh Estimation)
- 医学图像(Medical Image)
- 图像生成(Image Generation)
- 视频生成(Video Generation)
- 3D生成(3D Generation)
- 视频理解(Video Understanding)
- 行为检测(Action Detection)
- 文本检测(Text Detection)
- 知识蒸馏(Knowledge Distillation)
- 模型剪枝(Model Pruning)
- 图像压缩(Image Compression)
- 三维重建(3D Reconstruction)
- 深度估计(Depth Estimation)
- 轨迹预测(Trajectory Prediction)
- 车道线检测(Lane Detection)
- 图像描述(Image Captioning)
- 视觉问答(Visual Question Answering)
- 手语识别(Sign Language Recognition)
- 视频预测(Video Prediction)
- 新视点合成(Novel View Synthesis)
- Zero-Shot Learning(零样本学习)
- 立体匹配(Stereo Matching)
- 特征匹配(Feature Matching)
- 场景图生成(Scene Graph Generation)
- 隐式神经表示(Implicit Neural Representations)
- 图像质量评价(Image Quality Assessment)
- 视频质量评价(Video Quality Assessment)
- 数据集(Datasets)
- 新任务(New Tasks)
- 其他(Others)
Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering Scaffold-GS:结构化3D高斯函数,用于视图自适应渲染
- Homepage: https://city-super.github.io/scaffold-gs/
- Paper: https://arxiv.org/abs/2312.00109
- Code: https://github.com/city-super/Scaffold-GS
GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis GPS-Gaussian:可泛化的像素级3D高斯分层技术,用于实时生成人类新颖视角合成
- Homepage: https://shunyuanzheng.github.io/GPS-Gaussian
- Paper: https://arxiv.org/abs/2312.02155
- Code: https://github.com/ShunyuanZheng/GPS-Gaussian
GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians 高斯头像:通过可动3D高斯实现从单个视频中生成逼真的人类头像建模
GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting 高斯编辑器:利用高斯喷溅技术实现快速可控的3D编辑
Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction 可变形3D高斯函数用于高保真单目动态场景重建
- Homepage: https://ingra14m.github.io/Deformable-Gaussians/
- Paper: https://arxiv.org/abs/2309.13101
- Code: https://github.com/ingra14m/Deformable-3D-Gaussians
SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes SC-GS:用于可编辑动态场景的稀疏控制高斯喷溅
- Homepage: https://yihua7.github.io/SC-GS-web/
- Paper: https://arxiv.org/abs/2312.14937
- Code: https://github.com/yihua7/SC-GS
Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis 时空高斯特征喷溅技术用于实时动态视图合成
- Homepage: https://oppo-us-research.github.io/SpacetimeGaussians-website/
- Paper: https://arxiv.org/abs/2312.16812
- Code: https://github.com/oppo-us-research/SpacetimeGaussians
DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization DNGaussian:通过全局-局部深度归一化优化稀疏视图3D高斯辐射场
- Homepage: https://fictionarry.github.io/DNGaussian/
- Paper: https://arxiv.org/abs/2403.06912
- Code: https://github.com/Fictionarry/DNGaussian
4D Gaussian Splatting for Real-Time Dynamic Scene Rendering 实时动态场景渲染的4D高斯散斑技术
GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models 高斯梦者:通过连接二维和三维扩散模型实现从文本到3D高斯的快速生成
GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians 高斯头像:通过可动画的3D高斯实现从单个视频到逼真的人像建模
Real-Time Simulated Avatar from Head-Mounted Sensors 实时模拟头部佩戴传感器生成的虚拟形象
- Homepage: https://www.zhengyiluo.com/SimXR/
- Paper: https://arxiv.org/abs/2403.06862
RepViT: Revisiting Mobile CNN From ViT Perspective RepViT:从ViT视角重新审视移动CNN
TransNeXt: Robust Foveal Visual Perception for Vision Transformers TransNeXt:针对视觉Transformer的鲁棒性黄斑视觉感知
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want Alpha-CLIP:一个聚焦于您所想之处的CLIP模型
FairCLIP: Harnessing Fairness in Vision-Language Learning 公平CLIP:在视觉-语言学习中利用公平性
- Paper: https://arxiv.org/abs/2403.19949
- Code: https://github.com/Harvard-Ophthalmology-AI-Lab/FairCLIP
EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI 具身扫描:面向具身人工智能的全方位多模态3D感知套件
- Homepage: https://tai-wang.github.io/embodiedscan/
- Paper: https://arxiv.org/abs/2312.16170
- Code: https://github.com/OpenRobotLab/EmbodiedScan
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception MP5:通过主动感知在Minecraft中的多模态开放式具身系统
- Homepage: https://iranqin.github.io/MP5.github.io/
- Paper: https://arxiv.org/abs/2312.07472
- Code: https://github.com/IranQin/MP5
LEMON: Learning 3D Human-Object Interaction Relation from 2D Images 柠檬:从二维图像中学习3D人-物交互关系
An Empirical Study of Scaling Law for OCR OCR缩放定律的实证研究
- Paper: https://arxiv.org/abs/2401.00028
- Code: https://github.com/large-ocr-model/large-ocr-model.github.io
ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting ODM:一种用于场景文本检测和定位的文本-图像进一步对齐预训练方法
PIE-NeRF🍕: Physics-based Interactive Elastodynamics with NeRF PIE-NeRF🍕:基于物理的交互式弹性动力学与NeRF
DETRs Beat YOLOs on Real-time Object Detection DETR在实时目标检测上击败了YOLOs
Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement 显著性DETR:通过层次显著性过滤精炼增强检测Transformer
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration mPLUG-Owl2:通过模态协作革新多模态大型语言模型
- Paper: https://arxiv.org/abs/2311.04257
- Code: https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2
Link-Context Learning for Multimodal LLMs 多模态LLM的链接上下文学习
- Paper: https://arxiv.org/abs/2308.07891
- Code: https://github.com/isekai-portal/Link-Context-Learning/tree/main
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation OPERA:通过过度信任惩罚和反思-分配缓解多模态大型语言模型中的幻觉
Making Large Multimodal Models Understand Arbitrary Visual Prompts 制作能够理解任意视觉提示的大型多模态模型
- Homepage: https://vip-llava.github.io/
- Paper: https://arxiv.org/abs/2312.00784
Pink: Unveiling the power of referential comprehension for multi-modal llms 粉红色:揭示多模态LLMs中参照理解的力量
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding Chat-UniVi:统一视觉表示通过图像和视频理解赋能大型语言模型
OneLLM: One Framework to Align All Modalities with Language OneLLM:一个框架,将所有模态与语言对齐
VTimeLLM: Empower LLM to Grasp Video Moments VTimeLLM:赋予LLM把握视频瞬间的能力
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification 魔法令牌:为多模态物体重识别选择多样化的令牌
Noisy-Correspondence Learning for Text-to-Image Person Re-identification 文本到图像人物重识别的噪声对应学习
InstanceDiffusion: Instance-level Control for Image Generation 实例扩散:图像生成中的实例级控制
Residual Denoising Diffusion Models 残差去噪扩散模型
DeepCache: Accelerating Diffusion Models for Free DeepCache:免费加速扩散模型
DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations DEADiff:一种具有解耦表示的高效风格扩散模型
-
Homepage: https://tianhao-qi.github.io/DEADiff/
SVGDreamer: Text Guided SVG Generation with Diffusion Model SVGDreamer:基于扩散模型的文本引导SVG生成
InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model 交互式扩散:文本到图像扩散模型的交互控制
MMA-Diffusion: MultiModal Attack on Diffusion Models MMA-Diffusion:对扩散模型的跨模态攻击
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models 视频运动定制:利用时间注意力自适应的文本到视频扩散模型
- Homeoage: https://video-motion-customization.github.io/
- Paper: https://arxiv.org/abs/2312.00845
- Code: https://github.com/HyeonHo99/Video-Motion-Customization
TransNeXt: Robust Foveal Visual Perception for Vision Transformers TransNeXt:为视觉Transformer提供鲁棒的黄斑视觉感知
RepViT: Revisiting Mobile CNN From ViT Perspective RepViT:从ViT视角重新审视移动CNN
A General and Efficient Training for Transformer via Token Expansion 通过词元扩展进行通用且高效的Transformer训练
PromptKD: Unsupervised Prompt Distillation for Vision-Language Models 提示KD:用于视觉-语言模型的无监督提示蒸馏
FairCLIP: Harnessing Fairness in Vision-Language Learning 公平CLIP:在视觉语言学习中利用公平性
- Paper: https://arxiv.org/abs/2403.19949
- Code: https://github.com/Harvard-Ophthalmology-AI-Lab/FairCLIP
DETRs Beat YOLOs on Real-time Object Detection DETRs在实时目标检测方面击败了YOLOs
Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation 利用零样本日夜间域适应增强目标检测
- Paper: https://arxiv.org/abs/2312.01220
- Code: https://github.com/ZPDu/Boosting-Object-Detection-with-Zero-Shot-Day-Night-Domain-Adaptation
YOLO-World: Real-Time Open-Vocabulary Object Detection YOLO-World:实时开放词汇物体检测
Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement 显著性DETR:通过分层显著性滤波优化提升检测Transformer
Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection 开放集监督异常检测中的异常异质性学习
Delving into the Trajectory Long-tail Distribution for Muti-object Tracking 深入探究多目标跟踪中的轨迹长尾分布
- Paper: https://arxiv.org/abs/2403.04700
- Code: https://github.com/chen-si-jia/Trajectory-Long-tail-Distribution-for-MOT
Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation 更强、更少、更优越:利用视觉基础模型实现领域泛化语义分割
SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation 开放词汇语义分割的简单编码器-解码器:SED
Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology 特征再嵌入:迈向计算病理学基础模型级别的性能
VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis VoCo:一种简单而有效的3D医学图像分析体积对比学习框架
ChAda-ViT : Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images ChAda-ViT:异构显微镜图像联合表示学习的通道自适应注意力
UniPAD: A Universal Pre-training Paradigm for Autonomous Driving UniPAD:自动驾驶的通用预训练范式
Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications Cam4DOcc:自动驾驶应用中仅使用摄像头进行4D占用预测的基准测试
Memory-based Adapters for Online 3D Scene Perception 基于内存的在线3D场景感知适配器
Symphonize 3D Semantic Scene Completion with Contextual Instance Queries 将3D语义场景补全与上下文实例查询同步化
A Real-world Large-scale Dataset for Roadside Cooperative Perception 真实世界大规模道路侧协同感知数据集
Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving 单视和多视深度自适应融合用于自动驾驶
Traffic Scene Parsing through the TSP6K Dataset 通过TSP6K数据集进行交通场景解析
PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection PTT:高效时序3D目标检测的点-轨迹变换器
UniMODE: Unified Monocular 3D Object Detection UniMODE:统一单目3D目标检测
Edit One for All: Interactive Batch Image Editing 一键编辑:交互式批量图像编辑
- Homepage: https://thaoshibe.github.io/edit-one-for-all
- Paper: https://arxiv.org/abs/2401.10219
- Code: https://github.com/thaoshibe/edit-one-for-all
MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers MaskINT:通过插值非自回归掩码变换器进行视频编辑
-
Homepage: https://maskint.github.io
Residual Denoising Diffusion Models 残差去噪扩散模型
Boosting Image Restoration via Priors from Pre-trained Models 通过预训练模型先验信息增强图像恢复
SeD: Semantic-Aware Discriminator for Image Super-Resolution SeD:图像超分辨率中的语义感知判别器
APISR: Anime Production Inspired Real-World Anime Super-Resolution APISR:受动画制作启发的现实世界动画超分辨率
Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation 沙漏分词器用于高效基于Transformer的3D人体姿态估计
InstanceDiffusion: Instance-level Control for Image Generation 实例扩散:图像生成中的实例级控制
ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations ECLIPSE:一种高效利用资源的文本到图像生成先验
-
Homepage: https://eclipse-t2i.vercel.app/
Instruct-Imagen: Image Generation with Multi-modal Instruction 指令-图像:多模态指令下的图像生成
Residual Denoising Diffusion Models 残差去噪扩散模型
UniGS: Unified Representation for Image Generation and Segmentation UniGS:图像生成与分割的统一表示
Multi-Instance Generation Controller for Text-to-Image Synthesis 多实例生成控制器,用于文本到图像合成
SVGDreamer: Text Guided SVG Generation with Diffusion Model SVGDreamer:基于扩散模型的文本引导SVG生成
InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model 交互扩散:文本到图像扩散模型的交互控制
Ranni: Taming Text-to-Image Diffusion for Accurate Prompt Following Ranni:驯服文本到图像扩散,实现准确提示跟随
Vlogger: Make Your Dream A Vlog 视频博主:让你的梦想成为一档视频博客
VBench: Comprehensive Benchmark Suite for Video Generative Models VBench:视频生成模型的全面基准测试套件
- Homepage: https://vchitect.github.io/VBench-project/
- Paper: https://arxiv.org/abs/2311.17982
- Code: https://github.com/Vchitect/VBench
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models 视频运动定制:利用时间注意力自适应的文本到视频扩散模型
- Homeoage: https://video-motion-customization.github.io/
- Paper: https://arxiv.org/abs/2312.00845
- Code: https://github.com/HyeonHo99/Video-Motion-Customization
CityDreamer: Compositional Generative Model of Unbounded 3D Cities 城市梦想家:无限3D城市的构图生成模型
- Homepage: https://haozhexie.com/project/city-dreamer/
- Paper: https://arxiv.org/abs/2309.00610
- Code: https://github.com/hzxie/city-dreamer
LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching 清醒梦境者:通过区间得分匹配实现高保真文本到3D生成
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark MVBench:一个全面的跨模态视频理解基准
- Paper: https://arxiv.org/abs/2311.17005
- Code: https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2
Logit Standardization in Knowledge Distillation 知识蒸馏中的Logit标准化
- Paper: https://arxiv.org/abs/2403.01427
- Code: https://github.com/sunshangquan/logit-standardization-KD
Efficient Dataset Distillation via Minimax Diffusion 通过最小-最大扩散进行高效数据集蒸馏
Neural Markov Random Field for Stereo Matching 神经马尔可夫随机场用于立体匹配
HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation HiKER-SGG:层次知识增强鲁棒场景图生成
- Homepage: https://zhangce01.github.io/HiKER-SGG/
- Paper : https://arxiv.org/abs/2403.12033
- Code: https://github.com/zhangce01/HiKER-SGG
KVQ: Kaleidoscope Video Quality Assessment for Short-form Videos KVQ:短视频的万花筒视频质量评估
A Real-world Large-scale Dataset for Roadside Cooperative Perception 现实世界大规模道路侧协同感知数据集
Traffic Scene Parsing through the TSP6K Dataset 通过TSP6K数据集进行交通场景解析
Object Recognition as Next Token Prediction 对象识别作为下一个标记预测
ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks ParameterNet:参数即是所有,用于移动网络大规模视觉预训练
Seamless Human Motion Composition with Blended Positional Encodings 无缝的人体运动合成与混合位置编码
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning LL3DA:用于全3D理解、推理和规划的视觉交互式指令调优
-
Homepage: https://ll3da.github.io/
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update
- Homepage: https://clova-tool.github.io/
- Paper: https://arxiv.org/abs/2312.10908
MoMask: Generative Masked Modeling of 3D Human Motions MoMask:3D人体动作的生成式掩码建模
Amodal Ground Truth and Completion in the Wild
- Homepage: https://www.robots.ox.ac.uk/~vgg/research/amodal/
- Paper: https://arxiv.org/abs/2312.17247
- Code: https://github.com/Championchess/Amodal-Completion-in-the-Wild
Improved Visual Grounding through Self-Consistent Explanations 通过自洽解释提升视觉定位
ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object ImageNet-D:在扩散合成物体上基准测试神经网络鲁棒性
- Homepage: https://chenshuang-zhang.github.io/imagenet_d/
- Paper: https://arxiv.org/abs/2403.18775
- Code: https://github.com/chenshuang-zhang/imagenet_d
Learning from Synthetic Human Group Activities 从合成人类群体活动中学习
- Homepage: https://cjerry1243.github.io/M3Act/
- Paper https://arxiv.org/abs/2306.16772
- Code: https://github.com/cjerry1243/M3Act
A Cross-Subject Brain Decoding Framework 跨学科大脑解码框架
- Homepage: https://littlepure2333.github.io/MindBridge/
- Paper: https://arxiv.org/abs/2404.07850
- Code: https://github.com/littlepure2333/MindBridge
Multi-Task Dense Prediction via Mixture of Low-Rank Experts 通过低秩专家混合的多任务密集预测
Contrastive Mean-Shift Learning for Generalized Category Discovery 对比均值漂移学习用于广义类别发现
- Homepage: https://postech-cvlab.github.io/cms/
- Paper: https://arxiv.org/abs/2404.09451
- Code: https://github.com/sua-choi/CMS