Deep Learning for Computer Vision
Overview
The automatic analysis and understanding of images and videos, a field called Computer Vision, is of significant importance in applications including security, healthcare, entertainment, and mobility. The recent success of deep learning methods has revolutionized the field, bringing new developments ever closer to deployments that benefit end users. This course introduces students to traditional computer vision topics before presenting deep learning methods for computer vision. It covers the basics as well as recent advances in these areas, helping students become proficient in applying these methods to real-world applications.
Course content
- Introduction and Overview: Course Overview and Motivation; Introduction to Image Formation, Capture and Representation; Linear Filtering, Correlation, Convolution (3 lectures)
- Visual Features and Representations: Edge, Blobs, Corner Detection; Scale Space and Scale Selection; SIFT, SURF; HoG, LBP, etc. (3 lectures, 1 lab)
- Visual Matching: Bag-of-words, VLAD; RANSAC, Hough transform; Pyramid Matching; Optical Flow (3 lectures, 1 lab)
- Convolutional Neural Networks (CNNs): Introduction to CNNs; Evolution of CNN Architectures: AlexNet, ZFNet, VGG, InceptionNets, ResNets, DenseNets (4 lectures, 1 lab)
- CNNs for Recognition, Verification, Detection, Segmentation: CNNs for Recognition and Verification (Siamese Networks, Triplet Loss, Ranking Loss); CNNs for Detection: R-CNN, Fast R-CNN, YOLO; CNNs for Segmentation: FCN, SegNet, U-Net, Mask-RCNN (9 lectures, 2 labs)
- Recurrent Neural Networks (RNNs): Review of RNNs; CNN + RNN Models for Video Understanding: Spatio-temporal Models, Action/Activity Recognition (4 lectures, 1 lab)
- Attention Models: Introduction to Attention Models in Vision; Vision and Language: Image Captioning, Visual QA, Visual Dialog; Spatial Transformers; Transformer Networks (5 lectures, 2 labs)
- Deep Generative Models: Review of (Popular) Deep Generative Models: GANs, VAEs; Other Generative Models: PixelRNNs, NADE, Normalizing Flows, etc (4 lectures, 1 lab)
- Multi-modal Learning: Representation, Alignment and Generation of Multi-modal data (8 lectures, 2 labs)
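The filtering unit above introduces correlation and convolution. As a taste of that material, here is a minimal NumPy sketch (function names are illustrative, not course code) of valid-mode 2D cross-correlation, with convolution obtained by flipping the kernel:

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the image."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """Convolution = cross-correlation with the kernel flipped in both axes."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

img = np.arange(16, dtype=float).reshape(4, 4)
box = np.ones((3, 3)) / 9.0       # 3x3 box (mean) filter, symmetric
print(convolve2d(img, box))       # [[5. 6.] [9. 10.]] -- local 3x3 means
```

For a symmetric kernel like the box filter, convolution and correlation coincide; the distinction matters for oriented kernels such as derivative filters.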
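The triplet loss named in the recognition/verification unit can be sketched in a few lines. This is a hedged example, not the course's reference implementation: it assumes squared Euclidean distances and a margin of 0.2:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: pull anchor toward positive, push it past negative by `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to same-identity sample
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to different-identity sample
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
easy = triplet_loss(a, np.array([0.1, 0.0]), np.array([1.0, 0.0]))  # negative far away: loss 0
hard = triplet_loss(a, np.array([0.1, 0.0]), np.array([0.2, 0.0]))  # negative too close: loss ~0.17
```

In practice the same hinge is applied to embeddings produced by a shared (Siamese) network and averaged over mined triplets in a batch.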
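The attention unit builds on scaled dot-product attention from Transformer networks. A minimal NumPy sketch, assuming a single head and unbatched inputs:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Computes softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # each query scored against every key
    weights = softmax(scores, axis=-1)  # rows sum to 1
    return weights @ V, weights

Q = K = np.eye(2)
V = np.array([[1.0, 0.0], [0.0, 1.0]])
out, w = scaled_dot_product_attention(Q, K, V)  # each output mixes V rows, weighted toward the matching key
```

The same operation, applied with learned projections and multiple heads over batches, is the core building block of the Transformer architectures covered in this unit.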
Application Areas
- Industrial: Activity recognition, PPE (personal protective equipment) detection, machine failure analysis, vibration analysis, spark detection, etc.
- Medical image analysis: Ophthalmology, Dermatology, Dentistry, Radiology
Grading Scheme
- Two term exams - 45%
- Assignments (two) - 20%
- Paper Presentation - 10%
- Project+Viva - 25%
Textbooks
- Christopher M. Bishop, Hugh Bishop, Deep Learning: Foundations and Concepts: https://www.bishopbook.com/
- Richard Szeliski, Computer Vision: Algorithms and Applications, 2010.
- Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, 2016.
- Michael Nielsen, Neural Networks and Deep Learning, 2016.
- Yoshua Bengio, Learning Deep Architectures for AI, 2009
- Simon Prince, Computer Vision: Models, Learning, and Inference, 2012.
- David Forsyth, Jean Ponce, Computer Vision: A Modern Approach, 2002.
Tutorials
- PyTorch tutorials, University of Amsterdam: https://uvadlc-notebooks.readthedocs.io/en/latest/
- DeepMind Lecture Series: https://www.youtube.com/watch?v=7R52wiUgxZI&list=PLqYmG7hTraZCDxZ44o4p3N5Anz3lLRVZF
- Energy Based Models -- Deep Learning lectures: https://atcold.github.io/NYU-DLSP21/
- Computational Creativity -- https://richradke.github.io/computationalcreativity/
- Image processing -- https://sites.ecse.rpi.edu/~rjradke/improccourse.html
Datasets
- ImageNet: Large-scale image classification dataset
- VQA: Visual Question Answering dataset
- Microsoft COCO: Large-scale image detection, segmentation, and captioning dataset
- LSMDC: Large-Scale Movie Description dataset and challenge
- MadLibs: Visual fill-in-the-blank dataset
- ReferIt: Dataset of visual referring expressions
- VisDial: Visual dialog dataset
- ActivityNet Captions: A large-scale benchmark for dense-captioning of events in video
- VisualGenome: A large-scale image-language dataset that includes region captions, relationships, visual questions, attributes, and more
- VIST: Visual Storytelling dataset
- CLEVR: Compositional Language and Elementary Visual Reasoning dataset
- COMICS: Dataset of annotated comics with visual panels and dialog transcriptions
- Toronto COCO-QA: Toronto question answering dataset
- Text-to-image coreference: Multi-sentence descriptions of RGB-D scenes, with annotations for image-to-text and text-to-text coreference
- MovieQA: Automatic story comprehension dataset from both video and text
- Charades: Large-scale video dataset for activity recognition and commonsense reasoning
- imSitu: Situation recognition dataset with annotations of main activities, participating actors, objects, substances, and locations, and the roles these participants play in the activity
- MIT-SoundNet: Flickr video dataset for cross-modal recognition, including audio and video
Bibliography
Reading List
Medical Imaging
- Med-Gemini: "Capabilities of Gemini Models in Medicine" https://arxiv.org/pdf/2404.18416
Vision Language Pre-training
- Survey Paper: https://arxiv.org/pdf/2210.09263
Deep Learning for Image/Video Restoration and Super-resolution
- Survey Paper:
Semantic Image Segmentation
- Survey Paper: https://arxiv.org/pdf/2302.06378
- Object Segmentation: https://arxiv.org/pdf/2301.07499
Video Summarization
- Survey Paper: https://arxiv.org/pdf/2210.11707
Multi-modal Foundation Models
- Multimodal Foundation Models: From Specialists to General-Purpose Assistants: https://arxiv.org/abs/2309.10020
Computational Photography
- Image Alignment and Stitching survey (2006): https://courses.cs.washington.edu/courses/cse576/05sp/papers/MSR-TR-2004-92.pdf
- Reading List: https://github.com/visionxiang/awesome-computational-photography
Assorted
- Camera Models and Fundamental Concepts Used in Geometric Computer Vision: https://inria.hal.science/inria-00590269/file/sturm-ftcgv-2011.pdf
- Sparse Modeling for Image and Vision Processing: https://arxiv.org/abs/1411.3230
- A Survey of Unsupervised Domain Adaptation for Visual Recognition: https://arxiv.org/abs/2112.06745
- Computer Vision for Autonomous Vehicles: Problems, Datasets and State of the Art: https://arxiv.org/abs/1704.05519
- Towards Better User Studies in Computer Graphics and Vision: https://arxiv.org/abs/2206.11461
CVPR 2024
- VicTR: Video-conditioned Text Representations for Activity Recognition: https://arxiv.org/pdf/2304.02560
- Action-slot: Visual Action-centric Representations for Atomic Activity Recognition in Traffic Scenes: https://hcis-lab.github.io/Action-slot/
- Learning Group Activity Features Through Person Attribute Prediction: https://arxiv.org/pdf/2403.02753
- Group Activity