Generative and Unsupervised Deep Learning @ KAIST

Course Information

Instructor: Sung Ju Hwang (sjhwang82@kaist.ac.kr)
TAs: Seul Lee (animecult@kaist.ac.kr), Geon Park and Sohyun An

Office: This is an on/offline hybrid course. Building Number 9, Room 9201 (Instructor), 2nd floor (TAs)
Office hours: By appointment only.

Grading Policy

Absolute Grading
Paper Presentation: 25%
Attendance and Participation: 25%
Assignments and Project: 50%

Tentative Schedule

Dates	Topic
2/28	Course Introduction
3/2	Autoencoders and Variational Autoencoders (Lecture)
3/7	Transformers for Language and Vision (Lecture)
3/9	Transformers for Language and Vision (Lecture)
3/14	Self-Supervised Learning (Lecture) Review Due
3/16	Self-Supervised Learning (Lecture)
3/21	Self-Supervised Learning (Presentation)
3/23	Advanced VAEs and GANs (Lecture) Review Due
3/28	Advanced VAEs and GANs (Lecture) Review Due
3/30	Advanced VAEs and GANs (Presentation)
4/4	VAEs and GANs - VQVAE and VQGAN (Lab session), initial proposal due April 2nd
4/6	Autoregressive and Flow-based Models (Lecture)
4/11	Diffusion Models (Lecture) Review Due, Presentation Slides Due (Diffusion Models)
4/13	Diffusion Models (Lecture)
4/18	Diffusion Models (Presentation)
4/20	Mid-term Presentation, Presentation Slides Due (Large Language Models)
4/25	Large Language Models (Lecture) Review Due
4/27	Large Language Models (Presentation) Presentation Slides Due (Multimodal Foundation Models)
5/2	Multimodal Foundation Models (Lecture) Review Due
5/4	Multimodal Foundation Models (Presentation) Presentation Slides Due (Text-to-Image Generation)
5/9	Text-to-Image Generation (Lecture)
5/11	Text-to-Image Generation - LDM (Lab Session)
5/16	Text-to-Image Generation (Presentation) Review Due
5/23	Graph Representation Learning (GNN Basics, Graph SSL) (Lecture) Review Due, Presentation Slides Due (Graph Representation Learning and Generation)
5/25	Graph Generation (Lecture)
5/30	Graph Generation (Presentation)
6/1	Molecular Graph Generation - GDSS, MOOD, DruM (Lab session), Presentation Slides Due (Speech Synthesis)
6/8	Speech Synthesis (Lecture) Review Due
6/13	Speech Synthesis (Presentation) Final report due
6/15	Final Presentation

Reading List

Transformers and Vision Transformers

[Vaswani et al. 17] Attention is All You Need, NeurIPS 2017.
[Beltagy et al. 20] Longformer: The Long-Document Transformer, arXiv preprint, 2020.
[Zaheer et al. 20] Big Bird: Transformers for Longer Sequences, NeurIPS 2020.
[Dosovitskiy et al. 21] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021.
[Touvron et al. 21] Training Data-efficient Image transformers & Distillation through Attention, ICML 2021.
[Tay et al. 21] Synthesizer: Rethinking Self-Attention for Transformer Models, ICML 2021.
[Liu et al. 21] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, ICCV 2021.
[Wu et al. 21] CvT: Introducing Convolutions to Vision Transformers, ICCV 2021.
[Dai et al. 21] CoAtNet: Marrying Convolution and Attention for All Data Sizes, NeurIPS 2021.
[Yang et al. 21] Focal Attention for Long-Range Interactions in Vision Transformers, NeurIPS 2021.
[Rao et al. 21] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification, NeurIPS 2021.
[El-Nouby et al. 21] XCiT: Cross-Covariance Image Transformers, NeurIPS 2021.
[Li et al. 22] MViTv2: Improved Multiscale Vision Transformers for Classification and Detection, CVPR 2022.
[Lee et al. 22] MPViT: Multi-Path Vision Transformer for Dense Prediction, CVPR 2022.

[Lee et al. 23] Sparse Token Transformer with Attention Back Tracking, ICLR 2023.
[Liu et al. 23] Transformers Learn Shortcuts to Automata, ICLR 2023.
[Bolya et al. 23] Token Merging: Your ViT But Faster, ICLR 2023.

Self-Supervised Learning

[Dosovitskiy et al. 14] Discriminative Unsupervised Feature Learning with Convolutional Neural Networks, NIPS 2014.
[Pathak et al. 16] Context Encoders: Feature Learning by Inpainting, CVPR 2016.
[Noroozi and Favaro 16] Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles, ECCV 2016.
[Gidaris et al. 18] Unsupervised Representation Learning by Predicting Image Rotations, ICLR 2018.
[He et al. 20] Momentum Contrast for Unsupervised Visual Representation Learning, CVPR 2020.
[Chen et al. 20] A Simple Framework for Contrastive Learning of Visual Representations, ICML 2020.
[Mikolov et al. 13] Efficient Estimation of Word Representations in Vector Space, ICLR 2013.
[Devlin et al. 19] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019.
[Clark et al. 20] ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, ICLR 2020.
[Hu et al. 20] Strategies for Pre-training Graph Neural Networks, ICLR 2020.
[Chen et al. 20] Generative Pretraining from Pixels, ICML 2020.
[Laskin et al. 20] CURL: Contrastive Unsupervised Representations for Reinforcement Learning, ICML 2020.
[Grill et al. 20] Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, NeurIPS 2020.
[Chen et al. 20] Big Self-Supervised Models are Strong Semi-Supervised Learners, NeurIPS 2020.
[Chen and He. 21] Exploring Simple Siamese Representation Learning, CVPR 2021.
[Tian et al. 21] Understanding Self-Supervised Learning Dynamics without Contrastive Pairs, ICML 2021.
[Caron et al. 21] Emerging Properties in Self-Supervised Vision Transformers, ICCV 2021.
[Bao et al. 22] BEiT: BERT Pre-Training of Image Transformers, ICLR 2022.
[Bardes et al. 22] VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning, ICLR 2022.
[He et al. 22] Masked Autoencoders are Scalable Vision Learners, CVPR 2022.
[Liu et al. 22] Improving Contrastive Learning with Model Augmentation, arXiv preprint, 2022.

[Touvron et al. 22] DeiT III: Revenge of the ViT, ECCV 2022.
[Garrido et al. 23] On the duality between contrastive and non-contrastive self-supervised learning, ICLR 2023.
[Lee et al. 23] Self-Supervised Set Representation Learning for Unsupervised Meta-Learning, ICLR 2023.
[Park et al. 23] What Do Self-Supervised Vision Transformers Learn?, ICLR 2023.

Variational Autoencoders, Autoregressive and Flow-Based Generative Models

[Kingma and Welling 14] Auto-Encoding Variational Bayes, ICLR 2014.
[Sohn et al. 15] Learning Structured Output Representation using Deep Conditional Generative Models, NeurIPS 2015.
[Higgins et al. 17] beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework, ICLR 2017.
[van den Oord et al. 17] Neural Discrete Representation Learning, NeurIPS 2017.
[Razavi et al. 19] Generating Diverse High-Fidelity Images with VQ-VAE-2, NeurIPS 2019.
[Vahdat and Kautz 20] NVAE: A Deep Hierarchical Variational Autoencoder, NeurIPS 2020.
[Rezende and Mohamed 15] Variational Inference with Normalizing Flows, ICML 2015.
[Germain et al. 15] MADE: Masked Autoencoder for Distribution Estimation, ICML 2015.
[Kingma et al. 16] Improved Variational Inference with Inverse Autoregressive Flow, NeurIPS 2016.
[Oord et al. 16] Pixel Recurrent Neural Networks, ICML 2016.
[Oord et al. 16] Conditional Image Generation with PixelCNN Decoders, NeurIPS 2016.
[Dinh et al. 17] Density Estimation Using Real NVP, ICLR 2017.
[Papamakarios et al. 17] Masked Autoregressive Flow for Density Estimation, NIPS 2017.
[Huang et al. 18] Neural Autoregressive Flows, ICML 2018.
[Kingma and Dhariwal 18] Glow: Generative Flow with Invertible 1x1 Convolutions, NeurIPS 2018.
[Ho et al. 19] Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design, ICML 2019.
[Chen et al. 19] Residual Flows for Invertible Generative Modeling, NeurIPS 2019.
[Tran et al. 19] Discrete Flows: Invertible Generative Models of Discrete Data, NeurIPS 2019.
[Ping et al. 20] WaveFlow: A Compact Flow-based Model for Raw Audio, ICML 2020.
[Chang et al. 22] MaskGIT: Masked Generative Image Transformer, CVPR 2022.

[Chen et al. 22] Learning Continuous Normalizing Flows for Faster Convergence to Target Distribution via Ascent Regularizations, ICLR 2023.
[Lipman et al. 23] Flow Matching for Generative Modeling, ICLR 2023.

Generative Adversarial Networks

[Goodfellow et al. 14] Generative Adversarial Nets, NIPS 2014.
[Radford et al. 15] Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, ICLR 2016.
[Chen et al. 16] InfoGAN: Interpreting Representation Learning by Information Maximizing Generative Adversarial Nets, NIPS 2016.
[Arjovsky et al. 17] Wasserstein Generative Adversarial Networks, ICML 2017.
[Zhu et al. 17] Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, ICCV 2017.
[Zhang et al. 17] Adversarial Feature Matching for Text Generation, ICML 2017.
[Karras et al. 18] Progressive Growing of GANs for Improved Quality, Stability, and Variation, ICLR 2018.
[Choi et al. 18] StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation, CVPR 2018.
[Brock et al. 19] Large Scale GAN Training for High-Fidelity Natural Image Synthesis, ICLR 2019.
[Karras et al. 19] A Style-Based Generator Architecture for Generative Adversarial Networks, CVPR 2019.
[Karras et al. 20] Analyzing and Improving the Image Quality of StyleGAN, CVPR 2020.
[Sinha et al. 20] Small-GAN: Speeding up GAN Training using Core-Sets, ICML 2020.
[Karras et al. 20] Training Generative Adversarial Networks with Limited Data, NeurIPS 2020.
[Liu et al. 21] Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis, ICLR 2021.
[Esser et al. 21] Taming Transformers for High-Resolution Image Synthesis, CVPR 2021.
[Hudson and Zitnick 21] Generative Adversarial Transformers, ICML 2021.
[Karras et al. 21] Alias-Free Generative Adversarial Networks, NeurIPS 2021.
[Lin et al. 22] InfinityGAN: Towards Infinite-Pixel Image Synthesis, ICLR 2022.
[Lee et al. 22] ViTGAN: Training GANs with Vision Transformers, ICLR 2022.
[Yu et al. 22] Vector-Quantized Image Modeling with Improved VQGAN, ICLR 2022.

[Huang et al. 22] Masked Generative Adversarial Networks are Data-Efficient Generation Learners, NeurIPS 2022.
[Yang et al. 22] Distilling Representations from GAN Generator via Squeeze and Span, NeurIPS 2022.
[Brooks et al. 22] Generating Long Videos of Dynamic Scenes, NeurIPS 2022.
[Wang et al. 23] Diffusion-GAN: Training GANs with Diffusion, ICLR 2023.

Diffusion Models

[Sohl-Dickstein et al. 15] Deep Unsupervised Learning using Nonequilibrium Thermodynamics, ICML 2015.
[Song and Ermon 19] Generative Modeling by Estimating Gradients of the Data Distribution, NeurIPS 2019.
[Song and Ermon 20] Improved Techniques for Training Score-Based Generative Models, NeurIPS 2020.
[Ho et al. 20] Denoising Diffusion Probabilistic Models, NeurIPS 2020.
[Song et al. 21] Score-Based Generative Modeling through Stochastic Differential Equations, ICLR 2021.
[Nichol and Dhariwal 21] Improved Denoising Diffusion Probabilistic Models, ICML 2021.
[Vahdat et al. 21] Score-based Generative Modeling in Latent Space, NeurIPS 2021.
[Dhariwal and Nichol 21] Diffusion Models Beat GANs on Image Synthesis, NeurIPS 2021.
[De Bortoli et al. 21] Diffusion Schrodinger Bridge with Application to Score-Based Generative Modeling, NeurIPS 2021.
[Ho and Salimans 22] Classifier-Free Diffusion Guidance, arXiv preprint, 2022.
[Dockhorn et al. 22] Score-Based Generative Modeling with Critically-Damped Langevin Diffusion, ICLR 2022.
[Salimans and Ho 22] Progressive Distillation for Fast Sampling of Diffusion Models, ICLR 2022.
[Chen et al. 22] Likelihood Training of Schrodinger Bridge using Forward-Backward SDEs Theory, ICLR 2022.

[Cohen et al. 22] Diffusion bridges vector quantized variational autoencoders, ICML 2022.
[Ho et al. 22] Video Diffusion Models, NeurIPS 2022.
[Chen et al. 23] Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions, ICLR 2023.
[Liu et al. 23] Learning Diffusion Bridges on Constrained Domains, ICLR 2023.
[Chung et al. 23] Diffusion Posterior Sampling for General Noisy Inverse Problems, ICLR 2023.
[Chen et al. 23] Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding, CVPR 2023.
[Song et al. 23] Consistency Models, arXiv preprint 2023.

Large Language Models

[Shoeybi et al. 19] Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, arXiv preprint, 2019.
[Lewis et al. 20] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, ACL 2020.
[Raffel et al. 20] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, JMLR 2020.
[Gururangan et al. 20] Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks, ACL 2020.
[Brown et al. 20] Language Models are Few-shot Learners, NeurIPS 2020.
[Rae et al. 21] Scaling Language Models: Methods, Analysis & Insights from Training Gopher, arXiv preprint, 2021.
[Thoppilan et al. 22] LaMDA: Language Models for Dialog Applications, arXiv preprint, 2022.
[Wei et al. 22] Finetuned Language Models Are Zero-Shot Learners, ICLR 2022.
[Wang et al. 22] Language Modeling via Stochastic Processes, ICLR 2022.
[Alayrac et al. 22] Flamingo: a Visual Language Model for Few-Shot Learning, arXiv preprint, 2022.
[Chowdhery et al. 22] PaLM: Scaling Language Modeling with Pathways, arXiv preprint, 2022.
[Wei et al. 22] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, NeurIPS 2022.
[Touvron et al. 23] LLaMA: Open and Efficient Foundation Language Models, arXiv preprint, 2023.

[Ouyang et al. 22] Training Language Models to Follow Instructions with Human Feedback, NeurIPS 2022.
[Wang et al. 23] Self-Consistency Improves Chain of Thought Reasoning in Language Models, ICLR 2023.
[Rust et al. 23] Language Modelling with Pixels, ICLR 2023.
[Arora et al. 23] Ask Me Anything: A Simple Strategy for Prompting Language Models, ICLR 2023.
[Honovich et al. 22] Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor, arXiv preprint, 2022.
[Wang et al. 22] Self-Instruct: Aligning Language Model with Self Generated Instructions, arXiv preprint, 2022.

Multimodal Foundation Models

[Socher et al. 13] Zero-Shot Learning Through Cross-Modal Transfer, NeurIPS 2013.
[Lu et al. 19] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019.
[Chen et al. 20] UNITER: Universal Image-Text Representation Learning, ECCV 2020.
[Huang et al. 20] Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arXiv preprint 2020.
[Li et al. 20] Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training, AAAI 2020.
[Radford et al. 21] Learning Transferable Visual Models From Natural Language Supervision, ICML 2021.
[Singh et al. 22] FLAVA: A Foundational Language And Vision Alignment Model, CVPR 2022.
[Li et al. 22] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, ICML 2022.
[Baevski et al. 22] data2vec: A General Framework for Self-supervised Learning in Speech, Vision, and Language, ICML 2022.
[Fei et al. 22] Towards artificial general intelligence via a multimodal foundation model, Nature Communications 2022.

[Alayrac et al. 22] Flamingo: a Visual Language Model for Few-shot Learning, NeurIPS 2022.
[Wang et al. 22] Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, arXiv preprint, 2022.
[Reed et al. 22] A Generalist Agent, arXiv preprint, 2022.
[Zeng et al. 23] Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, ICLR 2023.

Text-to-Image Synthesis

[Reed et al. 16] Generative Adversarial Text to Image Synthesis, ICML 2016.
[Li et al. 19] Controllable Text-to-Image Generation, NeurIPS 2019.
[Ramesh et al. 21] Zero-Shot Text-to-Image Generation, ICML 2021.
[Radford et al. 21] Learning Transferable Visual Models From Natural Language Supervision, ICML 2021.
[Ding et al. 21] CogView: Mastering Text-to-Image Generation via Transformers, NeurIPS 2021.
[Zou et al. 22] Towards Language-Free Training for Text-to-Image Generation, CVPR 2022.
[Rombach et al. 22] High-Resolution Image Synthesis with Latent Diffusion Models, CVPR 2022.
[Gu et al. 22] Vector Quantized Diffusion Model for Text-to-Image Synthesis, CVPR 2022.
[Nichol et al. 22] GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, ICML 2022.

[Saharia et al. 22] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, arXiv preprint, 2022.
[Yu et al. 22] Scaling Autoregressive Models for Content-Rich Text-to-Image Generation, arXiv preprint, 2022.
[Gafni et al. 22] Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors, ECCV 2022.
[Chen et al. 23] Re-Imagen: Retrieval-Augmented Text-to-Image Generator, ICLR 2023.
[Poole et al. 23] DreamFusion: Text-to-3D using 2D Diffusion, ICLR 2023.
[Chang et al. 23] Muse: Text-To-Image Generation via Masked Generative Transformers, arXiv preprint, 2023.

Speech Representation Learning and Synthesis

[Oord et al. 16] WaveNet: A Generative Model for Raw Audio, arXiv preprint 2016.
[Baevski et al. 20] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, NeurIPS 2020.
[Tang et al. 22] Unified Speech-Text Pre-training for Speech Translation and Recognition, ACL 2022.
[Wang et al. 17] Tacotron: Towards End-to-End Speech Synthesis, Interspeech 2017.
[Shen et al. 18] Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions, ICASSP 2018.
[Chen et al. 19] Sample Efficient Adaptive Text-to-Speech, ICLR 2019.
[Hsu et al. 19] Hierarchical Generative Modeling for Controllable Speech Synthesis, ICLR 2019.
[Kumar et al. 19] MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis, NeurIPS 2019.
[Kong et al. 20] HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis, NeurIPS 2020.
[Min et al. 21] Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation, ICML 2021.

[Hsu and Shi 22] u-HuBERT: Unified Mixed-Modal Speech Pretraining and Zero-Shot Transfer to Unlabeled Modality, NeurIPS 2022.
[Radford et al. 22] Robust Speech Recognition via Large-Scale Weak Supervision, arXiv preprint 2022.
[Kang et al. 23] Any-Speaker Adaptive Text-To-Speech Synthesis with Diffusion Models, ICASSP 2023.
[Ren et al. 23] Bag of Tricks for Unsupervised Text-to-Speech, ICLR 2023.
[Lee et al. 23] BigVGAN: A Universal Neural Vocoder with Large-Scale Training, ICLR 2023.

Graph Representation Learning and Generation

[Hu et al. 20] Strategies for Pre-training Graph Neural Networks, ICLR 2020.
[You et al. 20] Graph Contrastive Learning with Augmentations, NeurIPS 2020.
[You et al. 21] Graph Contrastive Learning Automated, ICML 2021.
[Xu et al. 21] Self-supervised Graph-level Representation Learning with Local and Global Structure, ICML 2021.
[Thakoor et al. 22] Large-Scale Representation Learning on Graphs via Bootstrapping, ICLR 2022.
[Simonovsky and Komodakis 18] GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders, arXiv preprint 2018.
[You et al. 18] GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models, ICML 2018.
[Liao et al. 19] Efficient Graph Generation with Graph Recurrent Attention Networks, NeurIPS 2019.
[Zhang et al. 19] D-VAE: A Variational Autoencoder for Directed Acyclic Graphs, NeurIPS 2019.
[Niu et al. 20] Permutation Invariant Graph Generation via Score-Based Generative Modeling, AISTATS 2020.
[Guo et al. 22] Data-Efficient Graph Grammar Learning for Molecular Generation, ICLR 2022.
[Jo et al. 22] Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations, ICML 2022.
[Hoogeboom et al. 22] Equivariant Diffusion for Molecule Generation in 3D, ICML 2022.

[Kim et al. 22] Graph Self-Supervised Learning with Accurate Discrepancy Learning, NeurIPS 2022.
[Kim et al. 22] Pure Transformers are Powerful Graph Learners, NeurIPS 2022.
[Vignac et al. 23] DiGress: Discrete Denoising Diffusion for Graph Generation, ICLR 2023.
[Jo et al. 23] Graph Generation with Destination-Driven Diffusion Mixture, arXiv preprint, 2023.

sjhwang82/GenerativeAI