Week 1 |
VLM benchmark and metric |
Introduction to VLM benchmarks and metrics |
|
|
강재욱 |
youtube1, youtube2 |
Week 2 |
Vision transformer |
ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale |
Google |
2020 Oct |
이인규 |
youtube |
Week 3 |
Dual encoder |
CLIP: Learning Transferable Visual Models From Natural Language Supervision |
OpenAI |
2021 Feb |
김희은 |
youtube |
Week 4 |
Image-text matching |
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers |
MS |
2020 Apr |
신성호 |
youtube |
Week 5 |
Image-text contrastive learning |
ALBEF: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation |
Salesforce |
2021 Jul |
이유경 |
youtube |
Week 6 |
Masked Image Modeling |
BEiT: BERT Pre-Training of Image Transformers |
MS |
2021 Jun |
박민지 |
|
Week 7 |
Masked VLM |
Masked Vision and Language Modeling for Multi-modal Representation Learning |
Amazon |
2022 Aug |
김강민 |
youtube |
Week 8 |
Multimodal fusion by MoE |
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts |
MS |
2021 Nov |
백혜림 |
youtube |
Week 9 |
Multimodal fusion by merged attention |
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision |
Google |
2021 Aug |
정윤성 |
youtube |
Week 10 |
Multimodal fusion by co-attention |
CoCa: Contrastive Captioners are Image-Text Foundation Models |
Google |
2022 May |
김승우 |
youtube |
Week 11 |
Few-shot learning in VLM |
Flamingo: a Visual Language Model for Few-Shot Learning |
DeepMind |
2022 Apr |
조성국 |
youtube |
Week 12 |
Model scaling for VLM 1 |
GIT: A Generative Image-to-text Transformer for Vision and Language |
MS |
2022 May |
김기범 |
youtube |
Week 13 |
Model scaling for VLM 2 |
PaLI: A Jointly-Scaled Multilingual Language-Image Model |
Google |
2022 Sep |
이영수 |
|
Week 14 |
Wrap-up |
Recap of the overall flow |
|
|
강재욱 |
|