Image-Captioning-with-Transformer

Transformer_Captioning_CS231N.ipynb

Image captioning with the Transformer (Decoder) that I previously studied and implemented, using the dataset and utilities provided by CS231N (2021).

ViT splits an image into patches and applies a Transformer to them. I tried this out of curiosity: if attention is applied between those patches and the words of a caption and then visualized, could we see which patches each word attends to?
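To make the patch idea concrete, here is a minimal NumPy sketch of splitting an image into non-overlapping patches. It is not the notebook's code: the function name `split_into_patches`, the 224x224 input size, and the 16x16 patch size are assumptions chosen for illustration.

```python
import numpy as np

def split_into_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    ordered row by row (left to right, then top to bottom).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one vector.
    return patches.reshape(rows * cols, patch_size * patch_size * c)

image = np.zeros((224, 224, 3), dtype=np.float32)   # assumed input size
patches = split_into_patches(image, patch_size=16)  # assumed patch size
print(patches.shape)  # (196, 768)
```

Each of the 196 rows corresponds to one cell of a 14x14 grid, which is what makes it possible to map an attention weight back to an image region.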

After preprocessing the MS COCO dataset, image captioning and attention visualization using the previously studied and implemented Transformer (Decoder) together with a Vision Transformer (Encoder).
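The caption preprocessing step might look roughly like the sketch below. It is an illustration, not the notebook's code: the helper names, the tokenization rule, and the exact special-token strings are assumptions (the generated captions in this README show (start), (end) and (unk) markers, so similar tokens are used).

```python
import re
from collections import Counter

# Assumed special tokens, modelled on the (start)/(end)/(unk) markers
# visible in the generated captions.
PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"

def tokenize(caption: str) -> list[str]:
    """Lowercase a caption and split it into word and punctuation tokens."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", caption.lower())

def build_vocab(captions: list[str], min_count: int = 1) -> dict[str, int]:
    """Map every token seen at least min_count times to an integer id."""
    counts = Counter(tok for cap in captions for tok in tokenize(cap))
    vocab = {tok: i for i, tok in enumerate([PAD, START, END, UNK])}
    for tok, n in sorted(counts.items()):
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(caption: str, vocab: dict[str, int], max_len: int) -> list[int]:
    """Wrap a caption in START/END, map unknown words to UNK, pad to max_len."""
    tokens = [START] + tokenize(caption)[: max_len - 2] + [END]
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens]
    return ids + [vocab[PAD]] * (max_len - len(ids))

captions = [
    "A person is in mid air on a snowboard.",
    "A woman that is on a tennis court with a racquet.",
]
vocab = build_vocab(captions)
ids = encode("A person is skiing.", vocab, max_len=8)
```

Here "skiing" is not in the vocabulary, so it is mapped to the unknown token, and the sequence is padded to the fixed length of 8.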

GT : A person is in mid air on a snowboard.

GN : (start) the person is (unk) skiing through the mountains . (end)

[Figures: Mean Attention; All Attention (4 heads); per-word attention maps for "the", "person", "skiing", "through", "the", "mountains"]
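The mean and per-head attention views can be derived from the same cross-attention weights. The sketch below assumes the weights are available as a `(num_heads, num_words, num_patches)` array with any class token already removed; the function name `attention_maps`, the 14x14 grid, and the random stand-in weights are hypothetical.

```python
import numpy as np

def attention_maps(attn: np.ndarray, grid_size: int):
    """Turn cross-attention weights into per-word patch-grid maps.

    attn: (num_heads, num_words, num_patches) weights, each row summing to 1.
    Returns (per_head, mean) with shapes
    (num_heads, num_words, grid, grid) and (num_words, grid, grid).
    """
    num_heads, num_words, num_patches = attn.shape
    if num_patches != grid_size * grid_size:
        raise ValueError("num_patches must equal grid_size ** 2")
    # Lay the flat patch axis back out on the 2D patch grid.
    per_head = attn.reshape(num_heads, num_words, grid_size, grid_size)
    # Averaging over heads gives the "mean attention" view.
    mean = per_head.mean(axis=0)
    return per_head, mean

# Random stand-in weights: 4 heads, 6 words, 14 x 14 = 196 patches.
rng = np.random.default_rng(0)
attn = rng.random((4, 6, 196))
attn /= attn.sum(axis=-1, keepdims=True)
per_head, mean = attention_maps(attn, grid_size=14)
```

Because each head's weights sum to 1 over the patches, the head-averaged map for each word also sums to 1.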

GT : A woman that is on a tennis court with a racquet.

GN (Generated Caption) : woman that is on a tennis court with a racquet .

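Because the dataset is very large and the captions are highly varied, I did not split off a val set and instead deliberately overfit the model to observe its behavior. Later, I plan to train on the full training data to reduce overfitting and observe more generalized results.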

For Transformer_Captioning_CS231N.ipynb:

Uses the MS COCO dataset, utilities, and image features provided by CS231N. The model is overfit and shows low accuracy on the val set.

For Image_Captioning_Attention_ViT.ipynb:

Downloaded the MS COCO dataset and preprocessed the captions.
Generated an image feature for each patch with a pretrained ViT.
Trained on the per-patch features and the captions using attention (with a Transformer Decoder).
Visualized which patches each word of the caption attends to.
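The attention between caption words and patch features can be sketched as single-head scaled dot-product cross-attention. This is a NumPy illustration under assumed shapes, not the notebook's implementation: the function name `cross_attention`, the random projection matrices, and the dimensions are all hypothetical.

```python
import numpy as np

def cross_attention(words, patches, w_q, w_k, w_v):
    """Single-head scaled dot-product attention from words to patches.

    words:   (num_words, d_model) decoder states, used as queries
    patches: (num_patches, d_model) image features, used as keys and values
    Returns (output, weights) with shapes
    (num_words, d_head) and (num_words, num_patches).
    """
    q, k, v = words @ w_q, patches @ w_k, patches @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Assumed sizes: 6 words, 196 patches, d_model = 32, d_head = 16.
rng = np.random.default_rng(0)
words = rng.normal(size=(6, 32))
patches = rng.normal(size=(196, 32))
w_q, w_k, w_v = (rng.normal(size=(32, 16)) for _ in range(3))
output, weights = cross_attention(words, patches, w_q, w_k, w_v)
```

Row `i` of `weights` is the distribution over patches for word `i`, which is the quantity a per-word visualization would display.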

(The attention shown here is one of the better cases, and the model does not generalize well yet. The attention lacks detail: individual words do not focus on specific patches. To be improved later.)

Preprocessing:

Train on the full dataset and improve generalization

Visualization that overlays attention on the original image
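One possible way to combine an attention map with the original image is to upsample the coarse patch grid to image resolution and use it to dim the regions with low attention. The sketch below is only an assumption about how this could be done: the function name `overlay_attention`, the `alpha` parameter, and the nearest-neighbour upsampling are illustrative choices.

```python
import numpy as np

def overlay_attention(image: np.ndarray, attn_map: np.ndarray, alpha: float = 0.6) -> np.ndarray:
    """Blend a coarse attention map over an (H, W, 3) image with values in [0, 1].

    attn_map has shape (grid_h, grid_w); H and W must be multiples of the grid.
    """
    h, w, _ = image.shape
    g_h, g_w = attn_map.shape
    if h % g_h or w % g_w:
        raise ValueError("image size must be a multiple of the attention grid")
    # Nearest-neighbour upsampling: repeat each cell over its patch area.
    heat = np.kron(attn_map, np.ones((h // g_h, w // g_w)))
    # Rescale to [0, 1] so the most attended patch keeps full brightness.
    span = heat.max() - heat.min()
    heat = (heat - heat.min()) / span if span > 0 else np.zeros_like(heat)
    # Dim regions with low attention; attended regions stay bright.
    return image * ((1 - alpha) + alpha * heat[..., None])

image = np.ones((224, 224, 3))                        # stand-in for a real image
attn_map = np.arange(196, dtype=float).reshape(14, 14)  # stand-in attention map
blended = overlay_attention(image, attn_map)
```

The result can be displayed with any image viewer; a smoother interpolation than nearest-neighbour would give softer region boundaries.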