Dual Attention Networks for Multimodal Reasoning and Matching

Question

chullhwan-song opened this issue 6 years ago · 2 comments

https://arxiv.org/pdf/1611.00471
CVPR 2017

Answer 1 · 2018-07-18T01:44:06.000Z

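What?
- Dual Attention Networks (DANs)
  - r-DAN: multimodal reasoning (VQA)
  - m-DAN: multimodal matching, i.e. the similarity between images (more precisely, their attended regions) and sentences.
    - Questions raised by looking at Fig. 1 alone:
      - Do the attention regions split apart as in the figure below, each one matching a word?
      - If so, at which state does the region (attention) matching a word appear?
      - Or, as the states progress, is a similarity computed each time, with the location most probably linked to each word marked that way?
      - Do the square states in r-DAN and the round states in m-DAN have any special meaning?
    - How is the image attention region visualized?
      - If the CNN feature map is 7x7, the attention map is resized to the image size and displayed as a product with the actual image.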
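The visualization step in the last bullet can be sketched in a few lines. This is only an illustration under assumptions: the function names are made up, the image is a grayscale nested list, and nearest-neighbour upsampling stands in for whatever interpolation the authors actually used (probably something smoother).

```python
def upsample_nearest(att, out_h, out_w):
    """Resize a small 2-D attention map to (out_h, out_w) by nearest neighbour."""
    in_h, in_w = len(att), len(att[0])
    return [[att[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def apply_attention(image, att):
    """Scale every pixel by the upsampled attention weight at that location."""
    h, w = len(image), len(image[0])
    big = upsample_nearest(att, h, w)
    # Normalise so the most attended region keeps its full brightness.
    peak = max(max(row) for row in big) or 1.0
    return [[image[y][x] * big[y][x] / peak for x in range(w)] for y in range(h)]


# Toy example: a 2x2 attention map over a 4x4 all-white image.
attention = [[0.1, 0.2], [0.3, 0.4]]
image = [[255.0] * 4 for _ in range(4)]
masked = apply_attention(image, attention)
```

Regions with low attention come out dark and the most attended region stays bright, which is the effect seen in the paper's attention figures.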
Attention Mechanisms: models that focus on the necessary parts of the visual or textual inputs at each step of a task.
- VQA
- Image vs Text Matching
  - attends to the shared concepts between images and sentences.
    - cross-modal similarity: a single inner product operation. (Is this a product of the word and visual features?)
  - Similarity cannot be computed directly between features extracted from spaces of different natures, so this line of research embeds the two features into a common space and computes the similarity there.
Dual Attention Networks (DANs)
- Input Representation
  - Image representation : 448x448
    - the last pooling layer of 19-layer VGGNet(pool5) or 152-layer ResNet(res5c)
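    - The final output is a set of feature vectors v1..vn; each has, for example, 512 (VGG) or 2048 (ResNet) dimensions.
      - These correspond to n regions. (??)
        
        The feature map should be 3-dimensional, though. Does each region j (< n) hold a kind of row? Something like 512 (= w x h) x Row?
        
        It means each of v1..vn is a vector; that is, v1 is a 512-dimensional vector.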
  - Text representation
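    - Uses bidirectional LSTMs.
  - Is the embedding obtained through this process then used as the real-valued input?
  - one-hot encoding of T input words:
    - x_t = M w_t is the embedded vector, where M is the embedding matrix.
      - h(f) is the forward LSTM state, h(b) the backward LSTM state, and t indexes the time step.
        
        As in the figure above, the output at each step is the two states added together (a sum rather than a concatenation).
        
        This output is ultimately the text feature vector used by the DAN.
    - In practice, this pipeline (the word embedding matrix and the LSTMs) is not trained piece by piece but end to end.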
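To make the shapes in this section concrete, here is a minimal pure-Python sketch. It assumes the CNN output is a (channels, height, width) feature map, so that each spatial location becomes one region vector whose length is the channel count; that would answer the "it should be 3-dimensional" question above. The function names and toy sizes are hypothetical, and the LSTM itself is not implemented: the forward and backward states are simply given as inputs.

```python
def to_region_vectors(feature_map):
    """Turn a (C, H, W) nested list into H*W region vectors of length C."""
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [[feature_map[c][y][x] for c in range(channels)]
            for y in range(height) for x in range(width)]


def embed(one_hot, embedding):
    """x_t = M w_t: multiplying M by a one-hot vector selects one column of M."""
    return [sum(row[j] * one_hot[j] for j in range(len(one_hot)))
            for row in embedding]


def combine_directions(forward, backward):
    """Text feature per time step: forward state plus backward state."""
    return [[f + b for f, b in zip(hf, hb)] for hf, hb in zip(forward, backward)]


# Toy sizes: C=3 channels on a 2x2 grid (VGG pool5 would have C=512).
fmap = [[[c * 10 + y * 2 + x for x in range(2)] for y in range(2)]
        for c in range(3)]
regions = to_region_vectors(fmap)      # 4 regions, each a 3-dim vector

M = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # 2-dim embeddings, vocabulary of 3
x_t = embed([0, 1, 0], M)               # picks the second column of M

u = combine_directions([[1.0, 2.0]], [[0.5, 0.5]])  # one time step
```

With a real framework the reshaping and the embedding lookup are one-liners; the explicit loops here are only meant to show which axis becomes the region index.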
- Attention Mechanisms
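  - Visual Attention: attends to certain parts of the input image
    - V(k) is the visual context vector at step k. (At k=0, is it the VGG or ResNet feature?)
    - M(k-1) is the memory vector holding the attention information up to step k-1.
      - soft attention mechanism
      - weighted average of input feature vectors -> is this the mean of V?
        
        Somewhat abruptly in the flow of the paper, the explanation of Eq. 3 (directly above) gives way to Eqs. 4-6.
        
        That is, it seems to be explaining Eq. 3 and then suddenly explains the attention weights of Eq. 5.
        
        Eq. 3 and Eq. 6 appear to be equivalent.
        
        To obtain V(k), Eqs. 4-6 are applied in the order 4 -> 5 -> 6.
        
        Eq. 4: hidden state
        
        Eq. 5: attention weights
        
        These come from a 2-layer feed-forward neural network (FNN) (Eq. 4) followed by a softmax (Eq. 5).
        
        The result is the embedded visual context vector. Should the extra weight be described as what makes it compatible with the space of the textual context vectors?
        
        I asked the author directly: it is learned. Eq. 6 is a neural network layer, so it is that layer's weight.
    - the visual context vector
      : the embedded visual context vector, compatible with the space of the textual context vectors.
    - Jumping ahead to Fig. 3, Eqs. 3-6 above seem to describe the visual attention part (red box) of that figure.
      - Looking only at the visual attention part, it takes m and v as inputs, so this reading seems right.
    - Above, M is described as a weighted average of input feature vectors, but judging from the paper's
      Figure 3 (directly above) and Figure 4 (not covered yet), the memory vector seems to receive not only visual features (Fig. 4, m-DAN) but also a mix of visual and text features (Fig. 3, embedding, r-DAN, ?).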
  - Textual Attention
    - Similar to visual attention.
      It also takes a memory vector as input.
      - Likewise, the attention weights are learned through a 2-layer FNN (Eqs. 8-9).
      - However, the context vector is obtained simply as a weighted average.
        
        Visual attention adds one more layer at this point; textual attention does not.
      - Likewise, there are network parameters (w)
      - and a hidden state.
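Both attention branches follow the same recipe (hidden state, then softmax weights, then weighted average), so one function can sketch them. The gating form used for the hidden state, an element-wise product of two tanh projections, is my reading of Eq. 4 and should be checked against the paper; the names `attend`, `W_f`, `W_m`, `w_h` and `P` are hypothetical. Passing `P` adds the extra learned layer that only the visual branch has.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, memory, W_f, W_m, w_h, P=None):
    """One attention step over `features`, conditioned on `memory`.

    Returns (attention weights, context vector).
    """
    gate = [math.tanh(z) for z in matvec(W_m, memory)]
    scores = []
    for f in features:
        # Hidden state: feature projection gated by the memory projection.
        hidden = [math.tanh(a) * g for a, g in zip(matvec(W_f, f), gate)]
        scores.append(sum(w * h for w, h in zip(w_h, hidden)))
    alpha = softmax(scores)
    dim = len(features[0])
    # Context: attention-weighted average of the input features.
    context = [sum(a * f[d] for a, f in zip(alpha, features)) for d in range(dim)]
    if P is not None:  # visual branch only: one more learned layer
        context = [math.tanh(z) for z in matvec(P, context)]
    return alpha, context


identity = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
memory = [1.0, 0.0]
alpha_u, u_ctx = attend(features, memory, identity, identity, [1.0, 1.0])
alpha_v, v_ctx = attend(features, memory, identity, identity, [1.0, 1.0], P=identity)
```

In this toy setup the memory points along the first axis, so the first feature receives the larger weight; in the real model all the matrices are learned.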
TASK
- r-DAN
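  - joint memory vector: combines the visual and textual information.
  - Figure 3, mentioned earlier, depicts r-DAN; the two memory vectors drawn in that figure are exactly this joint memory.
    - It is also the memory vector that appears in Eqs. 3, 4, 7, and 8.
    - The initial memory vector is built from V(0) and U(0).
    - Is the initial value random, or global?
      - V(0) is the average pooling of the last CNN layer, i.e. the mean of each channel.
      - U(0) is (per Figure 2) the mean of the RNN outputs u_1..u_T.
        - Eqs. 3 and 7 are applied repeatedly, updating the memory at each step k.
          - Fig. 3 shows K=2.
          - The final part produces the answer.
    - single-layer softmax classifier with cross-entropy loss.
    - Its output is the probability of each candidate answer.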
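The r-DAN bookkeeping above can be sketched as follows. Two things here are assumptions rather than facts from these notes: that the visual and textual vectors are fused by an element-wise product, and that the memory is updated additively. V(0) is also simplified to a plain mean, without any learned projection. All function names and numbers are made up, and the step-1 context vectors would really come from the attention modules.

```python
import math


def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def hadamard(a, b):
    return [x * y for x, y in zip(a, b)]


def update_joint_memory(memory, v_ctx, u_ctx):
    """Assumed update: previous memory plus the fused visual/textual context."""
    return [m + j for m, j in zip(memory, hadamard(v_ctx, u_ctx))]


def answer_probabilities(memory, W_ans):
    """Single-layer softmax classifier over the candidate answers."""
    logits = [sum(w * m for w, m in zip(row, memory)) for row in W_ans]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


regions = [[1.0, 3.0], [3.0, 1.0]]   # image region vectors
words = [[1.0, 0.0], [0.0, 1.0]]     # text feature vectors
v0, u0 = mean_vector(regions), mean_vector(words)
m0 = hadamard(v0, u0)                # initial joint memory

# Context vectors for step 1 would come from the attention modules.
m1 = update_joint_memory(m0, [1.0, 2.0], [3.0, 0.5])
probs = answer_probabilities(m1, [[1.0, 0.0], [0.0, 1.0]])
```

With K=2 the update would run once more before the classifier; training would then minimise the cross-entropy between `probs` and the correct answer.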
- m-DAN
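  - jointly learns visual and textual attention models to capture the shared concepts between the two modalities.
  - Looking at Fig. 1 near the top, I asked whether the square states in r-DAN and the round states in m-DAN carry any special meaning. There is a reason they are drawn that way.
  - r-DAN joins the memories of the two inputs into one (Eq. 11), whereas m-DAN keeps them separate.
  - Structure
  - One of the terms is the same as in r-DAN.
  - At each step, a similarity score is computed between the visual and textual features.
    - Hence the memories are kept separate (visual and textual), as in Eqs. 16 and 17.
  - These are updated at each step.
  - K (number of steps) = 2
  - Loss function for training
    - bidirectional max-margin ranking loss (a form of metric learning)
      - defined over a positive pair set
      - and a negative pair set
      - margin m = 100
  - Inference
    - Similarity is computed as an inner product, equivalent to Eq. 19.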
    - zv and zu are the representations for image v and sentence u
      - cf) these vectors are obtained via separate pipelines of visual and textual attentions
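A small sketch of the m-DAN scoring and loss, under assumptions: the total similarity is taken as the sum of per-step inner products (which equals the inner product of the concatenated z vectors), and the loss is written in the usual bidirectional hinge form for a single positive pair with one negative image and one negative sentence. The function names and all numbers are made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def similarity(v_steps, u_steps):
    """Sum of the per-step inner products between visual and textual vectors."""
    return sum(dot(v, u) for v, u in zip(v_steps, u_steps))


def concat(steps):
    """Build z by concatenating the context vectors of all steps."""
    return [x for step in steps for x in step]


def ranking_loss(pos, neg_image, neg_sentence, margin=100.0):
    """Bidirectional max-margin loss for one positive pair.

    pos: score of the matching pair; neg_image / neg_sentence: scores with
    the image or the sentence swapped for a non-matching one.
    """
    return (max(0.0, margin - pos + neg_image)
            + max(0.0, margin - pos + neg_sentence))


v_steps = [[1.0, 2.0], [0.0, 1.0]]   # visual context vectors per step
u_steps = [[3.0, 1.0], [2.0, 2.0]]   # textual context vectors per step
score = similarity(v_steps, u_steps)
z_v, z_u = concat(v_steps), concat(u_steps)

loss = ranking_loss(pos=150.0, neg_image=20.0, neg_sentence=80.0)
```

Because the score factorises into z_v and z_u, image and sentence embeddings can be computed once through their separate pipelines and compared with a single inner product at retrieval time.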
Performance
- http://visualqa.org/roe.html

Answer 2 · 2019-07-31T05:50:00.000Z

Work exploring how to apply BERT: https://github.com/Hxx2048/cv-homework