Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Abstract

  • Proposed Method
    • A combined bottom-up and top-down visual attention mechanism is proposed.
    • The bottom-up mechanism (implemented with Faster R-CNN) proposes a set of salient image regions, with each region represented by a pooled convolutional feature vector (see the extraction sketch after this list).
    • The top-down mechanism uses task-specific context to predict an attention distribution over the image regions.
    • The feature glimpse is computed as a weighted average of image features over all regions (see the attention sketch after this list).
    • Performance 👍: achieves 70.2% overall accuracy on VQA v2.0.
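
A minimal sketch of one way to obtain such pooled region features. The paper's bottom-up detector is a Faster R-CNN (ResNet-101 backbone, trained on Visual Genome); as a rough stand-in, this uses torchvision's COCO-pretrained detector and a forward hook to capture the box head's pooled per-proposal feature vectors. The model choice, hook-based extraction, and 1024-d feature size are torchvision details, not the paper's setup.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Stand-in detector: torchvision's COCO-pretrained Faster R-CNN
# (the paper instead uses ResNet-101 trained on Visual Genome).
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

captured = {}
# The box head emits one pooled feature vector per region proposal;
# a forward hook grabs them as they are computed.
hook = model.roi_heads.box_head.register_forward_hook(
    lambda module, inputs, output: captured.update(features=output)
)

image = torch.rand(3, 480, 640)  # dummy RGB image, values in [0, 1]
with torch.no_grad():
    model([image])
hook.remove()

region_features = captured["features"]  # (num_proposals, 1024) pooled vectors
```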
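
And a sketch of the top-down step: a task-specific context vector scores every region, a softmax turns the scores into an attention distribution, and the feature glimpse is the attention-weighted average of the region features. The `TopDownAttention` name, layer sizes, and plain tanh scoring MLP are illustrative simplifications (the paper uses gated tanh layers).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    def __init__(self, feat_dim, ctx_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(feat_dim + ctx_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, context):
        # regions: (batch, k, feat_dim) pooled features from the bottom-up stage
        # context: (batch, ctx_dim) task-specific vector, e.g. a question encoding
        k = regions.size(1)
        ctx = context.unsqueeze(1).expand(-1, k, -1)       # repeat context for each region
        joint = torch.cat([regions, ctx], dim=-1)
        scores = self.score(torch.tanh(self.proj(joint)))  # (batch, k, 1) region scores
        alpha = F.softmax(scores, dim=1)                   # attention distribution over regions
        glimpse = (alpha * regions).sum(dim=1)             # weighted average = feature glimpse
        return glimpse, alpha.squeeze(-1)

# Example shapes: 36 regions per image (the paper's fixed variant), 2048-d features.
att = TopDownAttention(feat_dim=2048, ctx_dim=512, hidden_dim=512)
v = torch.randn(2, 36, 2048)
q = torch.randn(2, 512)
glimpse, alpha = att(v, q)  # glimpse: (2, 2048), alpha: (2, 36)
```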

Details

Personal Thoughts