Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
tenaflyyy commented
Abstract
- Proposed Method
- A combined bottom-up and top-down visual attention mechanism is proposed.
- The bottom-up mechanism (implemented with Faster R-CNN) proposes a set of salient image regions, each represented by a pooled convolutional feature vector.
- The top-down mechanism uses task-specific context to predict an attention distribution over the image regions.
- The feature glimpse is computed as a weighted average of the region features, with the weights given by the top-down attention distribution (see the sketch after this list).
- Performance 👍: achieves 70.2% overall accuracy on VQA v2.0.
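
A minimal PyTorch sketch of the top-down attention step described above (not the authors' code): it assumes `k` precomputed bottom-up region features and a task-specific context vector (e.g. a question encoding), scores each region with a small MLP, and returns the softmax-weighted average as the feature glimpse. Dimensions and the nonlinearity are simplified assumptions; the paper uses gated tanh layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Sketch of top-down attention over bottom-up region features."""
    def __init__(self, feat_dim=2048, ctx_dim=512, hidden_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, hidden_dim)   # project region features
        self.proj_q = nn.Linear(ctx_dim, hidden_dim)    # project context vector
        self.score = nn.Linear(hidden_dim, 1)           # scalar attention logit per region

    def forward(self, v, q):
        # v: (batch, k, feat_dim) pooled conv features for k salient regions
        # q: (batch, ctx_dim)     task-specific context (e.g. question encoding)
        joint = torch.tanh(self.proj_v(v) + self.proj_q(q).unsqueeze(1))
        logits = self.score(joint).squeeze(-1)          # (batch, k)
        alpha = F.softmax(logits, dim=-1)               # attention distribution over regions
        glimpse = (alpha.unsqueeze(-1) * v).sum(dim=1)  # weighted average of region features
        return glimpse, alpha

# Usage with random stand-in features (real inputs would be Faster R-CNN region features):
attn = TopDownAttention()
v = torch.randn(2, 36, 2048)   # 36 region proposals per image
q = torch.randn(2, 512)        # question encoding
glimpse, alpha = attn(v, q)
print(glimpse.shape, alpha.shape)  # torch.Size([2, 2048]) torch.Size([2, 36])
```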