Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
tenaflyyy commented
Abstract
- Proposed Method
- A combined bottom-up and top-down visual attention mechanism is proposed.
- The bottom-up mechanism (implemented with Faster R-CNN) proposes a set of salient image regions, each represented by a pooled convolutional feature vector.
- The top-down mechanism uses task-specific context to predict an attention distribution over the image regions.
- The feature glimpse is computed as a weighted average of the region features, with the weights given by the top-down attention distribution (see the sketch after this list).
- Performance 👍: achieves 70.2% overall accuracy on VQA v2.0.
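
A minimal PyTorch sketch of the top-down attention step described above (not the authors' code): it assumes `k` precomputed bottom-up region features and a task-specific context vector (e.g. a question encoding), scores each region with a small MLP, and returns the softmax-weighted average as the feature glimpse. Dimensions and the nonlinearity are simplified assumptions; the paper uses gated tanh layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Sketch of top-down attention over bottom-up region features."""
    def __init__(self, feat_dim=2048, ctx_dim=512, hidden_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, hidden_dim)   # project region features
        self.proj_q = nn.Linear(ctx_dim, hidden_dim)    # project context vector
        self.score = nn.Linear(hidden_dim, 1)           # scalar attention logit per region

    def forward(self, v, q):
        # v: (batch, k, feat_dim) pooled conv features for k salient regions
        # q: (batch, ctx_dim)     task-specific context (e.g. question encoding)
        joint = torch.tanh(self.proj_v(v) + self.proj_q(q).unsqueeze(1))
        logits = self.score(joint).squeeze(-1)          # (batch, k)
        alpha = F.softmax(logits, dim=-1)               # attention distribution over regions
        glimpse = (alpha.unsqueeze(-1) * v).sum(dim=1)  # weighted average of region features
        return glimpse, alpha

# Usage with random stand-in features (real inputs would be Faster R-CNN region features):
attn = TopDownAttention()
v = torch.randn(2, 36, 2048)   # 36 region proposals per image
q = torch.randn(2, 512)        # question encoding
glimpse, alpha = attn(v, q)
print(glimpse.shape, alpha.shape)  # torch.Size([2, 2048]) torch.Size([2, 36])
```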