Motion-Appearance Co-Memory Networks for Video Question Answering
tenaflyyy opened this issue · 0 comments
Abstract
- Three unique attributes of video QA compared with image QA:
  - it deals with long sequences of images;
  - motion and appearance information are usually correlated with each other and provide useful attention cues;
  - different questions require different numbers of frames to infer the answer.
- Proposed Method
  - Builds on concepts from the Dynamic Memory Network (DMN) and introduces new mechanisms.
  - Three salient aspects:
    - utilizes cues from both motion and appearance to generate attention;
    - a temporal conv-deconv network to generate multi-level contextual facts;
    - a dynamic fact ensemble method to construct temporal representations dynamically for different questions.
- Datasets
  - TGIF-QA dataset.
  - The results significantly outperform the state of the art on all four tasks of TGIF-QA.
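A minimal pure-Python sketch of the multi-level contextual facts idea: a strided temporal convolution repeatedly downsamples the frame-level feature sequence, so each level summarizes a longer temporal context. The function names, the scalar features, and the smoothing kernel are illustrative stand-ins, not the paper's actual architecture (which also uses deconvolution to map levels back to a common resolution).

```python
# Illustrative sketch only: building a temporal pyramid of "contextual
# facts" from a frame-level feature sequence. Scalar features and the
# kernel are assumptions for readability, not the paper's design.

def conv1d(seq, kernel, stride):
    """Valid 1-D convolution over a list of scalar features."""
    k = len(kernel)
    return [sum(kernel[j] * seq[i + j] for j in range(k))
            for i in range(0, len(seq) - k + 1, stride)]

def multilevel_facts(features, levels=2):
    """Each successive level covers a wider temporal context."""
    kernel = [0.25, 0.5, 0.25]  # simple smoothing kernel (sums to 1)
    facts, cur = [list(features)], list(features)
    for _ in range(levels):
        cur = conv1d(cur, kernel, stride=2)  # halve temporal resolution
        facts.append(cur)
    return facts
```

A dynamic fact ensemble (the paper's third aspect) would then pick or weight these levels per question, since different questions need different temporal extents.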
Details
Introduction
- The model is built on the concepts of DMN/DMN+ and shares terminology with DMN, such as facts, memory, and attention.
- A video is converted into a sequence of motion and appearance features by a two-stream model [arxiv:1608.00797]. These features are then fed into a temporal convolutional and deconvolutional neural network to build multi-level contextual facts.
- These contextual facts are used as input facts to the memory networks.
- The co-memory networks hold two separate memory states, one for motion and one for appearance.
- A co-memory attention mechanism takes motion cues for appearance attention generation, and appearance cues for motion attention generation.
- A dynamic fact ensemble method produces temporal facts dynamically at each cycle of fact encoding.
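The co-memory attention idea above can be sketched in a few lines of pure Python: attention over one modality's facts is scored using the *other* modality's memory state. The scalar facts, dot-product scoring, and averaging memory update are hedged stand-ins for the paper's learned attention and GRU-style update networks.

```python
# Hedged sketch of co-memory attention: appearance attention is driven
# by the motion memory and vice versa. All names and the simple update
# rule are illustrative assumptions, not the paper's exact formulation.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(facts, cue):
    """Weight facts by a dot-product score against the cross-modal cue."""
    weights = softmax([f * cue for f in facts])
    return sum(w * f for w, f in zip(weights, facts))

def co_memory_step(app_facts, mot_facts, app_mem, mot_mem):
    app_ctx = attend(app_facts, mot_mem)  # motion cue -> appearance attention
    mot_ctx = attend(mot_facts, app_mem)  # appearance cue -> motion attention
    # simple averaging update (stand-in for a learned memory update)
    return 0.5 * (app_mem + app_ctx), 0.5 * (mot_mem + mot_ctx)
```

Running several such steps corresponds to the multiple memory-update cycles in DMN-style models, with the two memories steering each other's attention at every cycle.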
Contributions
Experiments