Dense-Captioning Events in Videos

Abstract

  • Domain
    • The paper involves both detecting and describing events in a video.
  • Proposed method
    • A proposal module designed to detect both short events and long events that span minutes (see the sketch after this list).
    • A new captioning module that uses contextual information from past and future events to capture the dependencies between the events in a video.
    • The ActivityNet Captions benchmark for dense captioning is introduced; it contains 20k videos amounting to 849 video hours with 100k total descriptions.
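
A minimal sketch of the multi-scale proposal idea (an illustration under assumptions, not the authors' code): a DAPs-style LSTM scans precomputed C3D clip features and, at each time step, scores K anchor segments that end at that step; running the same proposer over features subsampled at several strides lets the fixed anchors cover both short and minute-long events. All class names, dimensions, and stride values below are assumed for illustration.

```python
import torch
import torch.nn as nn

class DapsProposer(nn.Module):
    """DAPs-style proposer: an LSTM scans clip features and, at each
    step t, scores K anchor segments that end at t."""
    def __init__(self, feat_dim=500, hidden_dim=512, num_anchors=16):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, num_anchors)

    def forward(self, feats):                    # feats: (B, T, feat_dim)
        h, _ = self.lstm(feats)                  # (B, T, hidden_dim)
        return torch.sigmoid(self.score(h))      # (B, T, K) confidences

def multi_scale_proposals(proposer, feats, strides=(1, 2, 4, 8)):
    """Run the proposer over temporally subsampled features so the
    same K anchors cover events from seconds up to minutes."""
    all_scores = []
    for s in strides:
        scores = proposer(feats[:, ::s, :])      # (B, T//s, K)
        all_scores.append((s, scores))           # anchor k ending at step t
    return all_scores                            # maps to a segment at stride s
```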

Detail

  • Introduction

    • Dense-captioning events is analogous to dense image captioning: it describes and localizes events in time, whereas dense image captioning describes and localizes regions in space.
    • Dense-captioning events comes with its own set of challenges distinct from the image case.
      • One observation is that events in videos can range across multiple time scales and can even overlap.
      • Another key observation is that the events in a given video are usually related to one another.
    • To overcome vanishing gradients when encoding long video sequences that span minutes, the paper extends recent work on generating action proposals [V. Escorcia et al., Deep Action Proposals for Action Understanding] to multi-scale detection of events.
    • A captioning module utilizes the context from all other events produced by the proposal module to generate each sentence (see the sketch after this list).
    • The ActivityNet Captions dataset contains 20k videos taken from ActivityNet; videos are as long as 10 minutes, and each video is annotated with 3.65 sentences on average.
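
A minimal sketch of the context-aware captioning idea, under assumptions: the attention form, dimensions, and class names below are illustrative, not the paper's exact architecture. The event being described attends separately over past and future proposal features, and the decoder LSTM conditions every decoding step on the concatenated [event; past context; future context] vector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareCaptioner(nn.Module):
    """Illustrative captioner conditioning on past/future event context."""
    def __init__(self, feat_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.attn = nn.Linear(2 * feat_dim, 1)   # scores an (event, other) pair
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim + 3 * feat_dim, hidden_dim,
                               batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def context(self, event, others):            # event: (D,), others: (N, D)
        if others.numel() == 0:                  # no past/future neighbours
            return torch.zeros_like(event)
        pairs = torch.cat([event.expand_as(others), others], dim=-1)
        w = F.softmax(self.attn(pairs), dim=0)   # (N, 1) attention weights
        return (w * others).sum(dim=0)           # (D,) weighted context

    def forward(self, event, past, future, tokens):    # tokens: (L,) word ids
        ctx = torch.cat([event,
                         self.context(event, past),    # attend over past events
                         self.context(event, future)]) # and over future events
        emb = self.embed(tokens).unsqueeze(0)          # (1, L, hidden_dim)
        cond = ctx.expand(1, emb.size(1), -1)          # same context each step
        h, _ = self.decoder(torch.cat([emb, cond], dim=-1))
        return self.out(h)                             # (1, L, vocab) logits
```

Here `event`, `past`, and `future` would come from the proposal module above; the paper conditions the captioner on the proposal network's hidden states, which this sketch approximates with per-proposal feature vectors.
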
  • Contributions

  • Experiments

Personal Thoughts