Video Captioning Based on Both Egocentric and Exocentric Views of Robot Vision for Human-Robot Interaction

Robot vision data can be treated as first-person (egocentric) video.

One egocentric video can contain three kinds of situation:

  1. Global - Describes the overall situation, including details such as place, lighting, and weather.
  2. Action - Describes what the subject (i.e., "I") is doing.
  3. Interaction - Describes the interaction or behavior between the subject (i.e., "me") and others.
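
The three caption types above can be pictured as one record per video clip. The following is only an illustrative sketch: the class name `GAICaption`, its field names, and the example sentences are hypothetical and are not taken from this repository or from the dataset annotations.

```python
from typing import NamedTuple


class GAICaption(NamedTuple):
    """Hypothetical container holding the three captions for one clip."""
    global_caption: str       # overall scene: place, lighting, weather
    action_caption: str       # what the camera wearer ("I") is doing
    interaction_caption: str  # behavior between the wearer and others


# Made-up example sentences, for illustration only.
example = GAICaption(
    global_caption="It is a bright afternoon inside a supermarket.",
    action_caption="I am pushing a shopping cart.",
    interaction_caption="I am talking with the cashier.",
)

# Print each caption type next to its text.
for name, text in example._asdict().items():
    print(name, "->", text)
```

Keeping the three captions in separate fields makes it straightforward to train or evaluate each caption type independently, if that is how the model is organized.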

Global Action Interaction (GAI)


  • Python 3.6
  • TensorFlow 1.5.0

How To Run

  1. Download the UTEgocentric dataset: [Preprocessed Dataset] (https://drive.google.com/file/d/1IlX_WosLWfqRnIGIobI9gipZ8EGOJIUz/view?usp=sharing)

  2. Extract Video Features

    $ python extract_RGB_feature.py
  3. Train model

    $ python train.py
  4. Test model

    $ python test.py
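
The contents of `extract_RGB_feature.py` are not shown here, so the following is only a sketch of one common preprocessing step in video captioning pipelines: choosing a fixed number of evenly spaced frames per video before running them through a CNN such as VGG. The function name `sample_frame_indices` and the default of 80 frames are assumptions for illustration, not values confirmed by this repository.

```python
def sample_frame_indices(num_frames, num_samples=80):
    """Return num_samples frame indices spread evenly over a video.

    num_frames:  total number of frames in the video (must be positive)
    num_samples: how many frames to keep (hypothetical default of 80)
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    # Spacing between consecutive samples, measured in frames.
    step = (num_frames - 1) / (num_samples - 1)
    # Round to the nearest frame and clamp to the last valid index.
    return [min(num_frames - 1, int(round(i * step))) for i in range(num_samples)]


indices = sample_frame_indices(num_frames=800, num_samples=80)
print(len(indices), indices[0], indices[-1])
```

When a video has fewer frames than `num_samples`, this sketch repeats some indices so that the output length stays fixed.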


S2VT model by chenxinpeng

VGG model by AlonDaks/tsa-kaggle

Attention by [AdrianHsu] (https://github.com/AdrianHsu/S2VT-seq2seq-video-captioning-attention)

Dataset by [UTEgocentric] (http://vision.cs.utexas.edu/projects/egocentric/storydriven.html)