Video Captioning Based on Both Egocentric and Exocentric Views of Robot Vision for Human-Robot Interaction
Robot vision data can be thought of as first-person (egocentric) video. One egocentric video can be described from three perspectives (a small annotation sketch follows this list):
- Global - describes the overall situation, including details such as the place, lighting, and weather.
- Action - describes what the subject, i.e. I, is doing.
- Interaction - describes the interaction or behavior between the subject, i.e. me, and others.
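As a rough illustration of these three caption types, the sketch below shows how one egocentric clip might be annotated. The key names and example sentences are hypothetical and are not the repository's actual annotation format.

```python
# Hypothetical annotation record for one egocentric clip. The key names and the
# example sentences are illustrative only; they are not the repository's actual
# caption format.
example_annotation = {
    "video_id": "UT_Ego_clip_0001",  # assumed naming, for illustration
    "global": "I am walking through a bright outdoor market on a sunny day.",
    "action": "I am picking up a piece of fruit from a stall.",
    "interaction": "I am handing money to the vendor and talking with him.",
}

for caption_type in ("global", "action", "interaction"):
    print(caption_type, "->", example_annotation[caption_type])
```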
Global Action Interaction (GAI)
- Python 3.6
- TensorFlow 1.5.0
- Download the UT Egocentric dataset and the [preprocessed dataset](https://drive.google.com/file/d/1IlX_WosLWfqRnIGIobI9gipZ8EGOJIUz/view?usp=sharing)
- Extract video features
$ python extract_RGB_feature.py
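As a rough sketch of what this step produces, the snippet below samples frames from a clip and extracts per-frame VGG fc-layer features. It assumes a Keras-style VGG16 from tf.keras.applications and 80 uniformly sampled frames per video; the repository's actual extract_RGB_feature.py may use a different VGG implementation, feature layer, and frame count.

```python
# Minimal sketch of per-frame RGB feature extraction; NOT the repository's
# extract_RGB_feature.py. It assumes a Keras-style VGG16 from
# tf.keras.applications and uniform sampling of 80 frames per clip.
import cv2
import numpy as np
import tensorflow as tf

NUM_FRAMES = 80  # assumed number of sampled frames per video

def sample_frames(video_path, num_frames=NUM_FRAMES):
    """Uniformly sample frames and resize them to the 224x224 VGG input size."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR
        frames.append(cv2.resize(frame, (224, 224)))
    cap.release()
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    return np.array([frames[i] for i in idx], dtype=np.float32)

def extract_features(video_path):
    """Return a (num_frames, 4096) array of VGG16 fc2 activations."""
    vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=True)
    fc2 = tf.keras.Model(vgg.input, vgg.get_layer("fc2").output)
    frames = tf.keras.applications.vgg16.preprocess_input(sample_frames(video_path))
    return fc2.predict(frames, batch_size=16)

if __name__ == "__main__":
    feats = extract_features("example_clip.mp4")   # hypothetical input path
    np.save("example_clip.npy", feats)             # one feature file per video
```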
- Train the model
$ python train.py
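For orientation, the snippet below is a minimal Keras-style sketch of the S2VT-like encoder-decoder idea: one LSTM encodes the frame features, a second LSTM decodes the caption word by word with teacher forcing. It is an illustration only; the repository's train.py builds a TensorFlow 1.x graph with an attention mechanism and its own vocabulary and data pipeline, and all sizes below are assumed.

```python
# Minimal Keras-style sketch of the S2VT-like idea: one LSTM encodes the frame
# features, a second LSTM decodes the caption word by word with teacher forcing.
# This is an illustration only; the repository's train.py builds a TensorFlow 1.x
# graph with an attention mechanism and its own data pipeline.
import numpy as np
import tensorflow as tf

NUM_FRAMES, FEAT_DIM = 80, 4096                 # matches the VGG fc-layer features
VOCAB_SIZE, MAX_WORDS, HIDDEN = 5000, 20, 512   # assumed sizes, for illustration

# Encoder: frame features -> final LSTM state.
feat_in = tf.keras.Input(shape=(NUM_FRAMES, FEAT_DIM))
_, h, c = tf.keras.layers.LSTM(HIDDEN, return_state=True)(feat_in)

# Decoder: previous caption words -> distribution over the next word.
word_in = tf.keras.Input(shape=(MAX_WORDS,))
emb = tf.keras.layers.Embedding(VOCAB_SIZE, HIDDEN)(word_in)
dec = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(emb, initial_state=[h, c])
probs = tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")(dec)

model = tf.keras.Model([feat_in, word_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Dummy batch just to show the expected tensor shapes.
feats = np.random.rand(2, NUM_FRAMES, FEAT_DIM).astype(np.float32)
words_in = np.random.randint(0, VOCAB_SIZE, size=(2, MAX_WORDS))
words_out = np.random.randint(0, VOCAB_SIZE, size=(2, MAX_WORDS))
model.fit([feats, words_in], words_out, epochs=1)
```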
- Test the model
$ python test.py
- S2VT model by chenxinpeng
- VGG model by AlonDaks/tsa-kaggle
- Attention mechanism by [AdrianHsu](https://github.com/AdrianHsu/S2VT-seq2seq-video-captioning-attention)
- Dataset from [UT Egocentric](http://vision.cs.utexas.edu/projects/egocentric/storydriven.html)