- Translating Videos to Natural Language Using Deep Recurrent Neural Networks NAACL-HLT 2015
- Sequence to Sequence – Video to Text ICCV 2015
- Learning Spatiotemporal Features with 3D Convolutional Networks ICCV 2015
- Describing Videos by Exploiting Temporal Structure ICCV 2015
- Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning CVPR 2016
- Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text EMNLP 2016
- Jointly Modeling Embedding and Translation to Bridge Video and Language CVPR 2016
- Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks CVPR 2016
- MSR-VTT: A Large Video Description Dataset for Bridging Video and Language CVPR 2016
- Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks ICCV 2017
- RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos ICCV 2017
- Dense-Captioning Events in Videos ICCV 2017
- Video Captioning With Transferred Semantic Attributes CVPR 2017
- Weakly Supervised Dense Video Captioning CVPR 2017
- Improving Interpretability of Deep Neural Networks with Semantic Information CVPR 2017
- Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description CVPR 2017
- Multi-Task Video Captioning with Video and Entailment Generation ACL 2017
- Reinforced Video Captioning with Entailment Rewards EMNLP 2017
- Reconstruction Network for Video Captioning CVPR 2018
- Video Captioning via Hierarchical Reinforcement Learning CVPR 2018
- Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning CVPR 2018
- M3: Multimodal Memory Modelling for Video Captioning CVPR 2018
- End-to-End Dense Video Captioning with Masked Transformer CVPR 2018
- Less Is More: Picking Informative Frames for Video Captioning ECCV 2018
- Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning NAACL-HLT 2018
- No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling ACL 2018
- Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning AAAI 2019
- Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning CVPR 2019