A video clip often contains rich content and semantic details, and different people may be interested in different aspects of it, so a caption generated by an existing video captioning method often fails to meet ad hoc expectations. We propose a new task called Video Interactive Captioning (ViCap): when the initial caption generated by a captioning agent does not meet a human's expectation, the human can launch a round of interaction and give the agent a short prompt, and the agent then generates a more accurate caption based on that prompt. The figure below shows an example of a human interacting with a ViCap agent. First, the agent generates an initial description of a video clip (the blue words). Because the human is interested in certain frames of the clip, the description fails to meet his expectation, so he gives a short prompt, i.e., the first two words of the expected caption (the purple words). Finally, based on the prompt, the agent generates a more accurate caption (the red words).
From a practical point of view, a human could provide prompts in any form, e.g., of different lengths, and the prompts could appear anywhere in the generated caption. In this paper, we take a first step and set the prompt to be the first two words of the caption to be generated. This makes ViCap different from traditional captioning tasks, and it may support various applications such as video retrieval, video chat robots, and even human-robot interaction. We design the ViCap model to include a video encoder, an initial caption encoder, and a refined caption generator, and we show that it can be trained either with full supervision, i.e., supervised with the ground-truth captions, or with weak supervision, i.e., supervised with only the prompts.
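The following is a minimal PyTorch sketch of how these three components could be wired together. The module names, feature sizes, and fusion strategy (concatenating the two encoder states with the word embedding at every decoding step) are illustrative assumptions, not the exact implementation used in this repo.

```python
import torch
import torch.nn as nn

class ViCapSketch(nn.Module):
    """Video encoder + initial caption encoder + refined caption generator (sketch)."""
    def __init__(self, vocab_size, feat_dim=2048, hid=512, emb=300):
        super().__init__()
        self.video_enc = nn.GRU(feat_dim, hid, batch_first=True)    # encodes frame features
        self.embed = nn.Embedding(vocab_size, emb)
        self.caption_enc = nn.GRU(emb, hid, batch_first=True)        # encodes the initial caption
        self.decoder = nn.GRUCell(emb + 2 * hid, hid)                # refined caption generator
        self.out = nn.Linear(hid, vocab_size)

    def encode(self, frame_feats, init_caption_ids):
        _, hv = self.video_enc(frame_feats)                          # (1, batch, hid)
        _, hc = self.caption_enc(self.embed(init_caption_ids))       # (1, batch, hid)
        return hv.squeeze(0), hc.squeeze(0)

    def step(self, word_id, h, video_ctx, caption_ctx):
        """One decoding step: fuse the word embedding with both encoder contexts."""
        x = torch.cat([self.embed(word_id), video_ctx, caption_ctx], dim=-1)
        h = self.decoder(x, h)
        return self.out(h), h
```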
We use a GRU to encode the video and the initial caption, and we devise both an improved GRU decoder and a convolutional decoder. For the convolutional decoder, we stack multiple dilated convolutional layers followed by gated activation units in order to capture dependencies among the frame and word sequences. The figure below illustrates the structure of the convolutional decoder, which consists of five dilated convolutional layers with dilation rates set to 1, 1, 2, 4, and 2, respectively. The green words are the ones generated by the decoder. With this hierarchical convolutional architecture, the decoder can capture dependencies at various ranges within the sequence.
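Below is a sketch of such a decoder: five causal 1-D convolutions with dilation rates 1, 1, 2, 4, 2, each followed by a gated activation unit (tanh x sigmoid). The channel width, kernel size, and residual connection are assumptions for illustration rather than the exact configuration used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedDilatedLayer(nn.Module):
    def __init__(self, channels, dilation, kernel_size=3):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation        # left-pad so the convolution stays causal
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, time)
        h = self.conv(F.pad(x, (self.pad, 0)))
        a, b = h.chunk(2, dim=1)
        return x + torch.tanh(a) * torch.sigmoid(b)    # gated activation unit + residual

class ConvDecoderSketch(nn.Module):
    """Stack of dilated convolutional layers with dilation rates 1, 1, 2, 4, 2."""
    def __init__(self, channels=512, dilations=(1, 1, 2, 4, 2)):
        super().__init__()
        self.layers = nn.ModuleList(GatedDilatedLayer(channels, d) for d in dilations)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
```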
The figure below takes the GRU decoder as an example to illustrate the difference between the two training regimes. During training with full supervision (left), at each time step we feed the ground-truth word (the red word) into the decoder to predict the next word. During training with prompt supervision (right), apart from the prompt words (the purple words), we only feed the decoder's previous output (the green word) back in as its current input.
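The two decoding loops below sketch this difference, assuming a generic `step(word_id, h, video_ctx, caption_ctx)` decoder such as the one sketched earlier; the function names and maximum length are illustrative, not the repo's API.

```python
import torch

def decode_full_supervision(model, gt_words, h, video_ctx, caption_ctx):
    """Teacher forcing: feed the ground-truth word at every step."""
    logits = []
    for t in range(gt_words.size(1) - 1):
        out, h = model.step(gt_words[:, t], h, video_ctx, caption_ctx)
        logits.append(out)
    return torch.stack(logits, dim=1)          # compared against gt_words[:, 1:]

def decode_prompt_supervision(model, prompt_words, h, video_ctx, caption_ctx, max_len=20):
    """Feed the two prompt words, then feed back the decoder's own predictions."""
    word = prompt_words[:, 0]
    logits = []
    for t in range(max_len):
        out, h = model.step(word, h, video_ctx, caption_ctx)
        logits.append(out)
        if t + 1 < prompt_words.size(1):
            word = prompt_words[:, t + 1]      # still inside the prompt
        else:
            word = out.argmax(dim=-1)          # the model's previous output
    return torch.stack(logits, dim=1)
```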
For weak supervision with only prompts, we devise a method that uses the output of the decoder to reconstruct the input of the decoder. The figure below illustrates the structure of the GRU reconstruction network.
The figure below illustrates the structure of the convolutional reconstruction network, in which we stack three dilated convolutional layers with dilation rates set to 1, 2, and 2, respectively.
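The sketch below shows one plausible form of this convolutional reconstruction network: three dilated 1-D convolutions (dilation rates 1, 2, 2) that map the decoder's output sequence back to its input sequence, so that only the prompt words are needed as direct supervision. The layer width, activation, and reconstruction loss are assumptions for illustration.

```python
import torch.nn as nn

class ConvReconstructorSketch(nn.Module):
    """Three dilated convolutional layers (dilation rates 1, 2, 2)."""
    def __init__(self, channels=512, dilations=(1, 2, 2)):
        super().__init__()
        self.layers = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3,
                          dilation=d, padding=d),      # "same" padding for kernel size 3
                nn.ReLU(),
            )
            for d in dilations
        ])

    def forward(self, decoder_outputs):                 # (batch, channels, time)
        return self.layers(decoder_outputs)             # reconstruction of the decoder inputs

# Assumed training signal: a reconstruction loss between this output and the
# decoder's input features, e.g. nn.MSELoss()(recon, decoder_inputs).
```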
We extend the MSRVTT-2016 dataset. The 1st to 4500th video clips are used as the training set for the pre-trained model, and the 4501st to 5000th clips as its validation set. For the ViCap models, the 5001st to 8500th clips serve as the training set, the 8501st to 9000th clips as the validation set, and the 9001st to 10000th clips as the test set. In addition, for each clip in the validation and test sets of the ViCap models, we provide three semantically different sentences as annotations. Finally, when evaluating the ViCap models, for each annotation in the test set we remove the first two words and take the remaining words as the ground truth for evaluation.
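A small sketch of this evaluation preprocessing is shown below: the first two words (the prompt) are stripped from each reference annotation before metrics are computed. The video ID and sentences are made-up examples, and the data layout (three annotations per clip) follows the description above.

```python
def strip_prompt(annotation, prompt_len=2):
    """Drop the prompt words from a reference sentence."""
    return " ".join(annotation.split()[prompt_len:])

# Hypothetical test-set references (three annotations per clip).
test_refs = {
    "video9001": ["a man is playing a guitar on stage",
                  "a man is performing a song",
                  "a musician is strumming a guitar"],
}
eval_refs = {vid: [strip_prompt(s) for s in sents] for vid, sents in test_refs.items()}
# eval_refs["video9001"][0] == "is playing a guitar on stage"
```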
Examples: the videos shown here are short due to the GIF size constraint on GitHub, and they are selected from the test set. "Initial" denotes the initial caption generated by the S2VT model. For each of these videos, we provide three semantically different annotations.
The following table shows the performance under full supervision. We can see that the convolutional decoder outperforms the GRU decoder.
For the case of full supervision, we conduct some ablation studies. The results are shown in the following three tables.
The following table shows the performance under prompt supervision. We can see that the convolutional reconstruction network outperforms the GRU reconstruction network.
For the case of prompt supervision, we conduct some ablation studies. The results are shown in the following two tables.
Qualitative results generated by our convolutional decoder trained with full supervision. "Initial" indicates the initial caption generated by the S2VT model, "GT" indicates the ground truth, and the red words indicate the final generated words.