How are you generating ground truth for narrations?
thechargedneutron opened this issue · 4 comments
Thanks for the good work. I have a simple question regarding data processing.
For the video clips, the CrossTask dataset provides annotations. For example, the annotations in 113766_JFnZHAOUClw.csv
are as follows:
1,40.51,44.21
2,46.43,48.93
2,51.44,52.84
3,65.4,68.4
4,76.12,77.92
6,78.25,82.65
3,89.24,91.14
4,98.59,100.29
8,118.06,121.06
10,121.92,126.22
8,127.71,130.61
10,133.72,137.72
This means that step 1, "season steak", happens (visually) from 40.51 to 44.21, and so on.
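For reference, this is roughly how I am reading that file. It is only a sketch: I am assuming each row is step index, start time, end time (times presumably in seconds), and `parse_annotations` and `steps_at` are my own hypothetical helper names, not anything from the repository.

```python
import csv
import io

# Rows copied from the annotation listing above.
# Assumed column order: step index, start time, end time.
ANNOTATION_TEXT = """1,40.51,44.21
2,46.43,48.93
2,51.44,52.84
3,65.4,68.4
4,76.12,77.92
6,78.25,82.65
3,89.24,91.14
4,98.59,100.29
8,118.06,121.06
10,121.92,126.22
8,127.71,130.61
10,133.72,137.72
"""


def parse_annotations(text):
    """Return a list of (step_index, start, end) tuples, one per row."""
    segments = []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        step, start, end = row
        segments.append((int(step), float(start), float(end)))
    return segments


def steps_at(segments, t):
    """Return the step indices whose [start, end] interval contains time t."""
    return [step for step, start, end in segments if start <= t <= end]


segments = parse_annotations(ANNOTATION_TEXT)
```

With this, `steps_at(segments, 42.0)` gives the step that is visually active at that time, which is the visual ground truth. What I am missing is the equivalent for narrations.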
But for the narrations, how are you mapping key-steps to narrations? Table 1 in the paper shows the mapping in bold, but I could not find where this is done in the code. Can you please point me to it? I need ground truth narrations mapped to the key-steps for my research.
For Table 1 in the paper, we manually marked the alignment between the verb phrases extracted from the narrations and the ground truth key-steps. We don't have code that does this; we only marked the key-steps in narrations by hand for a few videos.
We also tried localizing the key-steps by comparing the similarity between semantic embeddings of the narrations and of the key-steps. This can give some meaningful alignments, but the alignment is not perfect and its quality is difficult to quantify.
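A minimal sketch of that kind of similarity matching is below. It is not the code used for the paper. To stay self-contained it swaps the semantic embedding for a toy bag-of-words vector (`embed` is a stand-in; any sentence encoder could be substituted), and apart from "season steak" the key-step and narration strings are made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a semantic embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def align(narrations, key_steps):
    """For each narration, return (narration, most similar key-step, score)."""
    step_vecs = [(s, embed(s)) for s in key_steps]
    results = []
    for n in narrations:
        nv = embed(n)
        best_step, best_score = max(
            ((s, cosine(nv, sv)) for s, sv in step_vecs), key=lambda p: p[1]
        )
        results.append((n, best_step, best_score))
    return results


# Illustrative inputs only; not taken from the dataset.
key_steps = ["season steak", "put steak on grill"]
narrations = [
    "now season the steak with salt",
    "put the steak on the hot grill",
    "thanks for watching",
]
alignments = align(narrations, key_steps)
```

Note that every narration is assigned some key-step, even an unrelated one such as the last line, which gets a score of 0. Deciding where to threshold the score is part of why the quality of this kind of alignment is hard to quantify.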
Makes sense. Just to confirm my understanding: did you also manually mark the key-steps in the narrations shown in Figure 1?
Yes, you are correct.
Thanks!