gulvarol/bsl1k

Extracting BSL-1k clips from BOBSL

hshreeshail opened this issue · 5 comments

This issue is in reference to extracting the video clips of individual signs from BOBSL that form the BSL-1K dataset. In mouthing_spottings.json, the global_times annotation is a single timestamp value (instead of a (start, end) range). How do I extract the corresponding clip from this? Are all the clips of the same length?

Just read section A.3 of the appendix. So, can I assume that the timestamped frame is the last frame of the clip and take the 24 frames before it?
P.S.: I am assuming that the mouthing_spottings.json file in the BOBSL dataset corresponds to BSL-1K.

A couple more queries:
1] Why are the global_times annotations given in seconds rather than as frame numbers? Is it to allow for different frame rates?
2] For the default setting with frame_rate = 25, if a timestamp is sss.mmm (seconds and milliseconds), shouldn't the milliseconds part be a multiple of 40 (= 1000/25)? But the values in the annotations file do not satisfy this property.

Thank you for your questions. We’ve addressed them below; please let us know if anything is unclear:

(1) BOBSL vs BSL-1K - Although BOBSL and BSL-1K are constructed in a similar manner, they cover different sign-language-interpreted TV shows and therefore contain different annotations. BSL-1K is described in the paper "BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues", ECCV'20, and is not released. BOBSL is released and is described in the paper "BOBSL: BBC-Oxford British Sign Language Dataset", arXiv'21.

(2) Annotation is a single timestamp. - Yes, we only record the point in time that gives the maximum response over the search timeline. Since these are automatically mined annotations, we do not have accurate start/end times. We experimentally determine a fixed clip length around these points based on the annotation type. See the next point.

(3) How to extract clips from the annotations? Are all the clips of the same length? - For the original annotations from spottings.tar.gz, the windows (in frames) around mouthing M, dictionary D and attention A times should be [-15, 4], [-3, 22] and [-8, 18] respectively. New and more annotations for BOBSL can be downloaded from the ECCV'22 paper. From experimenting with different windows around the annot_time key, we find the following to work best: M* [-9, 11], D* [-3, 22], P [0, 19], E [0, 19], N [0, 19]. Details of these annotation types can be found in that paper. We randomly sample 16 contiguous frames from these windows for training, and perform sliding-window averaging at test time. You can also use the helper script at misc/bsl1k/extract_clips.py, which you would need to modify by setting the --num_frames_before and --num_frames_after arguments; a minimal sketch of cutting such a window is given below.
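Below is a minimal sketch of how such a window could be cut around annot_time, assuming 25 fps videos readable with OpenCV. The WINDOWS table simply restates the offsets quoted above; extract_clip and its arguments are hypothetical names for illustration, not the interface of misc/bsl1k/extract_clips.py.

```python
# Sketch only: frame offsets [before, after] around the annotated time, per
# annotation type, as listed in the answer above.
import cv2  # opencv-python

FPS = 25
WINDOWS = {
    "M": (-15, 4),   # mouthing (original spottings.tar.gz)
    "D": (-3, 22),   # dictionary
    "A": (-8, 18),   # attention
    "M*": (-9, 11),  # new ECCV'22 annotations below
    "D*": (-3, 22),
    "P": (0, 19),
    "E": (0, 19),
    "N": (0, 19),
}

def extract_clip(video_path, annot_time, annot_type):
    """Return the frames in the fixed window around annot_time (in seconds)."""
    before, after = WINDOWS[annot_type]
    center = round(annot_time * FPS)        # annotated time -> frame index
    start, end = max(center + before, 0), center + after

    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start)  # seek to the first frame of the window
    frames = []
    for _ in range(end - start + 1):
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames
```

For training, one would then randomly sample 16 contiguous frames from the returned window; at test time, predictions are averaged over a sliding window, as noted above.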

(4) Why are the global_times annotations in seconds rather than the frame number? Is it to allow for different frame rates? Yes - you should be able to find the frames easily.

(5) For the default setting with frame_rate = 25, if a timestamp is sss.mmm (seconds and milliseconds), shouldn't the milliseconds part be a multiple of 40 (= 1000/25)? - We’ve rounded the times to 3 decimal places.
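As a small illustrative sketch (the value below is made up, not taken from the annotation files), snapping an annotated time back to a frame index at 25 fps is just a rounding operation, regardless of whether the millisecond part happens to be a multiple of 40:

```python
FPS = 25  # default BOBSL frame rate

def time_to_frame(t_sec, fps=FPS):
    """Snap a global_times value (seconds, rounded to 3 dp) to the nearest frame index."""
    return round(t_sec * fps)

# Illustrative only: 49.357 s is not a multiple of 40 ms, but it still maps
# cleanly to the nearest frame at 25 fps.
print(time_to_frame(49.357))  # -> 1234
```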

Thanks for the clarification. Is there any estimate on if/when BSL-1k will be released? Thank you.

BSL-1K will not be released; sorry for the outdated repository. We have released BOBSL instead and have reproduced, on BOBSL, the experiments from the papers where we had used BSL-1K.