google-deepmind/perception_test

Handled object annotations

samuele-ruffino opened this issue · 7 comments

Hello. For object tracking, I am mostly interested in the handled objects rather than all the objects in the scene. Have you structured the annotations in a way that makes it possible to get only the handled objects, instead of tracking the whole pool of objects?
Thank you!

Hi,

In the action and sound annotations we have a label 'parent_objects' which relates each action/sound to object track IDs. It should be possible to get handled objects only using this data.
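
For example, a minimal sketch for collecting the handled object track IDs of a video from its action annotations could look like the snippet below. Note that the exact key names ('action_segments', the per-video layout of the JSON file, the placeholder paths) are assumptions here and should be checked against the data format described in this repository; only 'parent_objects' is taken directly from the annotations.

```python
import json

def handled_object_ids(video_annotation: dict) -> set:
    """Collect the track IDs of all objects involved in some action."""
    handled = set()
    for segment in video_annotation.get("action_segments", []):  # assumed key name
        # 'parent_objects' lists the object track IDs involved in this action/sound.
        handled.update(segment.get("parent_objects", []))
    return handled

# Hypothetical usage: file name and video key are placeholders.
with open("valid_annotations.json") as f:
    annotations = json.load(f)
print(handled_object_ids(annotations["some_video_id"]))
```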

Does this answer your question? Thanks!

Great, that could work. Is the first one (since there might be multiple) always the handled object, or does the order not matter? Also, are the start frame IDs for object tracking and the action frame IDs temporally consistent? I would like to create subsequences for each handled object specifically.

viorik commented

All the object IDs that appear under 'parent_objects' for an action segment are involved in the corresponding action during the time interval when that action happens. For an action like "taking something out of something", there should be 2 parent objects corresponding to the 2 "somethings" in the action template name; for example, if a person takes a book out of a backpack, the action segment should have as parent_objects: id_book, id_backpack. We instructed the annotators to put the object IDs in the order in which they appear in the action name, but this might be a bit noisy.

The same applies for sounds.
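
As a rough sketch, and under the same assumptions about key names as above (in particular that each action segment stores its time interval under a 'timestamps' field as [start, end]), per-object subsequences could be built like this:

```python
def handled_object_subsequences(video_annotation: dict) -> list:
    """Pair each handled object with the interval of the action it takes part in.

    The 'action_segments', 'timestamps', and 'label' keys are assumed names;
    check the annotation format in this repository before relying on them.
    """
    subsequences = []
    for segment in video_annotation.get("action_segments", []):
        start, end = segment.get("timestamps", (None, None))
        for obj_id in segment.get("parent_objects", []):
            subsequences.append({
                "object_id": obj_id,  # track ID to cut from the SOT annotations
                "action_label": segment.get("label"),
                "start": start,
                "end": end,
            })
    return subsequences
```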

Hope this answers your question.

Ok great, that works! Maybe one last point: I am trying to use both the SOT and the action annotation files (valid set), but apparently some videos are missing from one task and vice versa. Should the two sets coincide, or can they differ? I would like to start from the same benchmark as the SOT challenge.

viorik commented

All videos have object tracks annotations. It is possible that some videos do not appear in the actions set because they don’t have action annotations. For example, there are some change detection videos where the camera is looking at a table with some objects on it, then the camera looks away, then comes back to the table after a few seconds. Some actions may have happened in the meantime involving the objects on the table, but they were not seen by the camera, hence no action annotations.

But for the SOT challenge specifically, is the challenge benchmark (sot_valid_annotations_challenge2023.zip, 1000 videos) a subset of the temporal action localisation one (challenge_action_localisation_valid_annotations.zip)? I would like to preprocess the SOT data based on the information I get from the corresponding action annotations, but I am not sure whether the two sets overlap or are different subsets of the valid set. Thank you!

viorik commented

The overlap is not guaranteed. The 1000 videos for the SOT challenge were sampled randomly from the entire validation set, independently of the action challenge. However, you can access the entire training and validation sets for SOT using this repository, and that way all action videos will have corresponding SOT videos.
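
If it helps, a quick way to check the overlap yourself is to compare the video IDs in the two annotation files. The sketch below assumes each file unzips to a single JSON dict keyed by video ID; the file names inside the zips may differ, so treat the paths as placeholders.

```python
import json

# Placeholders: adjust paths to the JSON files extracted from the two zips.
with open("sot_valid_annotations_challenge2023.json") as f:
    sot_videos = set(json.load(f).keys())
with open("challenge_action_localisation_valid_annotations.json") as f:
    action_videos = set(json.load(f).keys())

print(f"SOT challenge videos:       {len(sot_videos)}")
print(f"Action localisation videos: {len(action_videos)}")
print(f"Videos in both:             {len(sot_videos & action_videos)}")
```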