- SceneDetection quite slow for HD video's
- Optimize by downscaling video first
- Optimize by deleting the saved tensors
- Salsa dancing (full context needs to be known)
- See a table, difficult to know what its made of -> use sound to detect material
- See a child being attacked by a dog -> but when you hear the sound you know that they're playing
- Look at TimeSformer from facebook
- X-CLIP-BASE:
- Works great with short videos
- Works great if you want to find video
- Doesn't work great for long videos
- Doesn't work great to find a specific moment in a video
- Bottleneck: how to process long videos?
- Maybe use parallel processing
- Maybe use a different model
Given a video, we slice it into smaller videos based on the scene detection algorithm.
~ Q:
- What happens when the video is +1m long and there are no scene changes?
Given a scene, we extract the visual features from the video. When calculating the embedding we supply a list of 8 timestamps. These 8 timestamps resemble screenshots of the video. We're using the X-CLIP-BASE model to calculate the embedding based on the 8 screenshots. Given the embedding, we save it to the database.
Given a scene, we first detect the faces in the scene.