Master's thesis

Learnings

SceneDetection is quite slow for HD videos
Optimize by downscaling video first
Optimize by deleting the saved tensors
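
As a rough illustration of the downscaling idea, the sketch below computes a reduced frame size that keeps the aspect ratio before the video is handed to scene detection. The function name `downscaled_size` and the 360-pixel default are assumptions made for this example, not values taken from the project.

```python
from __future__ import annotations


def downscaled_size(width: int, height: int, max_height: int = 360) -> tuple[int, int]:
    """Return a (width, height) no taller than max_height, keeping the aspect ratio.

    Hypothetical helper: the 360 px default is an assumption, not a project setting.
    """
    if max_height <= 0:
        raise ValueError("max_height must be positive")
    if height <= max_height:
        # Already small enough; never upscale.
        return width, height
    scale = max_height / height
    new_width = int(round(width * scale))
    # Many video codecs expect even dimensions, so round the width down to an even value.
    new_width -= new_width % 2
    return new_width, max_height


if __name__ == "__main__":
    print(downscaled_size(1920, 1080))
```

The resulting size could then be passed to whichever resize step precedes scene detection; the scene boundaries found on the small copy can still be applied to the original video because the timestamps do not change.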

Difficult

Salsa dancing (full context needs to be known)
See a table, difficult to know what it's made of -> use sound to detect the material
See a child who seems to be attacked by a dog -> the sound reveals that they are just playing

Alternatives

Look at TimeSformer from Facebook

Learnings

X-CLIP-BASE:
- Works great with short videos
- Works great if you want to find a video
- Doesn't work well for long videos
- Doesn't work well for finding a specific moment in a video
- Bottleneck: how to process long videos?
  - Maybe use parallel processing
  - Maybe use a different model
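
The parallel-processing idea could look roughly like the sketch below, which embeds several scenes concurrently and returns the results in the original scene order. `embed_scenes_parallel` and its `embed_fn` argument are hypothetical names; `embed_fn` stands in for whatever function computes one scene's embedding. Threads are used only to keep the sketch short; whether they actually speed up model inference depends on the setup, and batching or a process pool may suit better.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

Scene = TypeVar("Scene")
Embedding = TypeVar("Embedding")


def embed_scenes_parallel(
    scenes: Sequence[Scene],
    embed_fn: Callable[[Scene], Embedding],
    max_workers: int = 4,
) -> List[Embedding]:
    """Apply embed_fn to every scene using a thread pool.

    embed_fn is a placeholder (assumption) for the real per-scene embedding call.
    Results come back in the same order as the input scenes.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order even if tasks finish out of order.
        return list(pool.map(embed_fn, scenes))


if __name__ == "__main__":
    # Toy stand-in for an embedding function: the length of the scene name.
    print(embed_scenes_parallel(["scene-1", "scene-02"], len))
```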

Workings

1. Uploading

Given a video, we slice it into smaller videos based on the scene detection algorithm.

~ Q:

What happens when the video is more than 1 minute long and there are no scene changes?
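
One possible answer, sketched here as an assumption rather than as the project's actual behaviour, is to fall back to fixed-length chunks whenever a detected scene exceeds a maximum duration. The name `split_long_scene` and the 60-second default are illustrative only.

```python
from __future__ import annotations


def split_long_scene(start: float, end: float, max_len: float = 60.0) -> list[tuple[float, float]]:
    """Split the interval [start, end) into consecutive chunks of at most max_len seconds.

    Hypothetical fallback for scenes without cuts; the 60 s default is an assumption.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    chunks = []
    t = start
    while t < end:
        chunk_end = min(t + max_len, end)
        chunks.append((t, chunk_end))
        t = chunk_end
    return chunks


if __name__ == "__main__":
    # A 150-second scene without cuts becomes three chunks.
    print(split_long_scene(0.0, 150.0))
```

A scene shorter than `max_len` comes back as a single chunk, so the same function can be applied to every detected scene without special cases.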

2. Indexing

Visual

Given a scene, we extract its visual features. To calculate the embedding, we supply a list of 8 timestamps, each of which corresponds to a frame (screenshot) of the scene. The X-CLIP-BASE model calculates the embedding from these 8 frames, and we save the resulting embedding to the database.
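
The timestamp selection could be sketched as follows. This assumes the 8 timestamps are spread evenly over the scene, taking the midpoint of each of 8 equal segments; the text above does not say how the timestamps are chosen, so the spacing and the name `sample_timestamps` are assumptions.

```python
from __future__ import annotations


def sample_timestamps(start: float, end: float, num_frames: int = 8) -> list[float]:
    """Return num_frames timestamps, one at the midpoint of each equal segment of [start, end].

    Even spacing is an assumption; the project may pick its timestamps differently.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    segment = (end - start) / num_frames
    # Midpoints avoid sampling exactly on the scene boundaries, where cuts or fades occur.
    return [start + segment * (i + 0.5) for i in range(num_frames)]


if __name__ == "__main__":
    print(sample_timestamps(0.0, 8.0))
```

The frames at these timestamps would then be decoded and passed to the model together as one 8-frame clip.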

Face

Given a scene, we first detect the faces in the scene.

michielswaanen/extendable-search
