vitrivr/vitrivr-engine

Extraction with multiple enumerators or multiple decoders

Closed this issue · 3 comments

For the use case of extracting ASR features at a coarser-grained level than clip features, it is useful to be able to define a pipeline where either:

  1. a video is decoded at a very fine-grained level, and the resulting segments are then grouped differently by differently configured transformers,
  2. a video is decoded twice, once at a fine-grained level and once at a coarse-grained level, or
  3. at the very least, the video file is enumerated twice, leading to two separate video source retrievables.
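
To make option 1 concrete, here is a minimal sketch of grouping fine-grained segments into coarser ones. It is plain Python rather than vitrivr-engine code: `Segment`, `group_segments` and `window_ns` are hypothetical names used only for illustration, and a real transformer would operate on retrievables and their temporal metadata descriptors instead.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a decoded segment with temporal metadata."""
    start_ns: int
    end_ns: int


def group_segments(segments: Iterable[Segment], window_ns: int) -> List[Segment]:
    """Merge consecutive fine-grained segments until each group spans at least window_ns."""
    grouped: List[Segment] = []
    current = None
    for seg in segments:
        # Extend the running group, or start a new one.
        current = seg if current is None else Segment(current.start_ns, seg.end_ns)
        if current.end_ns - current.start_ns >= window_ns:
            grouped.append(current)
            current = None
    if current is not None:
        # Keep the trailing, shorter group instead of dropping it.
        grouped.append(current)
    return grouped


SECOND = 1_000_000_000
# Seven one-second segments, as a fine-grained decoder might produce.
fine = [Segment(i * SECOND, (i + 1) * SECOND) for i in range(7)]
# Group them into roughly three-second segments, e.g. for ASR.
coarse = group_segments(fine, window_ns=3 * SECOND)
```

Two such groupers with different `window_ns` values could consume the same decoded stream, which is why this variant only needs a single decoding pass.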

My impression is that option 1 may be more useful down the line. Options 2 and 3 should be quite easy to implement, but they are currently not fully supported (or at least I haven't figured out how to implement them).

video-multiple-decoders.json
This pipeline decodes videos twice and properly creates temporal metadata descriptors and segments of different granularity; however, it does not persist anything. When I remove "long-decoder-stage" from the inputs of "time-stage", the pipeline uses only a single decoder and persists everything properly.

Digging into this: in line 60 of IngestionPipelineBuilder (commit ce53093), the enumerator is not checked for multiple outputs (as is done for other operators in line 111) and is therefore not wrapped in a broadcast operator when necessary. A simple fix (checking and wrapping) does not work, because the decoder expects an Enumerator as input and a BroadcastOperator is not an Enumerator.
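
The type mismatch can be illustrated with a small sketch. This is not the actual vitrivr-engine code: the classes below are simplified Python stand-ins that only borrow the names Enumerator, BroadcastOperator and decoder from the description above, and `wrap_if_shared` is a hypothetical helper mimicking the "check and wrap" step.

```python
class Operator:
    """Hypothetical base type for all pipeline operators."""


class Enumerator(Operator):
    """Stand-in for an operator that enumerates source files."""


class BroadcastOperator(Operator):
    """Stand-in for a wrapper that fans one operator out to several consumers."""

    def __init__(self, source: Operator):
        self.source = source


class Decoder(Operator):
    """Stand-in for a decoder that insists on an Enumerator as its input."""

    def __init__(self, source: Operator):
        if not isinstance(source, Enumerator):
            raise TypeError(f"Decoder expects an Enumerator, got {type(source).__name__}")
        self.source = source


def wrap_if_shared(op: Operator, consumer_count: int) -> Operator:
    """Wrap an operator in a broadcast if more than one consumer reads from it."""
    return BroadcastOperator(op) if consumer_count > 1 else op


# One consumer: the enumerator is passed through unchanged and the decoder accepts it.
single = Decoder(wrap_if_shared(Enumerator(), consumer_count=1))

# Two consumers: the enumerator gets wrapped, and the decoder rejects the wrapper.
shared = wrap_if_shared(Enumerator(), consumer_count=2)
try:
    Decoder(shared)
    error = None
except TypeError as exc:
    error = str(exc)
```

In this simplified model, the naive fix would only work if the broadcast wrapper itself counted as an Enumerator, or if the decoder accepted a more general input type. Whether either is appropriate for the real builder is a design question this sketch does not answer.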

Point 3, using multiple enumerators, also doesn't seem to work.
video-multiple-enumerators.json
This pipeline fails with the error "Dangling operators are not supported".

As a minor side note: it would make sense for file metadata extraction to work immediately after enumeration, but currently a decoding stage seems to be necessary. This is probably not a big issue in practice, though.

Only option 1 sounds reasonable to me. Generally, having multiple enumerators does not make a lot of sense. Having multiple decoders only makes sense if you have a mixed collection with multiple media types. Decoding the same document multiple times will always be less efficient than decoding it once, so that is what should be done whenever possible.

Generally, I agree. @ppanopticon mentioned today that points 2 and 3 are probably supported and would be a good interim solution; this issue was also intended as a reply to that. Nevertheless, silently failing without persisting anything is strange behaviour.

Update: it turns out that with video-multiple-decoders.json I was incorrectly using COMBINE when I should have used MERGE. Switching to MERGE resolves this issue.
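
One plausible reading of why this helps, sketched under an explicit assumption that the thread itself does not confirm: COMBINE emits an item only once it has arrived on every input, whereas MERGE forwards everything from every input. The functions `merge` and `combine` below are simplified, hypothetical Python stand-ins for those two modes, not the engine's operators.

```python
from collections import Counter
from typing import Iterable, Iterator


def merge(*inputs: Iterable[str]) -> Iterator[str]:
    """Assumed MERGE semantics: forward every item from every input (ordering simplified)."""
    for stream in inputs:
        yield from stream


def combine(*inputs: Iterable[str]) -> Iterator[str]:
    """Assumed COMBINE semantics: emit an item only once it was seen on all inputs.

    Items are assumed to be unique within each input.
    """
    seen: Counter = Counter()
    for stream in inputs:
        for item in stream:
            seen[item] += 1
            if seen[item] == len(inputs):
                yield item


# Two decoders produce disjoint segments of different granularity.
short_segments = ["short-0", "short-1", "short-2"]
long_segments = ["long-0"]

merged = list(merge(short_segments, long_segments))
combined = list(combine(short_segments, long_segments))
```

Under that assumption, two decoders never produce the same segment, so the combining variant has nothing to emit and nothing reaches persistence, while the merging variant passes all four segments on.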