keras-team/keras-cv

Support `CutMix` for video data

Closed this issue · 8 comments

Short Description

Same as 2D CutMix, but the requested feature is for 3D volumes such as video.

Papers

https://arxiv.org/abs/1905.04899

Existing Implementations

Other Information

This should theoretically be supported already, since we allow any tensor of shape (..., H, W, C). I am guessing videos are just sequences of image frames with shape (B, NUM_FRAMES, H, W, C). I am not sure how consistent this support is across all the preprocessing layers, but IMO it should be an easy fix even if it's broken.

I can see an argument for temporal consistency across frames when preprocessing but that seems like too much of a stretch from what KerasCV is designed to do. If we can treat frames independently, it would be much easier to add/advertise support for video data.
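To make the distinction above concrete, here is a minimal NumPy sketch (not KerasCV code) contrasting the two options for a random horizontal flip on a single video of shape (F, H, W, C). The function names `random_flip_per_frame` and `random_flip_per_video` are hypothetical and exist only for this illustration.

```python
import numpy as np


def random_flip_per_frame(video, rng):
    """Draw a separate flip decision for every frame (frames treated independently)."""
    flips = rng.random(video.shape[0]) < 0.5
    out = video.copy()
    # Flip only the selected frames along the width axis.
    out[flips] = video[flips, :, ::-1, :]
    return out, flips


def random_flip_per_video(video, rng):
    """Draw one flip decision and apply it to all frames (temporally consistent)."""
    flip = rng.random() < 0.5
    out = video[:, :, ::-1, :].copy() if flip else video.copy()
    return out, flip


rng = np.random.default_rng(0)
video = rng.standard_normal((6, 8, 8, 3)).astype(np.float32)
per_frame, frame_flips = random_flip_per_frame(video, rng)
per_video, video_flip = random_flip_per_video(video, rng)
```

With the per-frame version, neighbouring frames can end up mirrored relative to each other, which is the kind of inconsistency a video-aware layer would need to avoid.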

The above GIF was generated after adjusting some of the computation in the image CutMix. If it's wanted, we can send a draft PR for evaluation.

I don't have a strong opinion. At first glance, though, I'd be against adding this, simply because you can always reshape the input tensors to make them work for videos:

import numpy as np
from keras import ops
import keras_cv

# Two videos of five 256x256 RGB frames each; labels has shape (B, F).
videos = np.random.standard_normal((2, 5, 256, 256, 3)).astype(np.float32)
labels = ((np.random.random((2, 5)) > 0.5) * 1.).astype(np.float32)

# Fold the frame axis into the batch axis so every frame is treated as an image.
B, F, H, W, C = videos.shape
images = ops.reshape(videos, (B * F, H, W, C))
# Note the trailing comma: (B * F) alone is a plain int, not a shape tuple.
labels = ops.reshape(labels, (B * F,))
augmented = keras_cv.layers.CutMix()({"images": images, "labels": labels})
augmented = augmented["images"]

# Restore the (B, F, H, W, C) video layout.
augmented = ops.reshape(augmented, (B, F, H, W, C))

augmented  # augmented videos.

Does this work for your use case @innat?

I tried the reshaping approach at first. There are a couple of issues with the code example above.

First, it is limited because it requires num_frames == num_classes. Second, it introduces complexity for the augmented labels. Third, video_a ends up being cutmixed with frames from many different video samples at a given timestep, which breaks temporal consistency, IMO. Instead, cutmixing video_a with video_b makes more sense to me. (The same goes for MixUp.)
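As a rough illustration of what cutmixing video_a with video_b could look like, here is a NumPy sketch that pairs whole videos and pastes the same box into every frame. It is not the draft implementation mentioned earlier and not KerasCV's API; the function name `cutmix_videos` and its parameters are assumptions made for this example, and labels are assumed to have shape (B, num_classes).

```python
import numpy as np


def cutmix_videos(videos, labels, alpha=1.0, rng=None):
    """Mix each video with another video from the batch using one shared box.

    videos: float array of shape (B, F, H, W, C)
    labels: float array of shape (B, num_classes), one-hot or soft labels
    """
    rng = np.random.default_rng() if rng is None else rng
    B, F, H, W, C = videos.shape

    # Partner video for each sample (a random permutation of the batch).
    perm = rng.permutation(B)

    mixed = videos.copy()
    lam = np.empty(B, dtype=np.float32)
    for i in range(B):
        # Side-length ratio sqrt(1 - lambda), with lambda ~ Beta(alpha, alpha),
        # following the CutMix paper.
        ratio = np.sqrt(1.0 - rng.beta(alpha, alpha))
        cut_h, cut_w = int(H * ratio), int(W * ratio)
        cy, cx = rng.integers(H), rng.integers(W)
        y0, y1 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, H)
        x0, x1 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, W)

        # The same box is pasted into every frame of the video.
        mixed[i, :, y0:y1, x0:x1, :] = videos[perm[i], :, y0:y1, x0:x1, :]

        # Recompute the mixing weight from the clipped box area.
        lam[i] = 1.0 - (y1 - y0) * (x1 - x0) / (H * W)

    # One label mix per video, weighted by the area kept from the original.
    mixed_labels = lam[:, None] * labels + (1.0 - lam[:, None]) * labels[perm]
    return mixed, mixed_labels


videos = np.random.standard_normal((4, 5, 64, 64, 3)).astype(np.float32)
labels = np.eye(3, dtype=np.float32)[[0, 1, 2, 0]]
mixed, mixed_labels = cutmix_videos(videos, labels)
```

Because labels are mixed once per video here, the number of classes is independent of the number of frames, and every frame of a video shares the same partner video and the same box.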

Again, no strong opinion. If you have a diff, feel free to propose it. It would also be nice to first identify the layers where videos need to be treated differently, like CutMix and MixUp. If you have a list, that'd be really helpful.

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.