Support `CutMix` for video data

Question

Closed this issue 2 months ago · 8 comments

innat commented 3 months ago

Short Description

This is a request for the same augmentation as 2D CutMix, extended to 3D volumes (video data).

Papers

https://arxiv.org/abs/1905.04899

Existing Implementations

Other Information

Answer 1 · 2024-03-18T16:56:27.000Z

This should theoretically be supported, since we allow any tensor of shape (..., H, W, C). I am guessing videos are just stacks of image frames with shape (B, NUM_FRAMES, H, W, C). I'm not sure how consistent this support is across all the preprocessing layers, but IMO it should be an easy fix even if it's broken.

Answer 2 · 2024-03-18T16:58:10.000Z

I can see an argument for temporal consistency across frames when preprocessing but that seems like too much of a stretch from what KerasCV is designed to do. If we can treat frames independently, it would be much easier to add/advertise support for video data.
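To make the distinction concrete, here is a minimal NumPy sketch of the two options for the cut region. It is not KerasCV code: the helper names `box_mask`, `per_frame_masks`, and `shared_mask` are made up for illustration, and the box has a fixed size for simplicity rather than being sampled as in the paper.

```python
import numpy as np

def box_mask(h, w, box_h, box_w, rng):
    """Return an (h, w) boolean mask that is True inside one random box."""
    top = rng.integers(0, h - box_h + 1)
    left = rng.integers(0, w - box_w + 1)
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + box_h, left:left + box_w] = True
    return mask

def per_frame_masks(num_frames, h, w, box_h, box_w, rng):
    """Independent treatment: every frame draws its own box."""
    return np.stack(
        [box_mask(h, w, box_h, box_w, rng) for _ in range(num_frames)]
    )

def shared_mask(num_frames, h, w, box_h, box_w, rng):
    """Temporally consistent treatment: one box reused for all frames."""
    mask = box_mask(h, w, box_h, box_w, rng)
    return np.broadcast_to(mask, (num_frames, h, w))

rng = np.random.default_rng(0)
independent = per_frame_masks(5, 32, 32, 8, 8, rng)  # box may jump per frame
consistent = shared_mask(5, 32, 32, 8, 8, rng)       # box stays in place
```

With independent frames the pasted patch can move from frame to frame; with a shared mask it stays fixed for the whole clip.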

Answer 3 · 2024-03-18T18:07:55.000Z

The GIF above was generated after adjusting some of the computation in the image CutMix layer. If this is wanted, we can send a draft PR for evaluation.

Answer 4 · 2024-03-18T18:28:31.000Z

I don't have a strong opinion. At first glance, though, I'd be against adding this, simply because you can always reshape the input tensors to make them work for videos:

import numpy as np
from keras import ops
import keras_cv

# Dummy batch: 2 videos, 5 frames each, 256x256 RGB.
videos = np.random.standard_normal((2, 5, 256, 256, 3)).astype(np.float32)
# One float label per frame, shape (B, F).
labels = ((np.random.random((2, 5)) > 0.5) * 1.).astype(np.float32)

B, F, H, W, C = videos.shape
# Fold the frame axis into the batch axis so every frame is treated as an image.
images = ops.reshape(videos, (B * F, H, W, C))
labels = ops.reshape(labels, (B * F,))  # shape must be a tuple, hence the comma
augmented = keras_cv.layers.CutMix()({"images": images, "labels": labels})
augmented = augmented["images"]
# Unfold back to the (B, F, H, W, C) video layout.
augmented = ops.reshape(augmented, (B, F, H, W, C))

augmented  # augmented videos.

Does this work for your use case @innat?

Answer 5 · 2024-03-18T19:16:20.000Z

I tried the reshaping approach at first. There are a couple of issues with the code example above.

First, it is limited because it requires num_frames == num_classes. Second, it introduces complexity for the augmented labels. Third, each frame of video_a is cut-mixed with frames from many different video samples, which breaks temporal consistency, IMO. Instead, cut-mixing video_a with video_b as whole clips makes more sense to me. (The same goes for MixUp.)
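A minimal NumPy sketch of the clip-level behaviour described here, assuming videos of shape (B, F, H, W, C) and per-video labels of shape (B, num_classes). This is not the KerasCV layer and not the draft mentioned earlier: `cutmix_videos` is a hypothetical helper, and the box sampling follows the scheme from the CutMix paper (side lengths scaled by sqrt(1 - lam), box clipped to the frame, lam recomputed from the clipped area), with one box per batch.

```python
import numpy as np

def cutmix_videos(videos, labels, alpha=1.0, rng=None):
    """CutMix whole clips: one spatial box, pasted into every frame.

    videos: (B, F, H, W, C) float array.
    labels: (B, num_classes) one-hot or soft labels.
    """
    rng = np.random.default_rng() if rng is None else rng
    b, f, h, w, c = videos.shape

    # Pair each video_a with a video_b by shuffling the batch.
    perm = rng.permutation(b)

    # Mixing ratio and box size, following the CutMix paper.
    lam = rng.beta(alpha, alpha)
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y1, y2 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, h)
    x1, x2 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, w)

    mixed = videos.copy()
    # The same box is replaced in all F frames, so the patch does not move.
    mixed[:, :, y1:y2, x1:x2, :] = videos[perm][:, :, y1:y2, x1:x2, :]

    # Recompute lam from the clipped box so labels match the pixel ratio.
    lam = 1.0 - float((y2 - y1) * (x2 - x1)) / (h * w)
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed, mixed_labels

rng = np.random.default_rng(0)
videos = rng.standard_normal((4, 5, 32, 32, 3)).astype(np.float32)
labels = np.eye(10, dtype=np.float32)[rng.integers(0, 10, size=4)]
mixed, mixed_labels = cutmix_videos(videos, labels, rng=rng)
```

Because the labels stay per video, num_classes is independent of the number of frames, and no per-frame label bookkeeping is needed.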

Answer 6 · 2024-03-18T19:23:44.000Z

Again, no strong opinion. If you have a diff, feel free to propose it. It would also be nice to first identify the layers where videos need to be treated differently, like CutMix and MixUp. If you have a list, that'd be really helpful.
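For MixUp, one way "treated differently" could look is a single mixing weight per video that is shared by all of its frames. The sketch below is plain NumPy under that assumption, not KerasCV code; `mixup_videos` and its `alpha` default are hypothetical.

```python
import numpy as np

def mixup_videos(videos, labels, alpha=0.2, rng=None):
    """MixUp whole clips: one mixing weight per video, shared by all frames.

    videos: (B, F, H, W, C) float array.
    labels: (B, num_classes) one-hot or soft labels.
    """
    rng = np.random.default_rng() if rng is None else rng
    b = videos.shape[0]

    # Pair each video with a partner by shuffling the batch.
    perm = rng.permutation(b)

    # One Beta-distributed weight per video.
    lam = rng.beta(alpha, alpha, size=b)

    # Reshape to (B, 1, 1, 1, 1) so the weight broadcasts over F, H, W, C.
    lam_v = lam.reshape(b, 1, 1, 1, 1).astype(videos.dtype)
    mixed = lam_v * videos + (1.0 - lam_v) * videos[perm]

    # The same weight mixes the per-video labels.
    lam_l = lam.reshape(b, 1).astype(labels.dtype)
    mixed_labels = lam_l * labels + (1.0 - lam_l) * labels[perm]
    return mixed, mixed_labels

rng = np.random.default_rng(0)
videos = rng.standard_normal((4, 5, 32, 32, 3)).astype(np.float32)
labels = np.eye(10, dtype=np.float32)[rng.integers(0, 10, size=4)]
mixed, mixed_labels = mixup_videos(videos, labels, rng=rng)
```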

Answer 7 · 2024-04-02T01:48:39.000Z

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

Answer 8 · 2024-04-17T01:48:21.000Z

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.