asteroid-team/torch-audiomentations

Process (various kinds of) target data similarly to input data

iver56 opened this issue · 14 comments

The idea is that whatever perturbations are applied to the input are reflected in the target data

Scenarios:

| Input | Target |
| --- | --- |
| Audio | Class(es), fixed length, not correlated with the input length. Low priority. |
| Audio | Time series of class(es), e.g. predict a class for each 25 ms frame with a 10 ms hop size. Medium priority. |
| Audio | Audio (same length as the input). This one has the highest priority. |

Nice! This is definitely THE feature that would make me switch to torch-audiomentations, and more specifically the one with target "Time series of class(es)", which would come in very handy for my work on speaker diarization.

We should make examples of what the API should look like in each scenario

Yeah, I agree that this would be super cool, but what would it look like? @hbredin do you have API ideas in mind?

The last one, "Audio (length same as the input)" is supported already by doing this:

augment = ExampleTransform(...)

augmented_samples = augment(samples, sample_rate)
augment.freeze_parameters()  # reuse the randomized parameters from the call above
augmented_targets = augment(other_samples, sample_rate)  # same perturbation, applied to the target audio
augment.unfreeze_parameters()

We can refine the API for this scenario later
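To make the freeze/unfreeze pattern concrete, here is a minimal toy sketch (the class and parameter names are hypothetical, not the actual torch-audiomentations implementation): the transform draws a random gain per call, but while frozen it reuses the last drawn gain, so a second tensor gets the exact same perturbation.

```python
import random


class ToyGainTransform:
    """Illustrative transform: scales samples by a random gain.

    While frozen, the last drawn gain is reused, so the exact same
    perturbation can be applied to a second signal (e.g. target audio).
    """

    def __init__(self, min_gain=0.5, max_gain=2.0):
        self.min_gain = min_gain
        self.max_gain = max_gain
        self.frozen = False
        self.gain = None

    def __call__(self, samples, sample_rate):
        if not self.frozen or self.gain is None:
            self.gain = random.uniform(self.min_gain, self.max_gain)
        return [s * self.gain for s in samples]

    def freeze_parameters(self):
        self.frozen = True

    def unfreeze_parameters(self):
        self.frozen = False


augment = ToyGainTransform()
augmented_input = augment([1.0, -1.0], sample_rate=16000)
augment.freeze_parameters()
augmented_target = augment([0.5, -0.5], sample_rate=16000)  # same gain as above
augment.unfreeze_parameters()
```

Because the gain is frozen between the two calls, the input and the target end up scaled by the same factor.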

I am currently working on an augmentation technique that basically sums two samples selected randomly from within a batch, for training end-to-end speaker diarization models. This falls under the "Time series of class(es)" scenario.

What I am missing is a way to keep track of targets (in my case, frame-wise speaker activations). The API would look like this:

augment = SumTwoSamples(min_snr_in_db=0.0, max_snr_in_db=5.0)
augmented_samples, augmented_targets = augment(samples, sample_rate, targets=targets, target_rate=target_rate)

where

  • samples has shape (batch_size, [num_channels,] num_samples)
  • targets has shape (batch_size, [num_channels,] num_frames, num_classes)
  • augmented_samples has shape (new_batch_size, [num_channels,] num_samples)
  • augmented_targets has shape (new_batch_size, [num_channels,] num_frames, num_classes)
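A summing transform along these lines could be sketched as follows. This is a simplified, hypothetical version (plain Python lists instead of tensors, no SNR computation): each example is mixed with a randomly chosen partner from the same batch, and the frame-wise targets are combined with an element-wise maximum, i.e. a speaker counts as active if it is active in either source.

```python
import random


def sum_two_samples(samples, targets):
    """Illustrative sketch, not the actual SumTwoSamples implementation.

    samples: (batch_size, num_samples) as nested lists
    targets: (batch_size, num_frames, num_classes) as nested lists

    Each example is summed with a random partner from the batch (a partner
    may be the example itself in this simple sketch); targets are merged
    with an element-wise max.
    """
    batch_size = len(samples)
    partners = list(range(batch_size))
    random.shuffle(partners)
    mixed_samples, mixed_targets = [], []
    for i, j in enumerate(partners):
        mixed_samples.append([a + b for a, b in zip(samples[i], samples[j])])
        mixed_targets.append([
            [max(a, b) for a, b in zip(frame_i, frame_j)]
            for frame_i, frame_j in zip(targets[i], targets[j])
        ])
    return mixed_samples, mixed_targets


samples = [[0.1, 0.2], [0.3, 0.4]]
targets = [[[1, 0]], [[0, 1]]]  # one frame, two speaker classes
mixed_samples, mixed_targets = sum_two_samples(samples, targets)
```

The output shapes match the input shapes (batch size is preserved here, although the real transform could also produce a new_batch_size).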

I am trying to somehow bend BaseWaveformTransform so that it accepts these new target* optional arguments but it would probably make sense to create a new dedicated base class (e.g. BaseWaveformTransformWithTargets) for that purpose.

In particular, it is not clear to me what per_samples, per_channel, per_batch would mean for such transforms.

What are your thoughts?

cc @FrenchKrab

Interesting!

I get a feeling that it could blow up in complexity if we want to cover all use cases in one class. Maybe it makes sense to make separate "variants" of each transform? Or should we try to handle the various cases in one class?

I had a quick look at your draft PR, and my first thought was that it introduces breaking changes in all transforms (they return a tuple instead of a tensor), which led me to check how computer vision data augmentation libraries do it. For example, albumentations can process a target (image/mask, bounding boxes or a set of points) in the same way as the input image. In albumentations, each transform returns a dictionary, which leaves some flexibility in what is included in the returned value. Maybe that's a useful pattern to adopt here?

# basic example without target
transformed = transform(image=image)
transformed_image = transformed["image"]

# basic example with a target that is a mask image
transformed = transform(image=image, mask=mask)
transformed_image = transformed["image"]
transformed_mask = transformed["mask"]

In particular, it is not clear to me what per_samples, per_channel, per_batch would mean for such transforms.

Could you elaborate on what feels unclear?

Anyway, I think it would make sense to first design the API, then implement it for a simple transform, like Shift or TimeInversion. Also, if we're going to make backwards-incompatible (breaking) changes, it would be nice to release a version with future warnings first, even though we're still in alpha 🤓

I get a feeling that it could blow up in complexity if we want to cover all use cases in one class. Maybe it makes sense to make separate "variants" of each transform? Or should we try to handle the various cases in one class?

We should probably start by listing the type of targets used in audio.
I personally can't think of any that cannot be shaped as one of the three you mentioned in the original issue.

I had a quick look at your draft PR, and my first thought was that it introduces breaking changes in all transforms (they return a tuple instead of a tensor),

Actually, I made sure not to break the user-facing API -- only the internal methods apply_transform and randomize_parameters have changed. I only had to change a few unit tests that were calling apply_transform directly. Any tests that used the augment(samples, sample_rate) API passed.

It only returns a tuple when the user passes the additional target=... argument.

which then again led me to check out how computer vision data augmentation libs do it. For example, albumentations has the power to process a target (image/mask, bounding boxes or set of points) in the same way as the input image. In albumentations each transform returns a dictionary, which leaves some flexibility in what is included in the returned value. Maybe that's a useful pattern to adopt here?

It would also mean breaking the API right? I really like the current simple perturbed_samples = augment(samples).

In particular, it is not clear to me what per_samples, per_channel, per_batch would mean for such transforms.

Could you elaborate on what feels unclear?

I was referring to the original transform that made me think about this PR: SumTwoSamples.
What would per_batch mean for such a transform? It would return samples unchanged, right?

Anyway, I think it would make sense to first design the API, then implement it for a simple transform, like Shift or TimeInversion. Also, if we're going to make backwards-incompatible (breaking) changes, it would be nice to release a version with future warnings first, even though we're still in alpha 🤓

As stated a few lines above, PR #123 does not make breaking changes, as in: "code that relies on the documented user-facing API will still work". Custom augmentations would need to be updated though -- unless we add a supports_target attribute that defaults to False and make sure to honor it in BaseWaveformTransform.forward.

PR #123 does implement target support for TimeInversion.
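The supports_target idea could look roughly like this. This is a hypothetical sketch (class and method names simplified, not the actual BaseWaveformTransform): a legacy transform that predates target support keeps working because the base class passes its targets through unchanged.

```python
class BaseTransformSketch:
    """Hypothetical sketch of a base class honoring a supports_target flag."""

    supports_target = False  # legacy custom transforms inherit this default

    def forward(self, samples, sample_rate, targets=None, target_rate=None):
        if targets is None:
            return self.apply_transform(samples, sample_rate)
        if not self.supports_target:
            # Legacy transform: augment the audio, pass targets through untouched.
            return self.apply_transform(samples, sample_rate), targets
        return self.apply_transform_with_targets(
            samples, sample_rate, targets, target_rate
        )

    def apply_transform(self, samples, sample_rate):
        raise NotImplementedError

    def apply_transform_with_targets(self, samples, sample_rate, targets, target_rate):
        raise NotImplementedError


class LegacyInvert(BaseTransformSketch):
    """A pre-existing custom transform that knows nothing about targets."""

    def apply_transform(self, samples, sample_rate):
        return [-s for s in samples]


out = LegacyInvert().forward([1.0, -2.0], 16000)
out_with_targets = LegacyInvert().forward([1.0, -2.0], 16000, targets=[[0, 1]])
```

With this pattern, existing subclasses need no changes; only transforms that opt in via supports_target = True have to implement target handling.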

We should probably start by listing the type of targets used in audio. I personally can't think of any that cannot be shaped as one of the three you mentioned in the original issue.

Agreed. So it comes down to the shape of the time series data then. I don't think I have any better suggestion than the shape you suggested in your PR.

Actually, I made sure not to break the user facing API

Good! I haven't looked at it in detail yet ^^ I might have time for that on friday. Thanks for your patience :)

It only returns a tuple when the user passes the additional target=... argument.

So the output type is either tensor or tuple, based on the parameters given? I'm not sure if this is a good pattern. Do you know of other python libraries that do it like that? Usually functions give the same output type (or sometimes None) regardless of the inputs.
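One concrete cost of an input-dependent return type is type hinting: expressing it precisely requires typing.overload. A minimal sketch (the transform here is a stand-in placeholder, not a real library function):

```python
from typing import List, Optional, Tuple, overload

Tensor = List[float]  # stand-in for an audio tensor in this sketch
Targets = List[List[float]]


@overload
def transform(samples: Tensor, sample_rate: int) -> Tensor: ...
@overload
def transform(samples: Tensor, sample_rate: int, targets: Targets) -> Tuple[Tensor, Targets]: ...


def transform(samples, sample_rate, targets=None):
    # Placeholder perturbation: halve the amplitude.
    augmented = [s * 0.5 for s in samples]
    if targets is None:
        return augmented
    return augmented, targets


only_audio = transform([1.0], 16000)
audio_and_targets = transform([1.0], 16000, [[0.0]])
```

Without the overloads, every caller would get a union return type and need an isinstance check or a cast, which hints at why a consistent return type is easier to work with.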

It seems imgaug uses this pattern (it returns a tuple if a target is provided):

import imgaug.augmenters as iaa

seq = iaa.Sequential([
    iaa.Crop(px=(0, 16)),  # crop images from each side by 0 to 16px (randomly chosen)
    iaa.Fliplr(0.5),  # horizontally flip 50% of the images
    iaa.GaussianBlur(sigma=(0, 3.0))  # blur images with a sigma of 0 to 3.0
])

for batch_idx in range(1000):
    # 'images' should be either a 4D numpy array of shape (N, height, width, channels)
    # or a list of 3D numpy arrays, each having shape (height, width, channels).
    # Grayscale images must have shape (height, width, 1) each.
    # All images must have numpy's dtype uint8. Values are expected to be in
    # range 0-255.
    images = load_batch(batch_idx)
    images_aug = seq(images=images)

# Second example: augment bounding boxes together with images.
seq = iaa.Sequential([
    iaa.Multiply((1.2, 1.5)),  # change brightness, doesn't affect BBs
    iaa.Affine(
        translate_px={"x": 40, "y": 60},
        scale=(0.5, 0.7)
    )  # translate by 40/60px on x/y axis, and scale to 50-70%, affects BBs
])

# Augment BBs and images.
image_aug, bbs_aug = seq(image=image, bounding_boxes=bbs)

albumentations, on the other hand, always returns a dict.

Maybe we should post a vote on those two options in Slack?

Option 1: Always return a dict

Inspired by albumentations - 9.9k stars, actively maintained

  • pro: Consistent output type regardless of input. Makes type hinting and code changes easier.
  • pro: Gives more flexibility in what we can include in the output. For example, the output sample rate could be included without a breaking API change
  • con: More lines of code
  • con: When we change our outputs to dict, it will be a breaking change in the user-facing API

# Example without target
transform = Shift()
transformed_audio = transform(my_audio, sample_rate)["audio"]

# Example with target
transform = Shift()
transformed = transform(my_audio, sample_rate, target=my_target, target_rate=my_target_rate)
transformed_audio = transformed["audio"]
transformed_target = transformed["target"]

Option 2: Return tensor or tuple depending on input

Inspired by imgaug - 12.4k stars, abandoned project

  • pro: Fewer lines of code (but some lines are longer)
  • pro: No breaking change in the user-facing API when we add support for transforming targets
  • con: Less flexibility in what we can include in the output in the future without breaking changes
  • con: Return type depends on the input. This could be a source of bugs in users' code.

# Example without target
transform = Shift()
transformed_audio = transform(my_audio, sample_rate)

# Example with target
transform = Shift()
transformed_audio, transformed_target = transform(
    my_audio, sample_rate, target=my_target, target_rate=my_target_rate
)

Voted!

I posted it in asteroid and thesoundofai. At the moment, all 5 votes are in favor of dict. @mpariente suggested that we add support for attribute access.

If I were to conclude early, I'd say it's best to return an ObjectDict instead of a tuple or a tensor. ObjectDict behaves like a dict but also supports attribute access, so the two following lines would give the same result:

hello = my_object_dict.my_property
hello = my_object_dict["my_property"]
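A minimal sketch of such an ObjectDict (the actual implementation in the PR may differ):

```python
class ObjectDict(dict):
    """A dict whose items are also reachable as attributes (illustrative sketch)."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def __setattr__(self, name, value):
        self[name] = value


result = ObjectDict(samples=[0.1, 0.2], sample_rate=16000)
assert result.samples is result["samples"]  # attribute and key access agree
result.targets = [[1, 0]]  # attribute assignment also lands in the dict
```

Since it subclasses dict, existing code that indexes the result with ["..."] keeps working, while new code can use the shorter attribute form.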

Adding a PR for that here