asteroid-team/torch-audiomentations

GPU memory issues when composing some of the waveform augmentations

luisfvc opened this issue · 12 comments

Hi, I have been experiencing some memory problems when using some of the transforms on the GPU. When I apply the low-pass or high-pass filter, the memory usage of my GPU increases with each training iteration. And since I updated from v0.9.0 to the latest release, the same happens with the impulse response transform. This does not happen when I compose other transforms, like polarity inversion, gain, noise or pitch shift.

Any ideas on why this is happening? I went through the package source code but couldn't spot any bug.
Thanks & regards

Hi. That's curious!
I haven't noticed this issue myself, and I use LPF and HPF in some of my own training scripts.
I don't have any idea why this is happening at the moment. If you can create a minimal script that reproduces the problem, that would be helpful 👍

The impulse response transform had almost no changes between 0.9.0 and 0.10.1 🤔

Do you init your transforms once and then use them many times or do you init them every time you need to run them?
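To be concrete about the difference, here is a minimal sketch of the two patterns I mean (the transform choice and parameters are arbitrary examples, not taken from anyone's script):

import torch
from torch_audiomentations import LowPassFilter

# Pattern A: init once, reuse the same instance for every batch
lpf = LowPassFilter(min_cutoff_freq=2000.0, max_cutoff_freq=8000.0, p=1.0)

def augment_reusing_instance(batch: torch.Tensor, sample_rate: int) -> torch.Tensor:
    # batch is expected to have shape (batch_size, num_channels, num_samples)
    return lpf(batch, sample_rate=sample_rate)

# Pattern B: construct a fresh transform every time it is needed
def augment_with_fresh_instance(batch: torch.Tensor, sample_rate: int) -> torch.Tensor:
    transform = LowPassFilter(min_cutoff_freq=2000.0, max_cutoff_freq=8000.0, p=1.0)
    return transform(batch, sample_rate=sample_rate)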

Hi! I'm running into a similar problem, but only when training on multiple GPUs. I use PyTorch Lightning. It'll take some time, but I will try to create a script to reproduce the problem. Are you also using multiple GPUs @luisfvc ?

Hi, I've noticed the same problem with HPF and LPF. I'm only training on a single GPU, but found that it only occurs if I'm using multiprocessing in my dataloader (i.e. num_workers > 0). Could it be related to pytorch/pytorch#13246 (comment)? That's what I thought I was debugging until I realized these filters were the real culprit

Thanks, that comment helps us get closer to reproducing the bug

> Hi, I've noticed the same problem with HPF and LPF. I'm only training on a single GPU, but found that it only occurs if I'm using multiprocessing in my dataloader (i.e. num_workers > 0). Could it be related to pytorch/pytorch#13246 (comment)? That's what I thought I was debugging until I realized these filters were the real culprit

I have the exact same experience and had to set num_workers=0 when using torch-audiomentations. Curious if anyone has found a better solution?

Thanks RoyJames :) Just so I understand your way of using torch-audiomentations, I'd like to know:

Did you run the transforms on CPU (in each data loader worker)? And did you train the ML model on GPU?
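In other words, something like this (a rough sketch; the transform, shapes and sample rate are placeholders):

import torch
from torch_audiomentations import LowPassFilter

SAMPLE_RATE = 16000  # placeholder
lpf = LowPassFilter(min_cutoff_freq=4000.0, max_cutoff_freq=8000.0, p=0.5)

# Option 1: run the transform on CPU inside each dataloader worker;
# the collate_fn returns CPU tensors and only the model lives on the GPU.
def cpu_collate(batch):
    x = torch.utils.data.dataloader.default_collate(batch)  # (batch, channels, samples), on CPU
    return lpf(x, sample_rate=SAMPLE_RATE)

# Option 2: keep the collate_fn plain and run the transform on GPU
# in the training process, after the batch has been moved there:
#     x = lpf(x.to("cuda"), sample_rate=SAMPLE_RATE)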

I have added a "Known issues" section to the readme now, by the way: https://github.com/asteroid-team/torch-audiomentations#known-issues

I should write this article soon, to make it easier to decide if (and how) torch-audiomentations is a good fit. Also, it would be swell if someone/we could reproduce and fix this memory leak 😅 I don't have a lot of spare time to do it right now, but I'd love to help


I think (hope) I did those augmentations on the GPU since the incoming data is already on CUDA. I wrapped torch-audiomentations functions in a preprocessor class that was used as the collate function of my dataloader. While I can't provide a complete code snippet, it is something like:

import typing as T
from collections import namedtuple
from pathlib import Path

import numpy as np
import torch
from torch_audiomentations import AddBackgroundNoise, Compose, Gain, LowPassFilter

SAMPLE_RATE = 16000  # placeholder; this constant is defined elsewhere in my actual code


class MyPreprocessor:
    def __init__(self, noise_set: Path, device: str = "cuda"):
        self._device = device
        self._augmentor = Compose(
            transforms=[
                Gain(
                    min_gain_in_db=-15.0,
                    max_gain_in_db=5.0,
                    p=0.5,
                    p_mode="per_example",
                ),
                LowPassFilter(
                    min_cutoff_freq=4000.0,
                    max_cutoff_freq=8000.0,
                    p=0.5,
                    p_mode="per_example",
                ),
                AddBackgroundNoise(
                    background_paths=noise_set,
                    min_snr_in_db=0.0,
                    max_snr_in_db=30.0,
                    p=0.5,
                    p_mode="per_example",
                ),
            ]
        )

    def __call__(self, batch: T.List[np.ndarray]) -> T.Tuple[torch.Tensor, torch.Tensor]:
        # Collate (clean, noisy) pairs, move them to the target device and augment the noisy side.
        AudioPair = namedtuple('AudioPair', ['clean', 'noisy'])
        batch_pairs = [AudioPair(pair[0], pair[1]) for pair in batch]
        batch_pairs = torch.utils.data.dataloader.default_collate(batch_pairs)
        y = batch_pairs.clean.unsqueeze(1).to(self._device)
        x = batch_pairs.noisy.unsqueeze(1).to(self._device)
        x = self._augmentor(x, sample_rate=SAMPLE_RATE)
        return x, y

Then my dataloader looks like:

        self.train_loader = torch.utils.data.DataLoader(
            self.train_set,
            sampler=train_sampler,
            collate_fn=MyPreprocessor(noise_set=noise_set, device="cuda"),
            batch_size=BATCH_SIZE,
            drop_last=True,
            num_workers=num_workers,
            # Note: sampler and shuffle are mutually exclusive in DataLoader;
            # shuffle must be left False whenever a sampler (e.g. DistributedSampler) is passed.
            shuffle=train_shuffle,
            worker_init_fn=seed_worker,
        )

and I had to set num_workers=0 when training on >1 GPUs. Please correct me if this is not the expected way to use it. I'm running the code on remote GPUs and don't really know a good way to debug memory issues (I wish to contribute; any suggestions on where to look?). For now, this single-worker scheme works OK for me since my GPU utilization stays high.

Edit: I forgot to mention that I use the above with torch.nn.parallel.DistributedDataParallel if that's relevant.
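
For reference, one simple way to watch for a leak like this on a remote training run is to log PyTorch's CUDA allocator counters every few steps (a generic sketch, not specific to torch-audiomentations):

import torch

def log_cuda_memory(step: int, every: int = 50) -> None:
    # Print allocator stats periodically; a steady climb across steps points to a leak.
    if torch.cuda.is_available() and step % every == 0:
        allocated_mb = torch.cuda.memory_allocated() / 2**20
        reserved_mb = torch.cuda.memory_reserved() / 2**20
        print(f"step {step}: allocated={allocated_mb:.1f} MiB, reserved={reserved_mb:.1f} MiB")

# Usage inside the training loop:
# for step, (x, y) in enumerate(train_loader):
#     ...
#     log_cuda_memory(step)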


I was able to use num_workers > 0 as long as I don't use torch-audiomentations in GPU mode as part of the collate function (or of any operations that will be forked during CPU multiprocessing). This way I essentially define the GPU preprocessor function as part of my trainer (rather than the dataloader), and call it first in the forward() function after each mini-batch has been collated and uploaded to GPU. I guess the lesson for me here is not to invoke GPU processing as part of the CPU multiprocessing routine while those GPUs are already busy with forward and backward computation for the current batch of data. I think it's more of a PyTorch/Python issue (or simply not good practice) than an issue with this package.

Maybe this is obvious to more experienced folks, but I feel we could mention this caveat so that other users are aware of it?
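
Roughly, the structure I ended up with looks like this (a minimal sketch, not my actual code; the module, the stand-in layer and the constants are placeholders):

import torch
from torch_audiomentations import Compose, Gain, LowPassFilter

SAMPLE_RATE = 16000  # placeholder


class DenoiserWithAugmentation(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # The transforms are created once and live inside the trainer/model,
        # not inside the dataloader's collate_fn.
        self.augment = Compose(
            transforms=[
                Gain(min_gain_in_db=-15.0, max_gain_in_db=5.0, p=0.5),
                LowPassFilter(min_cutoff_freq=4000.0, max_cutoff_freq=8000.0, p=0.5),
            ]
        )
        self.net = torch.nn.Conv1d(1, 1, kernel_size=3, padding=1)  # stand-in for the real model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has already been collated by a plain (CPU-only) collate_fn and moved to the GPU.
        if self.training:
            x = self.augment(x, sample_rate=SAMPLE_RATE)
        return self.net(x)


# The dataloader can then use num_workers > 0, since no CUDA work happens in the workers:
# model = DenoiserWithAugmentation().to("cuda")
# for x, y in train_loader:
#     x, y = x.to("cuda"), y.to("cuda")
#     prediction = model(x)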

> This way I essentially define the GPU preprocessor function as part of my trainer (rather than the dataloader), and call it first in the forward() function after each mini-batch has been collated and uploaded to GPU.

Yes, this is the way I use torch-audiomentations on GPU too 👍 It would indeed be nice to have this documented well. I'm currently focusing on the documentation website for audiomentations, but I eventually want to make one for torch-audiomentations too, using the knowledge I gained from making the audiomentations documentation

Bloos commented

I implemented it in the same way and applied it in the training loop, but I'm still experiencing the memory leak.

I've got a GPU server with multiple GPUs and I am using PyTorch Lightning with DDP. I'm using only one GPU per process.
The exception happened in the band-pass filter, somewhere in the julius code, in cuFFT. Sadly, I cannot copy the stack trace because the server is in an offline environment.