How to distill a dataset whose size exceeds the GPU memory limit
Hi!
I wonder whether there is any way to distill much more data than fits in GPU memory. For a large-scale dataset, or for a typical 11G/12G GPU, that would be really useful. At first I thought state.distributed in your code was intended for that, i.e. spreading the distilled data across multiple GPUs, but then I found out I was wrong: it seems the code only distills as much data as fits in a single GPU's memory. So, any advice on this matter? (I've put a rough sketch of what I'm imagining below.)
Thanks a lot!
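To make the question concrete, here is a rough, untested sketch of the kind of chunking I have in mind. It only shows the memory handling: the loss and model here are dummy placeholders, not the real distillation objective. The distilled images live on the CPU as one big leaf tensor, and only one chunk at a time is moved to the GPU, so their gradients accumulate back on the CPU:

```python
import torch
import torch.nn.functional as F

# Untested sketch: distilled images stay on CPU; only one chunk at a
# time (plus the model) occupies GPU memory. The loss below is a dummy
# stand-in for the actual distillation step.
device = "cuda"
model = torch.nn.Sequential(
    torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10)
).to(device)

distilled = torch.randn(50000, 3, 32, 32, requires_grad=True)  # CPU leaf tensor
labels = torch.randint(0, 10, (50000,))
opt = torch.optim.Adam([distilled], lr=0.01)

chunk = 1024
opt.zero_grad()
for i in range(0, distilled.size(0), chunk):
    x = distilled[i:i + chunk].to(device)   # only this chunk is on the GPU
    y = labels[i:i + chunk].to(device)
    loss = F.cross_entropy(model(x), y)
    loss.backward()    # gradients flow back into the CPU tensor `distilled`
opt.step()
```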
So the torch.distributed code in this repo is used for dataset distillation with multiple networks. Could you please give an example of how to run distributed training? I don't see one in the README.
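For context, this is roughly how I picture the multi-network distributed mode; it is only my guess based on your description, not the actual code in this repo. Each process would hold its own randomly initialized network on its own GPU, and the gradients with respect to the shared distilled data would be averaged across processes:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

# Guess at the multi-network setup (not the repo's code): one network
# per process, gradients w.r.t. the distilled data averaged over ranks.
dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
device = torch.device("cuda", rank)

distilled = torch.randn(100, 3, 32, 32, device=device, requires_grad=True)
targets = torch.randint(0, 10, (100,), device=device)
model = torch.nn.Sequential(            # different random init on each rank
    torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10)
).to(device)

loss = F.cross_entropy(model(distilled), targets)
loss.backward()
dist.all_reduce(distilled.grad, op=dist.ReduceOp.SUM)  # combine the networks' gradients
distilled.grad /= dist.get_world_size()
```

I imagine this would be launched with something like `torchrun --nproc_per_node=4 this_script.py` (the script name is just a placeholder), but I'd still appreciate the actual command for this repo.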
Also, I think splitting the data across multiple GPUs and computing gradients on each GPU would be much more convenient. But that would mean building a computation graph across multiple devices and backpropagating through a subgraph on each device. Is that feasible? Any advice? (A small example of what I mean is below.)
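Just to be concrete about this second point, here is a toy example of a graph spanning two GPUs (it needs two visible devices). As far as I understand, autograd already backpropagates through the `.to()` transfer, so each device only ever holds its own subgraph:

```python
import torch

# Toy example of one computation graph spread over two GPUs.
x  = torch.randn(8, 3, requires_grad=True, device="cuda:0")
w0 = torch.randn(3, 4, requires_grad=True, device="cuda:0")
w1 = torch.randn(4, 2, requires_grad=True, device="cuda:1")

h = x @ w0             # subgraph on cuda:0
h = h.to("cuda:1")     # cross-device edge in the graph
y = (h @ w1).sum()     # subgraph on cuda:1

y.backward()           # gradients flow back across both devices
print(w0.grad.device, w1.grad.device)   # cuda:0 cuda:1
```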
I’m going to close this issue for now. Feel free to open a new one if you have other questions!