multi-gpu training
soyebn opened this issue · 9 comments
Training for 480 epochs on a single GPU will take quite a lot of time, so I wanted to check whether multi-GPU training is possible with the current code. I tried --dist_mode with mpi and auto, but they are not supported since global_utils.AutoGPU() is not defined. However, it looks like --dist_mode=horovod is supported. Can it be used for multi-GPU training?
Is it possible to share a sample command for multi-GPU training?
Btw, it took around 2.83 hours to complete 1 epoch on my GTX 1080 Ti. Does this sound reasonable, or too high? Typically I have seen around 15 minutes per epoch on my machine (4x GTX 1080) for EfficientNet-ES using Ross Wightman's timm repo.
Dear Soyebn,
We have multi-GPU training code in our internal code base, but cleaning it up is a mess. There is a lot of dirty work to do, such as resolving open-source license issues, writing documentation, removing sensitive content, etc. As this is nearly a one-man project, I could only clean up and release the single-GPU training code for demonstration purposes. I suggest searching the model with our NAS pipeline and then training it with a more efficient framework such as PyTorch Lightning.
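For reference, a minimal Lightning sketch of that workflow might look like the code below; the backbone, dataset, and hyper-parameter values are placeholders (not our internal recipe), and a recent pytorch-lightning version is assumed:

```python
# Sketch only: swap the placeholder backbone/dataset for the ZenNAS-searched
# network and ImageNet, and adjust the optimizer settings to your recipe.
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models


class LitClassifier(pl.LightningModule):
    def __init__(self, backbone, lr=0.1):
        super().__init__()
        self.backbone = backbone  # e.g. the network built from the searched structure string
        self.lr = lr

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.backbone(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # lr / momentum / weight decay are placeholder values
        return torch.optim.SGD(self.parameters(), lr=self.lr, momentum=0.9, weight_decay=4e-5)


if __name__ == "__main__":
    train_set = datasets.FakeData(size=256, transform=transforms.ToTensor())  # placeholder data
    model = LitClassifier(models.resnet18(num_classes=10))                    # placeholder backbone
    trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=1)
    trainer.fit(model, DataLoader(train_set, batch_size=64, num_workers=4))
```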
With regard to your question, the "--dist_mode=horovod" code has been cleaned up, but only 'single' mode is tested.
Yes, I can definitely use some other code base for training.
One question though: how much accuracy benefit have you observed from the teacher-student distillation approach vs. student-only training with the flops_400M model? My code currently lacks this distillation functionality, hence I wanted to check.
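For reference, the kind of vanilla logit distillation I have in mind is a Hinton-style loss, roughly like the sketch below (the temperature and weighting values are placeholders, not anyone's published recipe):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Plain logit distillation: cross-entropy on labels + temperature-scaled KL to the teacher."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients keep a comparable magnitude across temperatures
    return alpha * ce + (1.0 - alpha) * kl
```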
Btw your search code is amazingly fast. Thanks again.
Hi MingLin,
I did one round of training using torchdistill. I defined a network that I had searched with ZenNAS (400M FLOPs, no SE) in the torchdistill framework and trained it following their KD recipe.
I was able to bring the per-epoch training time down to ~30 minutes with 4 GTX 1080 Ti GPUs. However, I have not reached good accuracy with KD yet. I will keep you posted.
I am keeping the number of epochs at 200 (vs. 480 in your paper) for now so that I can try out a few rounds of training.
Hi MingLin,
My training with torchdistill is in progress. It looks like I will land around 70%, much less than I expected. Compared to your published ZenNet-400M-SE, my training has the following differences:
- Training epochs reduced from 480 to 200 (2% drop)
- No SE (1% drop?)
- You may have used more augmentations than torchdistill supports (1% drop?)
- No distillation with an inner feature loss (?)
- Other hyperparameter differences
I was expecting around 74% (78 - 4), but I seem to be far from that right now.
By any chance, is it possible to share your internal training code, which supports multiple GPUs, with me via Google Drive or some other sharing method?
Dear Soyebn,
Thank you for the update!
No SE causes about a 1% drop. We use auto-augmentation, label smoothing, random erasing, and mixup.
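In torchvision/timm terms, that augmentation stack looks roughly like the sketch below; the specific policies and magnitudes here are illustrative guesses, not our released recipe:

```python
import torchvision.transforms as T
from timm.data import Mixup  # pip install timm

# Auto-augment + random erasing on the input pipeline (parameter values are illustrative).
train_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.AutoAugment(T.AutoAugmentPolicy.IMAGENET),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    T.RandomErasing(p=0.25),
])

# Mixup + label smoothing applied per batch; timm's Mixup returns soft targets,
# so the classification loss becomes timm.loss.SoftTargetCrossEntropy.
mixup_fn = Mixup(mixup_alpha=0.2, label_smoothing=0.1, num_classes=1000)
# images, targets = mixup_fn(images, targets)
```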
Inner-feature distillation is very important, but even without it you should be able to get close to 74%~75%.
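By "inner-feature distillation" I mean matching intermediate feature maps, not just logits. A generic sketch is below; the 1x1 projection, the choice of layer, and the loss weight are illustrative, not our exact internal recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """MSE between a chosen student feature map and the matching teacher feature map."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 conv so the (narrower) student feature can be compared to the teacher feature.
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1, bias=False)

    def forward(self, student_feat, teacher_feat):
        s = self.proj(student_feat)
        if s.shape[-2:] != teacher_feat.shape[-2:]:
            s = F.interpolate(s, size=teacher_feat.shape[-2:], mode="bilinear", align_corners=False)
        return F.mse_loss(s, teacher_feat.detach())

# total_loss = task_loss + logit_kd_loss + beta * feature_distill_loss   # beta is a guess
```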
The training scripts we released contain everything you need except the multi-GPU part. I am afraid cleaning up the multi-GPU training code is nearly an impossible mission, as we would need to restart the code review process, which is a nightmare to me.
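If you want to add multi-GPU yourself, a bare-bones DistributedDataParallel skeleton is sketched below; this is generic PyTorch launched with torchrun, with a dummy model and dataset standing in for the ZenNAS network and ImageNet:

```python
# Launch with: torchrun --nproc_per_node=4 train_ddp.py
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder network and dataset; replace with the searched model and ImageNet.
model = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(16, 1000)).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

dataset = datasets.FakeData(size=1024, transform=transforms.ToTensor())
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=64, sampler=sampler, num_workers=4, pin_memory=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle shards across ranks each epoch
    for images, targets in loader:
        images, targets = images.cuda(local_rank), targets.cuda(local_rank)
        loss = criterion(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```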
The poor performance might be because your NAS search did not converge to a good solution. In our paper experiments, the initial structure is randomly generated at around 300 MFLOPs, but in our released script, for simplicity, we initialize the EA process from a fixed small structure for all tasks. Both methods should converge to similar final structures if the EA is run for enough generations.
Would you mind posting your searched structure? I can manually compare it to the one found in our paper.
Sorry for the late reply.
Here is the network I found after searching for 480k iterations, targeting 400M FLOPs:
```
SuperConvK3BNRELU(3,16,2,1)SuperResIDWE2K7(16,24,2,24,1)SuperResIDWE4K7(24,48,2,32,1)SuperResIDWE2K7(48,80,2,160,1)SuperResIDWE6K7(80,152,2,152,3)SuperResIDWE6K7(152,152,1,128,5)SuperResIDWE4K7(152,112,1,128,1)SuperConvK1BNRELU(112,2048,1,1)
```
Compared to the officially released structure, this structure seems shallower in the lower stages and narrower in the higher stages. We do observe that when the search is initialized with different seeds, or even simply repeated, the EA converges to different structures. I would suggest manually designing a 300 MFLOPs model, with 2x width after each down-sampling layer and with the depth of each stage increasing a bit. This will give you more stable convergence.
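As an illustration of that pattern, a hand-designed structure might look like the block list below (one block per line for readability; in practice the blocks are concatenated into a single string like the one above). The exact channel and depth numbers are illustrative and have not been tuned to hit 300 MFLOPs:

```
SuperConvK3BNRELU(3,16,2,1)
SuperResIDWE2K7(16,32,2,32,2)
SuperResIDWE4K7(32,64,2,64,3)
SuperResIDWE4K7(64,128,2,128,4)
SuperResIDWE6K7(128,256,2,256,4)
SuperConvK1BNRELU(256,2048,1,1)
```

Here the width doubles after every stride-2 block (16, 32, 64, 128, 256) and the per-stage depth grows gradually (2, 3, 4, 4), which is the kind of regular shape that tends to converge more stably.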