BIGBALLON/distribuuuu

Learning rate setting when using single vs. multi GPU

buaacarzp opened this issue · 16 comments

Your learning rate isn't modified when changing from single GPU to multi GPU. Have you noticed that?

Hi, @buaacarzp

You can see the baselines for the LR settings (check the pycls Model Zoo for more details):

I use a reference learning rate of 0.1 (batch size = 128) and a weight decay of 5e-5. The LR is scaled linearly with the total batch size, so it is set to 6.4 when the batch size is 8192 (128*64GPUs). Check the following results:

| model      | epoch | total batch        | lr policy | base lr | Acc@1  | Acc@5  | model / config |
| ---------- | ----- | ------------------ | --------- | ------- | ------ | ------ | -------------- |
| resnet18   | 100   | 256 (32*8GPUs)     | cos       | 0.2     | 70.902 | 89.894 | Drive / cfg    |
| resnet18   | 100   | 1024 (128*8GPUs)   | cos       | 0.8     | 70.994 | 89.892 |                |
| resnet18   | 100   | 8192 (128*64GPUs)  | cos       | 6.4     | 70.165 | 89.374 |                |
| resnet18   | 100   | 16384 (256*64GPUs) | cos       | 12.8    | 68.766 | 88.381 |                |
| resnet50   | 100   | 256 (32*8GPUs)     | cos       | 0.2     | 77.252 | 93.430 | Drive / cfg    |
| botnet50   | 100   | 256 (32*8GPUs)     | cos       | 0.2     | 77.604 | 93.682 | Drive / cfg    |
| resnext101 | 100   | 256 (32*8GPUs)     | cos       | 0.2     | 78.938 | 94.482 |                |
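
For reference, the base lr column follows the linear scaling rule. A minimal sketch of the arithmetic (BASE_LR, BASE_BATCH, and get_scaled_lr are illustrative names, not part of distribuuuu's API):

# Linear scaling rule: the LR grows in proportion to the total batch size.
BASE_LR = 0.1      # reference learning rate
BASE_BATCH = 128   # reference batch size

def get_scaled_lr(batch_per_gpu, num_gpus):
    total_batch = batch_per_gpu * num_gpus
    return BASE_LR * total_batch / BASE_BATCH

print(get_scaled_lr(32, 8))    # total batch 256   -> LR 0.2
print(get_scaled_lr(128, 64))  # total batch 8192  -> LR 6.4
print(get_scaled_lr(256, 64))  # total batch 16384 -> LR 12.8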

TODO:

  • clarify the training settings
  • add NUM_CLASSES
  • add timm
  • add more baselines

Okay, here are a few questions:

  1. Why is single-machine multi-GPU training slower with mp than with launch?
  2. Why is training on my dataset slower than training on CIFAR10 with the same single-machine multi-GPU launch setup?
  3. Why is GPU utilization sometimes 0% when training my dataset with multiple GPUs on a single machine, while both GPUs stay at 95%+ on CIFAR10?

The biggest issue: IO time.
Sometimes IO is the real bottleneck, not the GPU; a quick way to verify this is sketched below.

CIFAR10 is loaded into memory. If you have a server with large RAM (such as 512GB), you can also load the whole ImageNet dataset into memory.

BTW, I think there is no essential difference between mp and launch. XD
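
An illustrative sketch for checking this: time how long each step waits on the DataLoader versus how long the GPU works (train_loader and model are assumed to exist, as in the snippets in this thread):

import time
import torch

data_time = 0.0
total_time = 0.0
end = time.time()
for images, targets in train_loader:
    data_time += time.time() - end    # time spent waiting on the DataLoader (IO)
    images = images.cuda(non_blocking=True)
    _ = model(images)                 # forward pass only, for timing purposes
    torch.cuda.synchronize()          # wait for the GPU to finish before timing
    total_time += time.time() - end
    end = time.time()
print(f"IO: {data_time:.1f}s of {total_time:.1f}s total")

If data_time dominates total_time, the GPUs are starving on IO.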

Hi, bro. You mean the dataloader for CIFAR10 has pin_memory set as below?

train_loader = torch.utils.data.DataLoader(
    trainset,
    batch_size=BATCH_SIZE,
    num_workers=4,
    pin_memory=True,
    sampler=train_sampler,
)

I meant: try to focus on ImageNet, not CIFAR.

But you said that CIFAR10 is loaded in memory, and then you said to try it on ImageNet. The problem I have now is that the dataloader reads data slowly: it takes 3 seconds on average to read each batch. Is this related to my data? My data is video data; I read the frames and save them as npy files. Why is it so slow?

1: "CIFAR10 is loaded in memory" means the whole dataset (50K train images and 10K test images) is loaded into memory, so its IO is much faster than ImageNet's.
2: "Is this related to my data?" I think so.
3: pin_memory's help is limited.
4: "Why?" Check your dataset and code... XD

5: "The problem I have now is that the dataloader reads data slowly. It takes 3 seconds on average to read each batch."

This is related to your hardware (CPU, SSD) and to the dataset itself.

6: Is there any preprocessing after loading your video data? Try to find the function or spot that costs the most time; see the profiling sketch below.
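
One way to find the hot spot is to profile __getitem__ directly. A minimal sketch using the standard-library cProfile (MyVideoDataset is a hypothetical stand-in for your npy-backed dataset):

import cProfile
import pstats

# MyVideoDataset is hypothetical; substitute your own Dataset class.
dataset = MyVideoDataset()

def fetch_some_samples():
    for i in range(100):
        _ = dataset[i]  # runs __getitem__: np.load plus any preprocessing

cProfile.run("fetch_some_samples()", "loader.prof")
pstats.Stats("loader.prof").sort_stats("cumulative").print_stats(10)

The top entries of the printed stats show whether np.load, decoding, or a transform eats the 3 seconds per batch.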

1. Why is the CIFAR10 dataset loaded in memory? Is it because of the pin_memory setting? I set it to True when loading my dataset, but only one GPU was at 100% during training; the other stayed at 0 for a long time, with its CUDA memory only increasing occasionally.
2. Another problem: I've been using your distribuuuu library these past two days. I find that when calculating accuracy, the accuracy tested on only one process's split of the training or validation set does not represent the whole dataset. Is there any document I can read to solve this problem?

Q: Why is the CIFAR10 dataset loaded in memory? Is it because of the pin_memory setting? I set it to True when loading my dataset, but only one GPU was at 100% during training; the other stayed at 0 for a long time, with its CUDA memory only increasing occasionally.

A: Plz check the following snippets:

https://github.com/pytorch/vision/blob/cac8a97b0bd14eddeff56f87a890d5cc85776e18/torchvision/datasets/cifar.py#L12
https://github.com/pytorch/vision/blob/cac8a97b0bd14eddeff56f87a890d5cc85776e18/torchvision/datasets/cifar.py#L80

You may find out why the CIFAR10 dataset is loaded in memory.

As for pin_memory, plz check this article: "For data loading, passing pin_memory=True to a DataLoader will automatically put the fetched data Tensors in pinned memory, and thus enables faster data transfer to CUDA-enabled GPUs." As for what pinned memory actually is, plz review your Operating Systems knowledge. A short sketch of the pattern follows.
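
A minimal sketch of the pin_memory pattern (illustrative, not distribuuuu's code; CIFAR10 is used here because the whole array lives in memory after construction, which is why its IO is so cheap):

import torch
import torchvision
import torchvision.transforms as T

# CIFAR10 decodes the entire dataset into one in-memory array on construction.
trainset = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor()
)
train_loader = torch.utils.data.DataLoader(
    trainset, batch_size=128, num_workers=4, pin_memory=True
)

for images, targets in train_loader:
    # Pinned (page-locked) host memory enables asynchronous copies to the GPU,
    # which is what non_blocking=True takes advantage of.
    images = images.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)
    break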

Q: Another problem: I've been using your distribuuuu library these past two days. I find that when calculating accuracy, the accuracy tested on only one process's split of the training or validation set does not represent the whole dataset. Is there any document I can read to solve this problem?

A: This is the standard test method based on the ResNet paper (see this script); I don't think there is any problem. If you do want a number over the whole dataset rather than one process's shard, see the sketch below.
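
A common pattern for whole-dataset accuracy is to all-reduce per-rank counters. A minimal sketch (assuming torch.distributed is already initialized and val_loader / model exist; not necessarily what distribuuuu's script does):

import torch
import torch.distributed as dist

# Per-process counters over this rank's shard of the validation set.
correct = torch.tensor(0, device="cuda")
total = torch.tensor(0, device="cuda")

for images, targets in val_loader:  # val_loader uses a DistributedSampler
    images = images.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)
    preds = model(images).argmax(dim=1)
    correct += (preds == targets).sum()
    total += targets.numel()

# Sum the counters across all ranks (default op is SUM),
# so every process sees the global accuracy.
dist.all_reduce(correct)
dist.all_reduce(total)
print(f"global Acc@1: {100.0 * correct.item() / total.item():.3f}")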

Hi, @buaacarzp, I updated the settings, see 9223904

I read it in detail. Good job!