BIGBALLON/distribuuuu

Learning rate setting when using single vs. multi GPU

buaacarzp opened this issue · 16 comments

Your learning rate isn't modified when changing from single GPU to multi GPU. Have you noticed that?

Hi, @buaacarzp

You can see the baselines for the LR settings (check the pycls Model Zoo for more details):

I use a reference learning rate of 0.1 (batch size = 128) and a weight decay of 5e-5. The LR is scaled linearly with the total batch size, so it is set to 6.4 when the batch size is 8192 (128*64GPUs). Check the following results:

| model      | epoch | total batch        | lr policy | base lr | Acc@1  | Acc@5  | model / config |
| ---------- | ----- | ------------------ | --------- | ------- | ------ | ------ | -------------- |
| resnet18   | 100   | 256 (32*8GPUs)     | cos       | 0.2     | 70.902 | 89.894 | Drive / cfg    |
| resnet18   | 100   | 1024 (128*8GPUs)   | cos       | 0.8     | 70.994 | 89.892 |                |
| resnet18   | 100   | 8192 (128*64GPUs)  | cos       | 6.4     | 70.165 | 89.374 |                |
| resnet18   | 100   | 16384 (256*64GPUs) | cos       | 12.8    | 68.766 | 88.381 |                |
| resnet50   | 100   | 256 (32*8GPUs)     | cos       | 0.2     | 77.252 | 93.430 | Drive / cfg    |
| botnet50   | 100   | 256 (32*8GPUs)     | cos       | 0.2     | 77.604 | 93.682 | Drive / cfg    |
| resnext101 | 100   | 256 (32*8GPUs)     | cos       | 0.2     | 78.938 | 94.482 |                |
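
For reference, the base lr column follows the linear scaling rule. A minimal sketch of the arithmetic (BASE_LR, BASE_BATCH, and get_scaled_lr are illustrative names, not part of distribuuuu's API):

# Linear scaling rule: the LR grows in proportion to the total batch size.
BASE_LR = 0.1      # reference learning rate
BASE_BATCH = 128   # reference batch size

def get_scaled_lr(batch_per_gpu, num_gpus):
    total_batch = batch_per_gpu * num_gpus
    return BASE_LR * total_batch / BASE_BATCH

print(get_scaled_lr(32, 8))    # total batch 256   -> LR 0.2
print(get_scaled_lr(128, 64))  # total batch 8192  -> LR 6.4
print(get_scaled_lr(256, 64))  # total batch 16384 -> LR 12.8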

TODO:

  • clarify the training settings
  • add NUM_CLASSES
  • add timm
  • add more baselines

Okay, here are a few questions:

  1. Why is single-machine multi-GPU training slower with mp than with launch?
  2. Why is training on my dataset slower than training on CIFAR10 with the same single-machine multi-GPU launch setup?
  3. Why is GPU utilization sometimes 0% when training my dataset with multiple GPUs on a single machine, while both GPUs stay at 95%+ on CIFAR10?

The biggest issue: IO time.
Sometimes IO is the real bottleneck, not the GPU; a quick way to verify this is sketched below.

CIFAR10 is loaded into memory. If you have a server with large RAM (such as 512GB), you can also load the whole ImageNet dataset into memory.

BTW, I think there is no essential difference between mp and launch. XD
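
An illustrative sketch for checking this: time how long each step waits on the DataLoader versus how long the GPU works (train_loader and model are assumed to exist, as in the snippets in this thread):

import time
import torch

data_time = 0.0
total_time = 0.0
end = time.time()
for images, targets in train_loader:
    data_time += time.time() - end    # time spent waiting on the DataLoader (IO)
    images = images.cuda(non_blocking=True)
    _ = model(images)                 # forward pass only, for timing purposes
    torch.cuda.synchronize()          # wait for the GPU to finish before timing
    total_time += time.time() - end
    end = time.time()
print(f"IO: {data_time:.1f}s of {total_time:.1f}s total")

If data_time dominates total_time, the GPUs are starving on IO.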

Hi, bro. You mean the dataloader for CIFAR10 has pin_memory set as below?

train_loader = torch.utils.data.DataLoader(
    trainset,
    batch_size=BATCH_SIZE,
    num_workers=4,
    pin_memory=True,
    sampler=train_sampler,
)

I meant: try to focus on ImageNet, not CIFAR.

But you said that CIFAR10 is loaded in memory, and then you said to try it on ImageNet. The problem I have now is that the dataloader reads data slowly: it takes 3 seconds on average to read each batch. Is this related to my data? My data is video data; I read the frames and save them as npy files. Why is it so slow?

1: "CIFAR10 is loaded in memory" means the whole dataset (50K train images and 10K test images) is loaded into memory, so its IO is much faster than ImageNet's.
2: "Is this related to my data?" I think so.
3: pin_memory's help is limited.
4: "Why?" Check your dataset and code... XD

5: "The problem I have now is that the dataloader reads data slowly. It takes 3 seconds on average to read each batch."

This is related to your hardware (CPU, SSD) and to the dataset itself.

6: Is there any preprocessing after loading your video data? Try to find the function or spot that costs the most time; see the profiling sketch below.
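
One way to find the hot spot is to profile __getitem__ directly. A minimal sketch using the standard-library cProfile (MyVideoDataset is a hypothetical stand-in for your npy-backed dataset):

import cProfile
import pstats

# MyVideoDataset is hypothetical; substitute your own Dataset class.
dataset = MyVideoDataset()

def fetch_some_samples():
    for i in range(100):
        _ = dataset[i]  # runs __getitem__: np.load plus any preprocessing

cProfile.run("fetch_some_samples()", "loader.prof")
pstats.Stats("loader.prof").sort_stats("cumulative").print_stats(10)

The top entries of the printed stats show whether np.load, decoding, or a transform eats the 3 seconds per batch.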

1. Why is the CIFAR10 dataset loaded in memory? Is it because of the pin_memory setting? I set it to True when loading my dataset, but only one GPU was at 100% during training; the other stayed at 0 for a long time, with its CUDA memory only increasing occasionally.
2. Another problem: I've been using your distribuuuu library these past two days. I find that when calculating accuracy, the accuracy tested on only one process's split of the training or validation set does not represent the whole dataset. Is there any document I can read to solve this problem?

Q: Why is the CIFAR10 dataset loaded in memory? Is it because of the pin_memory setting? I set it to True when loading my dataset, but only one GPU was at 100% during training; the other stayed at 0 for a long time, with its CUDA memory only increasing occasionally.

A: Plz check the following snippets:

https://github.com/pytorch/vision/blob/cac8a97b0bd14eddeff56f87a890d5cc85776e18/torchvision/datasets/cifar.py#L12
https://github.com/pytorch/vision/blob/cac8a97b0bd14eddeff56f87a890d5cc85776e18/torchvision/datasets/cifar.py#L80

You may find out why the CIFAR10 dataset is loaded in memory.

As for pin_memory, plz check this article: "For data loading, passing pin_memory=True to a DataLoader will automatically put the fetched data Tensors in pinned memory, and thus enables faster data transfer to CUDA-enabled GPUs." As for what pinned memory actually is, plz review your Operating Systems knowledge. A short sketch of the pattern follows.
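
A minimal sketch of the pin_memory pattern (illustrative, not distribuuuu's code; CIFAR10 is used here because the whole array lives in memory after construction, which is why its IO is so cheap):

import torch
import torchvision
import torchvision.transforms as T

# CIFAR10 decodes the entire dataset into one in-memory array on construction.
trainset = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor()
)
train_loader = torch.utils.data.DataLoader(
    trainset, batch_size=128, num_workers=4, pin_memory=True
)

for images, targets in train_loader:
    # Pinned (page-locked) host memory enables asynchronous copies to the GPU,
    # which is what non_blocking=True takes advantage of.
    images = images.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)
    break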

Q: Another problem: I've been using your distribuuuu library these past two days. I find that when calculating accuracy, the accuracy tested on only one process's split of the training or validation set does not represent the whole dataset. Is there any document I can read to solve this problem?

A: This is the standard test method based on the ResNet paper (see this script); I don't think there is any problem. If you do want a number over the whole dataset rather than one process's shard, see the sketch below.
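
A common pattern for whole-dataset accuracy is to all-reduce per-rank counters. A minimal sketch (assuming torch.distributed is already initialized and val_loader / model exist; not necessarily what distribuuuu's script does):

import torch
import torch.distributed as dist

# Per-process counters over this rank's shard of the validation set.
correct = torch.tensor(0, device="cuda")
total = torch.tensor(0, device="cuda")

for images, targets in val_loader:  # val_loader uses a DistributedSampler
    images = images.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)
    preds = model(images).argmax(dim=1)
    correct += (preds == targets).sum()
    total += targets.numel()

# Sum the counters across all ranks (default op is SUM),
# so every process sees the global accuracy.
dist.all_reduce(correct)
dist.all_reduce(total)
print(f"global Acc@1: {100.0 * correct.item() / total.item():.3f}")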

Hi, @buaacarzp, I updated the settings, see 9223904

I read it in detail. Good job!