training time

Thanks for your amazing work. I'm interested in how long training takes.
From config_server_kitti.yaml, it looks like you used 4x3090 GPUs with a batch size of 8. Could you share your training time?
Also, given limited hardware, do you think performance comparable to what the paper reports can be reached with a single 3090? In addition, is it possible to use AMP in torch?

@huixiancheng About AMP: I have used it, and it was very useful for speeding up training on a single RTX 3090.

@marius12233 Great, thanks for your advice.
In fact, I have tried AMP, and with it the model can be trained on a single 3090 with a batch size of 8.

However, after several epochs of training, the entropy computed for the LiDAR stream becomes NaN, which in turn leads to a NaN loss, as shown below.

Train E[050|004] I[4782|0001] DT[1.235] PT[0.567] LR 0.00050 Loss 0.9204 Acc 0.4807 IOU 0.4363 Recall 0.5187 Entropy nan ImgAcc 0.4073 ImgIOU 0.3475 ImgRecall 0.4602 ImgEntropy 0.3132 RT 1 day, 17:39:03.734664

Would you be willing to share your code? Here is my AMP code; I only changed this part:

        if mode == "Train":
            scaler = torch.cuda.amp.GradScaler()
            with torch.cuda.amp.autocast():
                lidar_pred, camera_pred = self.model(pcd_feature, img_feature)

                lidar_pred_log = torch.log(lidar_pred.clamp(min=1e-8))

                # compute pcd entropy: p * log p
                pcd_entropy = -(lidar_pred * lidar_pred_log).sum(1) / \
                    math.log(self.settings.nclasses)

                loss_lov, loss_foc = self._computeClassifyLoss(
                    pred=lidar_pred, label=input_label, label_mask=label_mask)

                # compute img entropy
                camera_pred_log = torch.log(
                    camera_pred.clamp(min=1e-8))
                # normalize to [0,1)
                img_entropy = - \
                    (camera_pred * camera_pred_log).sum(1) / \
                    math.log(self.settings.nclasses)

                loss_lov_cam, loss_foc_cam = self._computeClassifyLoss(
                    pred=camera_pred, label=input_label, label_mask=label_mask)

                loss_per, pcd_guide_weight, img_guide_weight = self._computePerceptionAwareLoss(
                    pcd_entropy=pcd_entropy, img_entropy=img_entropy,
                    pcd_pred=lidar_pred, pcd_pred_log=lidar_pred_log,
                    img_pred=camera_pred, img_pred_log=camera_pred_log
                )

                total_loss = loss_foc + loss_lov * self.settings.lambda_ + \
                     loss_foc_cam + loss_lov_cam * self.settings.lambda_ + \
                     loss_per * self.settings.gamma

                if self.settings.n_gpus > 1:
                    total_loss = total_loss.mean()

            # backward
            #self._backward(total_loss)

            self.optimizer.zero_grad()
            self.aux_optimizer.zero_grad()

            scaler.scale(total_loss).backward()
            scaler.step(self.optimizer)
            scaler.step(self.aux_optimizer)
            scaler.update()

            # update lr after backward (required by pytorch)
            self.scheduler.step()
            self.aux_scheduler.step()
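
As a side note on the entropy term in the snippet above, here is a minimal sketch of how the normalized entropy behaves, assuming the predictions are softmax probabilities with the class dimension at index 1. `normalized_entropy` is a hypothetical helper, not a function from the PMF repository. The sketch suggests that `clamp(min=1e-8)` protects against `log(0)`, but that a NaN already present in the prediction still propagates, so an `Entropy nan` in the log may point to NaN or inf values produced earlier in the network.

```python
import math
import torch

def normalized_entropy(pred: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Entropy over the class dimension (dim 1), divided by log(num_classes)."""
    num_classes = pred.shape[1]
    pred_log = torch.log(pred.clamp(min=eps))
    return -(pred * pred_log).sum(1) / math.log(num_classes)

# Uniform prediction over 4 classes: maximal entropy, normalized to about 1.
uniform = torch.full((1, 4), 0.25)
# One-hot prediction: the clamp avoids log(0), so the entropy is about 0.
one_hot = torch.tensor([[1.0, 0.0, 0.0, 0.0]])
# A prediction that already contains NaN: the NaN propagates through the
# product and the sum, so the entropy is NaN as well.
broken = torch.tensor([[float("nan"), 0.5, 0.25, 0.25]])

entropy_uniform = normalized_entropy(uniform)
entropy_one_hot = normalized_entropy(one_hot)
entropy_broken = normalized_entropy(broken)
print(entropy_uniform, entropy_one_hot, entropy_broken)
```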

Also, did you get a similar performance at the end of your training?

@huixiancheng Yes, the trick is to decrease the learning rate to 0.0005.

@marius12233 That's great. So you trained with a batch size of 8 and lr = 0.0005? That's strange: as far as I know, the lr would only need to be reduced linearly to half when the batch size is set to 4.

In addition, could you share how you solved the problem mentioned above?
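
The linear scaling heuristic mentioned here can be written out as a small calculation. This is a sketch of a common rule of thumb, not something the PMF code is known to do; `scale_learning_rate` and its parameter names are hypothetical, and the reference point (learning rate 0.001 at batch size 8) is an assumption based on the numbers quoted in this thread.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, batch_size: int) -> float:
    """Linear scaling heuristic: keep lr / batch_size constant."""
    if base_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch_size / base_batch_size

# Assumed reference point: lr 0.001 at batch size 8.
lr_batch_4 = scale_learning_rate(base_lr=0.001, base_batch_size=8, batch_size=4)
lr_batch_8 = scale_learning_rate(base_lr=0.001, base_batch_size=8, batch_size=8)
print(f"batch 4: {lr_batch_4:.4f}, batch 8: {lr_batch_8:.4f}")
```

Under this heuristic, halving the batch size halves the learning rate, so 0.0005 would correspond to a batch size of 4. It is only a rule of thumb, and a lower learning rate at the same batch size can still be a reasonable choice.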

@huixiancheng Sorry, that was my mistake. I meant that you should start with a learning rate of 0.0005 instead of 0.001, which is the default value in the code.

As for the code, I will share the parts that should be useful to you.
First of all, I declared the scaler as a property in the constructor.
Then, in the training loop, you can do the following, which is the same as yours:

    if mode == "Train":
        with torch.cuda.amp.autocast():
            lidar_pred, camera_pred = self.model(pcd_feature, img_feature)

            lidar_pred_log = torch.log(lidar_pred.clamp(min=1e-8))

            # compute pcd entropy: p * log p
            pcd_entropy = -(lidar_pred * lidar_pred_log).sum(1) / \
                math.log(self.settings.nclasses)

            loss_lov, loss_foc = self._computeClassifyLoss(
                pred=lidar_pred, label=input_label, label_mask=label_mask)

            # compute img entropy
            camera_pred_log = torch.log(
                camera_pred.clamp(min=1e-8))
            # normalize to [0,1)
            img_entropy = - \
                (camera_pred * camera_pred_log).sum(1) / \
                math.log(self.settings.nclasses)

            loss_lov_cam, loss_foc_cam = self._computeClassifyLoss(
                pred=camera_pred, label=input_label, label_mask=label_mask)

            loss_per, pcd_guide_weight, img_guide_weight = self._computePerceptionAwareLoss(
                pcd_entropy=pcd_entropy, img_entropy=img_entropy,
                pcd_pred=lidar_pred, pcd_pred_log=lidar_pred_log,
                img_pred=camera_pred, img_pred_log=camera_pred_log
            )

            total_loss = loss_foc + loss_lov * self.settings.lambda_ + \
                 loss_foc_cam + loss_lov_cam * self.settings.lambda_ + \
                 loss_per * self.settings.gamma

            if self.settings.n_gpus > 1:
                total_loss = total_loss.mean()

        # backward
        #self._backward(total_loss)

        self.optimizer.zero_grad()
        self.aux_optimizer.zero_grad()

        self.scaler.scale(total_loss).backward()
        self.scaler.step(self.optimizer)
        self.scaler.step(self.aux_optimizer)
        self.scaler.update()
        self.scheduler.step()
        self.aux_scheduler.step()
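
A minimal, self-contained sketch of the pattern described in this reply: the `GradScaler` is created once in the constructor and reused for every step, and a single scaler serves both optimizers. `TinyTrainer`, its two linear layers, and the `use_amp` flag are hypothetical stand-ins for the PMF trainer, which is not shown in full in this thread. AMP is only enabled when CUDA is available, so the sketch should also run on a CPU-only machine.

```python
import torch
import torch.nn as nn

class TinyTrainer:
    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.use_amp = self.device == "cuda"
        # Stand-ins for the main network and the auxiliary branch.
        self.model = nn.Linear(8, 4).to(self.device)
        self.aux_model = nn.Linear(8, 4).to(self.device)
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=5e-4)
        self.aux_optimizer = torch.optim.SGD(self.aux_model.parameters(), lr=5e-4)
        # Created once, so the loss scale can adapt across iterations.
        self.scaler = torch.cuda.amp.GradScaler(enabled=self.use_amp)

    def train_step(self, x, target):
        self.optimizer.zero_grad()
        self.aux_optimizer.zero_grad()
        with torch.cuda.amp.autocast(enabled=self.use_amp):
            pred = self.model(x) + self.aux_model(x)
            loss = nn.functional.mse_loss(pred.float(), target)
        self.scaler.scale(loss).backward()
        # One step per optimizer, then a single update of the scale factor.
        self.scaler.step(self.optimizer)
        self.scaler.step(self.aux_optimizer)
        self.scaler.update()
        return loss.item()

torch.manual_seed(0)
trainer = TinyTrainer()
x = torch.randn(16, 8, device=trainer.device)
target = torch.randn(16, 4, device=trainer.device)
losses = [trainer.train_step(x, target) for _ in range(3)]
print(losses)
```

Creating the scaler inside the training step instead would reset the scale factor to its initial value on every iteration, so it could never back off after an overflow.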

Thanks again for your advice. I will give it a try when I have spare computing power.

@marius12233 Could you share your torch version (and CUDA version)? I want to rule that out as a factor.

I recently discovered that the NaN may be caused by a value that becomes too large after the convolution here:

PMF/pc_processor/models/salsanext.py

Lines 145 to 147 in 1875be9

    upE = self.conv1(upB)
    upE = self.act1(upE)
    upE1 = self.bn1(upE)

The maximum value of the tensor upE can exceed the range of fp16 (largest finite value 65504), which leads to inf and then NaN.

16
tensor(35712., device='cuda:0', dtype=torch.float16, grad_fn=<MaxBackward1>)
loss tensor(0.9265, device='cuda:0', grad_fn=<AddBackward0>)
17
tensor(inf, device='cuda:0', dtype=torch.float16, grad_fn=<MaxBackward1>)
nan 17 tensor(False, device='cuda:0') tensor(nan, device='cuda:0', grad_fn=<DivBackward0>) tensor(nan, device='cuda:0', grad_fn=<DivBackward0>)
loss tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)
18
tensor(27680., device='cuda:0', dtype=torch.float16, grad_fn=<MaxBackward1>)
loss tensor(0.8718, device='cuda:0', grad_fn=<AddBackward0>)
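
The overflow described above can be reproduced outside the network. The sketch below uses NumPy's `float16`, which follows the same IEEE half-precision format as `torch.float16`; the input values are made up for illustration. It shows that a value above the half-precision maximum becomes `inf`, and that a single `inf` can turn into `nan` once a mean is subtracted, which is one plausible way a normalization step after the convolution could produce NaN.

```python
import numpy as np

fp16_max = float(np.finfo(np.float16).max)  # largest finite half-precision value

# Silence the overflow/invalid warnings NumPy emits for these operations.
with np.errstate(over="ignore", invalid="ignore"):
    activations = np.array([17856.0, 60000.0], dtype=np.float16)
    doubled = activations * np.float16(2.0)  # 120000 does not fit in fp16
    centered = doubled - doubled.mean()      # the mean is inf; inf - inf is NaN

print(fp16_max)
print(doubled)
print(centered)
```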

So I tried clamping the value here:

        upE = self.conv1(upB)
        upE = torch.clamp(upE, min=-65504, max=65504)
        if prt:
            print(torch.max(upE))
        upE = self.act1(upE)
        upE1 = self.bn1(upE)

It looks like it works well:

16
tensor(35712., device='cuda:0', dtype=torch.float16, grad_fn=<MaxBackward1>)
loss tensor(0.9264, device='cuda:0', grad_fn=<AddBackward0>)
17
tensor(65504., device='cuda:0', dtype=torch.float16, grad_fn=<MaxBackward1>)
loss tensor(0.8150, device='cuda:0', grad_fn=<AddBackward0>)
18
tensor(27936., device='cuda:0', dtype=torch.float16, grad_fn=<MaxBackward1>)
loss tensor(0.8710, device='cuda:0', grad_fn=<AddBackward0>)

However, I'm not sure whether such large activation values are normal.
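
One way to check where such large activations first appear, rather than printing inside a single layer, might be to attach forward hooks that record each module's largest absolute output. This is a generic PyTorch debugging sketch, not code from the PMF repository; `attach_max_abs_hooks` and the toy model are hypothetical, and 65504 is used only as a threshold for flagging values that would not fit in fp16.

```python
import torch
import torch.nn as nn

FP16_MAX = 65504.0  # largest finite value representable in float16

def attach_max_abs_hooks(model: nn.Module, stats: dict):
    """Record the largest absolute output value of every leaf module."""
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # skip containers, hook only leaf modules

        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor):
                stats[name] = output.detach().abs().max().item()

        handles.append(module.register_forward_hook(hook))
    return handles

# Toy model whose first layer is scaled up to produce very large activations.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
with torch.no_grad():
    model[0].weight.fill_(1e5)
    model[0].bias.zero_()

stats = {}
handles = attach_max_abs_hooks(model, stats)
model(torch.ones(1, 4))
for handle in handles:
    handle.remove()  # detach the hooks once the statistics are collected

too_large = [name for name, value in stats.items() if value > FP16_MAX]
print(too_large)
```

Running a full-precision forward pass with such hooks on the real model would show whether the large values are confined to one layer or build up across the decoder.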

Thanks for your patience with this work. Did you later reproduce results similar to those reported in the paper using AMP?

Could you please share the complete code? This is my email: wansit99@gmail.com. Thanks!
