mihaidusmanu/d2-net

Question about the training convergence

Closed this issue · 3 comments

Dear Mihai Dusmanu!
Sorry to bother you again! I have two questions about the details of the code.
The first question: I tried to modify the feature extraction network, for example by using depthwise separable convolutions to speed up the model. However, when I train the new model, the loss stays at 1.000002 after the first epoch and never changes again. I have tried many variants and it always gets stuck at 1.000002. Have you ever encountered this problem? I cannot figure out the reason.
The second question: there are two different D2Net definitions, in model.py and model_test.py, used by train.py and extract_features.py respectively, and their backbones differ. For example, one uses average pooling and the other max pooling. Are these architecture changes necessary for the different uses (training vs. feature extraction)?
Thanks, Anna!

PS: the log file:
[train] epoch 1 - batch 11300 / 11600 - avg_loss: 1.054927
[train] epoch 1 - batch 11400 / 11600 - avg_loss: 1.054472
[train] epoch 1 - batch 11500 / 11600 - avg_loss: 1.054008
[train] epoch 1 - avg_loss: 1.053536
[valid] epoch 1 - batch 0 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 100 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 200 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 300 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 400 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 500 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 600 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 700 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 800 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 900 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1000 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1100 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1200 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1400 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1500 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1600 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1700 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - batch 1800 / 1875 - avg_loss: 1.000002
[valid] epoch 1 - avg_loss: 1.000002
[train] epoch 2 - batch 0 / 11600 - avg_loss: 1.000002
[train] epoch 2 - batch 100 / 11600 - avg_loss: 1.000002
[train] epoch 2 - batch 200 / 11600 - avg_loss: 1.000002
[train] epoch 2 - batch 400 / 11600 - avg_loss: 1.000002
[train] epoch 2 - batch 500 / 11600 - avg_loss: 1.000002

Hello,

Regarding your first question: I suspect you are using random initialization for your layers. Since the training is, in part, self-supervised, convergence problems are expected when starting from random weights (given ground-truth correspondences, the network tries to weight their detection based on their matchability with the current descriptors; if the descriptors are "random", then the network will often converge towards the trivial solution of making all descriptors equal). You will have more luck with one of the following options:

  1. Using a network architecture pretrained on ImageNet (e.g., MobileNet if you want something fast); see the sketch after this list.

  2. Pretraining your network with a different loss (e.g., raw matching by removing the weighting in front of the margin loss) for a few epochs and then switching back to the full loss.
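For reference, a minimal sketch of option 1 (assumptions: torchvision's MobileNetV2 and a cut-off index of 7, which gives an output stride of 8; both are illustrative and need to be adapted to your setup):

import torch
import torchvision

class TruncatedMobileNet(torch.nn.Module):
    # Sketch: an ImageNet-pretrained MobileNetV2, truncated so that the
    # feature map keeps enough resolution for dense description/detection.
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.mobilenet_v2(pretrained=True)
        # Index 7 is an assumption: the first 7 blocks of MobileNetV2
        # downscale the input by a factor of 8.
        self.features = backbone.features[:7]

    def forward(self, batch):
        return self.features(batch)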

As for the second question, the changes are not "needed", but there are a few things to take into account. Using dilated convolutions at test time improves performance by yielding a higher feature map resolution and more detections. However, using dilated convolutions at training time is significantly slower and requires more VRAM. This is why we settled for training without dilated convolutions and only using them at test time to emulate the same receptive field.
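To make the receptive-field point concrete, here is a simplified sketch (plain PyTorch layers with illustrative channel sizes; this is not the actual model.py / model_test.py code) of how a strided pooling stage used at training time can be replaced at test time by a stride-1 pooling plus a dilated convolution covering the same receptive field at twice the resolution:

import torch.nn as nn

# Training-time stage (sketch): stride-2 pooling halves the feature map,
# followed by a regular 3x3 convolution.
train_stage = nn.Sequential(
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(256, 512, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)

# Test-time stage (sketch): stride-1 pooling keeps the resolution, and a
# dilated 3x3 convolution (dilation=2) emulates the same receptive field
# as the training-time stage, just at twice the resolution.
test_stage = nn.Sequential(
    nn.MaxPool2d(kernel_size=2, stride=1),
    nn.Conv2d(256, 512, kernel_size=3, padding=2, dilation=2),
    nn.ReLU(inplace=True),
)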

Mihai.

Hello, Mihai Dusmanu!
Thank you so much for the detailed reply! I have experimented a lot these days. As you said, when I use a pretrained MobileNet or ShuffleNet, the convergence problem is solved. However, I can only use a truncated part of the network. If I use the complete MobileNet to replace the DenseFeatureExtractionModule, I get RuntimeError: CUDA error: device-side assert triggered... Searching online suggests that the labels are not aligned, but I have not found the exact error location yet. I will keep trying, and thank you again!
Anna

Hello. I think that the CUDA error is due to the upscaling / downscaling from / to feature map resolution (probably when recovering the descriptors at lines 82-85 in loss.py). If you change the backbone, you will probably need to update the keypoint scaling functions from utils.py and also change the scaling_steps parameter according to your network architecture (e.g. if the stride is 16, then you will need 4 scaling steps).

d2-net/lib/loss.py

Lines 82 to 85 in 8198366

descriptors2 = F.normalize(
    dense_features2[:, fmap_pos2[0, :], fmap_pos2[1, :]],
    dim=0
)

d2-net/lib/utils.py

Lines 64 to 73 in 8198366

def upscale_positions(pos, scaling_steps=0):
    for _ in range(scaling_steps):
        pos = pos * 2 + 0.5
    return pos


def downscale_positions(pos, scaling_steps=0):
    for _ in range(scaling_steps):
        pos = (pos - 0.5) / 2
    return pos
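A quick usage sketch of the scaling_steps remark above (assumptions: a backbone with output stride 16, i.e. four 2x downscalings, and the repository root on the Python path):

import torch
from lib.utils import upscale_positions, downscale_positions

pos = torch.tensor([[100.0], [240.0]])                    # keypoint position in image coordinates
fmap_pos = downscale_positions(pos, scaling_steps=4)      # image -> feature map (stride 16 => 4 steps)
recovered = upscale_positions(fmap_pos, scaling_steps=4)  # feature map -> image, recovers pos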