ClementPinard/SfmLearner-Pytorch

Does the size of batch-size affect the training results?

youmi-zym opened this issue · 14 comments

Hi,
I have run train.py on the KITTI raw data with the command below:
python3 train.py /path/to/the/formatted/data/ -b4 -m0 -s2.0 --epoch-size 1000 --sequence-length 5 --log-output --with-gt
Otherwise the batch-size=80, and the train (41664) / valid (2452) split is different.
The result I get is:
disp:
Results with scale factor determined by GT/prediction ratio (like the original paper) :
abs_rel, sq_rel, rms, log_rms, a1, a2, a3
0.2058, 1.6333, 6.7410, 0.2895, 0.6762, 0.8853, 0.9532

pose:
Results 10
ATE, RE
mean 0.0223, 0.0053
std 0.0188, 0.0036

Results 09
ATE, RE
mean 0.0284, 0.0055
std 0.0241, 0.0035
You can see that there's still quite a big margin compared to yours:
| Abs Rel | Sq Rel | RMSE | RMSE(log) | Acc.1 | Acc.2 | Acc.3 |
|---------|--------|------|-----------|-------|-------|-------|
| 0.181   | 1.341  | 6.236 | 0.262    | 0.733 | 0.901 | 0.964 |

I think there are no other factors causing this difference other than the batch size and data split. So, does the batch size affect the training results?
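(For context on the "scale factor determined by GT/prediction ratio" above: the predicted depth is rescaled per image by the median ratio between ground truth and prediction before the metrics are computed. A minimal sketch of that idea, with illustrative names rather than the repo's exact evaluation code:)

```python
import numpy as np

def abs_rel_with_median_scaling(gt_depth, pred_depth):
    """abs_rel after rescaling the prediction by the median GT/prediction ratio."""
    valid = (gt_depth > 1e-3) & (gt_depth < 80)   # typical KITTI depth range
    gt, pred = gt_depth[valid], pred_depth[valid]
    scale = np.median(gt) / np.median(pred)        # the "GT/prediction ratio"
    pred = pred * scale
    return np.mean(np.abs(gt - pred) / gt)
```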

What's more, when I try to train my model with two Titan GPUs and batch-size=80*2=160, the memory usage of each GPU is:
GPU0: about 11G, GPU1: about 6G.
There is a huge memory usage difference between the two GPUs, and it seriously impacts multi-GPU training.
I then found that the loss calculations are all placed on the first GPU; the memory is mainly used to compute the 4 scales of the depth photometric_reconstruction_loss. We could move some scales to cuda:0 and others to cuda:1, which I think might be better; a rough sketch of the idea is below.
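A minimal sketch of that per-scale placement, assuming two GPUs; `one_scale` is a hypothetical stand-in for the per-scale photometric reconstruction loss, not this repo's actual function:

```python
import torch

def photometric_loss_on_two_gpus(tgt_img, ref_imgs, intrinsics, depths, pose, one_scale):
    """Spread the per-scale loss computation over two devices to balance memory.
    `depths` is the list of multi-scale depth maps; `one_scale` computes the
    photometric reconstruction loss for a single scale."""
    devices = [torch.device('cuda:0'), torch.device('cuda:1')]
    total = torch.zeros(1, device=devices[0])
    for i, depth in enumerate(depths):
        dev = devices[i % len(devices)]          # alternate scales between GPUs
        loss_i = one_scale(tgt_img.to(dev),
                           [img.to(dev) for img in ref_imgs],
                           intrinsics.to(dev),
                           depth.to(dev),
                           pose.to(dev))
        total = total + loss_i.to(devices[0])    # accumulate on one device for backward
    return total
```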

Hello, thanks for using this code !

The big-batch training issue is more a machine learning problem than an implementation problem, I think!
Basically, KITTI is a very hard dataset when it comes to regularization. You can see it on road textures, which are as uniformly textured as the sky, so the network tends to consider them to be at infinite distance.

Having stochastic training, i.e. a small batch size, helps the training not to overfit this part of the road, I think. With a large batch size, you might want to try other regularisation techniques.

A bunch of them have been tried in recent papers, such as GeoNet (https://github.com/yzcjtr/GeoNet) or adversarial collaboration (https://github.com/anuragranj/ac) (or mine here 👼 https://github.com/ClementPinard/unsupervised-depthnet)

One of the key regularization techniques is, to my mind, scaling the smooth loss with image textureness. You might want to try it and set a higher smooth loss scale. You can see an example here.
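(The link behind "here" is not preserved in this thread. As a rough illustration only, not necessarily this repo's exact implementation, an edge-aware smooth loss down-weights disparity gradients where the image itself has strong gradients:)

```python
import torch

def edge_aware_smooth_loss(disp, img):
    """Smoothness scaled by image 'textureness': disparity gradients are penalized
    less where the image has strong gradients (edges/texture).
    Assumes disp has shape (B, 1, H, W) and img has shape (B, 3, H, W)."""
    # first-order gradients of the predicted disparity
    disp_dx = torch.abs(disp[:, :, :, :-1] - disp[:, :, :, 1:])
    disp_dy = torch.abs(disp[:, :, :-1, :] - disp[:, :, 1:, :])
    # image gradients, averaged over color channels
    img_dx = torch.mean(torch.abs(img[:, :, :, :-1] - img[:, :, :, 1:]), 1, keepdim=True)
    img_dy = torch.mean(torch.abs(img[:, :, :-1, :] - img[:, :, 1:, :]), 1, keepdim=True)
    # down-weight the smoothness penalty where the image is textured
    disp_dx = disp_dx * torch.exp(-img_dx)
    disp_dy = disp_dy * torch.exp(-img_dy)
    return disp_dx.mean() + disp_dy.mean()
```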

As for multi-GPU training, I actually don't have multiple GPUs, so I could not test it thoroughly. But @anuragranj told me he was able to encapsulate the loss function inside a torch.nn.DataParallel

The key here is probably to make a Module which will call the loss function, which you then put inside a torch.nn.DataParallel so that the batch splitting occurs before the loss computation.
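A minimal sketch of that wrapper (the usage lines are hypothetical and assume the wrapped loss returns a scalar per replica):

```python
import torch.nn as nn

class LossWrapper(nn.Module):
    """Wrap the loss computation in a Module so that nn.DataParallel splits the
    batch across GPUs before the loss (and its photometric warping) is evaluated."""
    def __init__(self, loss_fn):
        super().__init__()
        self.loss_fn = loss_fn

    def forward(self, *inputs):
        # each replica receives its own slice of the batch on its own GPU
        return self.loss_fn(*inputs)

# hypothetical usage, with photometric_reconstruction_loss standing in for the
# loss you want to parallelize:
# parallel_loss = nn.DataParallel(LossWrapper(photometric_reconstruction_loss))
# loss = parallel_loss(tgt_img, ref_imgs, intrinsics, depth, pose).mean()
# (DataParallel gathers the per-replica scalars into a vector, hence the .mean())
```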

@ClementPinard Thank you very much, I see.

@youmi-zym By the way, I followed this for multigpu. https://github.com/NVIDIA/flownet2-pytorch
Also, you don't want to change the batch size, even if you use multigpu.

@ClementPinard @anuragranj Yeah, thanks for your help, and I have implemented a multi-GPU loss function following your advice. However, I still have some experiment results to share with you (smooth-loss, texture-smooth-loss):
[image: table of experiment results for the smooth-loss and texture-smooth-loss runs]

These experiments were run with the command at the top; the smooth-loss and texture-smooth-loss are the ones mentioned above, with the weight set as in the command. Also, training beyond epoch 140 or 160 doesn't improve any more.
It shows that the batch size does have a modest influence on the result, but when visualizing the results, I find that a small batch size is good for close objects, while a big batch size pays more attention to optimizing faraway objects.
The point is that I still get worse results than yours... It's weird.....

According to the original author, quality worsens after 160K iterations anyway (tinghuiz/SfMLearner#42)
Other than that, it's odd that batch size 4 and 1 GPU give you worse results than the ones reported in the README.
I'll try a training run and see if there is any regression.

How did you split your training dataset ?

Well, there's a big jet lag between us...
Here is the content of val.txt:

2011_09_26_drive_0005_sync_02
2011_09_26_drive_0005_sync_03
2011_09_26_drive_0039_sync_02
2011_09_26_drive_0039_sync_03
2011_09_26_drive_0051_sync_02
2011_09_26_drive_0051_sync_03
2011_09_26_drive_0019_sync_03
2011_09_26_drive_0019_sync_02

41664 samples found in 64 train scenes
2452 samples found in 8 valid scenes
train.txt
val.txt

Thanks a lot!

Here are my results with my own split :

Results with scale factor determined by GT/prediction ratio (like the original paper) : 
   abs_rel,     sq_rel,        rms,    log_rms,         a1,         a2,         a3
    0.1840,     1.3708,     6.3823,     0.2669,     0.7187,     0.9023,     0.9627

I used a regular smooth loss and trained on a clean version of the repo.
What pretrained model did you use ? Make sure you used the "model_best" version and not the "checkpoint"

I'll check with your split

Results with your split, using model_best :

Results with scale factor determined by GT/prediction ratio (like the original paper) : 
   abs_rel,     sq_rel,        rms,    log_rms,         a1,         a2,         a3
    0.1854,     1.3986,     6.4104,     0.2687,     0.7149,     0.8985,     0.9619

Results with your split, using checkpoint :

Results with scale factor determined by GT/prediction ratio (like the original paper) : 
   abs_rel,     sq_rel,        rms,    log_rms,         a1,         a2,         a3
    0.2040,     1.8203,     6.6266,     0.2914,     0.6971,     0.8848,     0.9510

As such, I think you only used the checkpoint.pth.tar. This is consistent with the author's claim that you eventually end up with worse results if you keep on training for more than 140K iterations.

@ClementPinard Thanks for your work and reply. I used the "model_best" version... I will try again with 140k iterations. Thanks a lot.

Here is my exact dataset, in case there has been some regression in the data preprocessing script:

https://mega.nz/#!OIEwEQ4a!Yz5aRFjPHxNwCV2sxIslgWPfppAj_WOpthNTqWUvByo

Thanks a lot ! I will check it and try again !

The result is still bad....
First I cloned the latest repo, used the KITTI-GT data you gave me, and trained with the command below:

python3 train.py /home/data/UnsupervisedDepth/KITTI-raw/KITTI_GT -b4 -m0 -s2.0 --epochs 140 --epoch-size 1000 --sequence-length 5 --log-output --with-gt

Then, I use the command below to test:

python3 test_disp.py 	--pretrained-dispnet /home/data/youmi/projects/depth/SfmLearner-Pytorch/checkpoints/KITTI_GT,140epochs,epoch_size1000,seq5,s2.0/10-13-10:32/dispnet_model_best.pth.tar \
			--dataset-dir /home/data/UnsupervisedDepth/KITTI-raw/raw_data_KITTI \
			--dataset-list ./kitti_eval/test_files_eigen.txt

Finally, the result I get:

no PoseNet specified, scale_factor will be determined by median ratio, which is kiiinda cheating (but consistent with original paper)
getting test metadata ...
100%|██████████| 697/697 [00:04<00:00, 156.10it/s]
697 files to test
100%|██████████| 697/697 [00:46<00:00, 14.91it/s]

Results with scale factor determined by GT/prediction ratio (like the original paper) : 
   abs_rel,     sq_rel,        rms,    log_rms,         a1,         a2,         a3
    0.1923,     1.4461,     6.4234,     0.2705,     0.7131,     0.8977,     0.9608

And yesterday I retrained with 160k iterations; the new result is:

no PoseNet specified, scale_factor will be determined by median ratio, which is kiiinda cheating (but consistent with original paper)
getting test metadata ...
100%|██████████| 697/697 [00:04<00:00, 159.61it/s]
697 files to test
100%|██████████| 697/697 [00:46<00:00, 14.87it/s]
Results with scale factor determined by GT/prediction ratio (like the original paper) : 
   abs_rel,     sq_rel,        rms,    log_rms,         a1,         a2,         a3
    0.1871,     1.4147,     6.4782,     0.2687,     0.7140,     0.8982,     0.9621

The result is similar to yours, and it suggests that 140k iterations may not be the bottleneck. I will try 200k iterations next.

Update! Here are the results after 200k iterations:
no PoseNet specified, scale_factor will be determined by median ratio, which is kiiinda cheating (but consistent with original paper)
getting test metadata ...
100%|██████████| 697/697 [00:04<00:00, 160.04it/s]
697 files to test
100%|██████████| 697/697 [00:46<00:00, 14.95it/s]

Results with scale factor determined by GT/prediction ratio (like the original paper) : 
   abs_rel,     sq_rel,        rms,    log_rms,         a1,         a2,         a3
    0.1960,     1.4912,     6.2764,     0.2698,     0.7004,     0.9006,     0.9631

Thus, the number of iterations really has a big influence on the result; it's difficult to determine the best value for it.

Thanks for your insight !

Since it got worse with more training, there may be a problem with the validation set, since it is supposed to select the best network. Maybe it's not representative enough compared to the test set ?

Anyway, this self-supervision problem is very hard to make converge correctly, and it is hard to interpret the differences between results.

Good luck with your research!

@ClementPinard Thanks very much