ifnspaml/SGDepth

Dataset get item format

VladimirYugay opened this issue · 7 comments

Hey there,

Thanks for your work!

I have a question regarding the dataset format. I'm currently using my own dataset for both depth and segmentation and decided to implement it via simple_mode with the read_from_folder function. Is there any way to know what is the expected format of a sample in case of default parameters?

I created a test directory containing 'color', 'depth', 'segmentation' subfolders, and the description file. All images are named from 0000 to 0010.
For the training dataset, I get the following 'data_files':

('color', 0, -1): ['color/0005.jpg', 'color/0006.jpg', 'color/0009.jpg', 'color/0008.jpg', 'color/0002.jpg', 'color/0001.jpg', 'color/0007.jpg', 'color/0010.jpg', 'color/0000.jpg']

('color', -1, -1): ['color/0003.jpg', 'color/0005.jpg', 'color/0006.jpg', 'color/0009.jpg', 'color/0008.jpg', 'color/0002.jpg', 'color/0001.jpg', 'color/0007.jpg', 'color/0010.jpg']

('color', 1, -1): ['color/0006.jpg', 'color/0009.jpg', 'color/0008.jpg', 'color/0002.jpg', 'color/0001.jpg', 'color/0007.jpg', 'color/0010.jpg', 'color/0000.jpg', 'color/0004.jpg']

I've read the docs for the data loader repo, but shouldn't (0, -1, 1) be something like (0003, 0004, 0005)?

And another question, which part of the cityscapes was used? There are several options available for download on the website.

Hey there,

To be honest, I never really used or tested simple_mode on sequential data; I mainly used it for pair-wise data such as color images and segmentation masks. From the looks of it, though, I suspect that the images are sorted incorrectly. I have noticed several times that the default file ordering obtained through the os library differs between Linux and Windows, so a possible fix might be to enforce alphanumerical sorting of the list of images with some_list = sorted(some_list, key=str.lower). I will try to look into it further if I have time, but maybe this helps already.
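A minimal sketch of the sorting fix suggested above, assuming the file names are collected as plain strings (the list contents and the helper name `natural_key` are illustrative, not part of the dataloader). Sorting with `key=str.lower` is enough for zero-padded names such as 0000.jpg; for names without padding (1.jpg, 10.jpg) a "natural" key that compares the numeric parts as integers may be needed.

```python
import re

def natural_key(path):
    """Hypothetical helper: split a path into text and integer chunks
    so that '10.jpg' sorts after '9.jpg'."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", path)]

# Zero-padded names: plain case-insensitive sorting is sufficient.
files = ["color/0005.jpg", "color/0010.jpg", "color/0000.jpg", "color/0003.jpg"]
sorted_plain = sorted(files, key=str.lower)

# Unpadded names: compare the numbers as integers, not as strings.
sorted_natural = sorted(["10.jpg", "2.jpg", "1.jpg"], key=natural_key)
```

Note that `key=str.lower` passes the method itself; writing `str.lower()` with parentheses would raise a TypeError.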

Regarding the Cityscapes images, I used the training split of the standard 5000 images (leftImg8bit_trainvaltest.zip, gtFine_trainvaltest.zip) for training and the corresponding 500 validation images for segmentation evaluation.

Hope this helps!

Hey there,

Thanks for the response. I can definitely fix the sorting issue; the only thing I'm unsure about is how the result should look in the end.
If we have a folder 'color' with 5 images with names 1.jpg, 2.jpg ... 5.jpg, the resulting 'data_files' should look like:

('color', 0, -1): [2.jpg, 3.jpg, 4.jpg]

('color', -1, -1): [1.jpg, 2.jpg, 3.jpg]

('color', 1, -1): [3.jpg, 4.jpg, 5.jpg]

?

Hey,

Yes, this is the format I would also expect. The second index in the key gives the frame number, so the key ('color', 0, -1) should contain all images for which you have a preceding and a succeeding frame (2.jpg, 3.jpg, 4.jpg). The corresponding preceding and succeeding frames should be in the other keys (-1 = preceding, 1 = succeeding).
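The expected layout can be sketched as follows. This is only an illustration of the format described above, not the dataloader's own code: the function name `build_sequence_keys` and its parameters are hypothetical, and the third key element (-1) is simply copied from the keys shown in this thread.

```python
def build_sequence_keys(frames, name="color", offsets=(0, -1, 1), third=-1):
    """Hypothetical helper: map each frame offset to an aligned list of files.

    Only center frames that have every requested neighbour are kept, so all
    lists have the same length and index i refers to the same sample.
    """
    frames = sorted(frames)
    before = max(0, -min(offsets))  # frames needed before the center frame
    after = max(0, max(offsets))    # frames needed after the center frame
    centers = range(before, len(frames) - after)
    return {(name, off, third): [frames[i + off] for i in centers]
            for off in offsets}

data_files = build_sequence_keys(["1.jpg", "2.jpg", "3.jpg", "4.jpg", "5.jpg"])
```

With five frames this yields three samples: sample i takes its center frame from key ('color', 0, -1) and its neighbours from the same position i in the other two lists.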

Hey there,

Thanks for the help. I have a question regarding learning the depth.
The setup is exactly the same as before. There's quite some camera motion between the frames (similar to KITTI), and there are only 10 images on which I try to overfit. I disabled all augmentations that might complicate training in this particular case.

I have a different segmentation task with only two classes to segment. When I first tried cross-entropy with some class weighting, it refused to learn at all because my classes are unbalanced; after switching to focal loss with a high focusing parameter, it got better. However, I still can't overfit the depth (10 epochs). Below is the result I get from the inference script:

[image: 045]

What are the possible reasons for such behavior? Depth loss during training is always around 0.11. I've also checked whether the images [-1, 0, 1] are passed correctly to the loss computation function.
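For reference, the focal loss mentioned above can be sketched per pixel as below. This is the standard formulation, FL(p) = -alpha * (1 - p)^gamma * log(p), with p the predicted probability of the true class; it is not taken from this repository, and the function name and default values are assumptions for illustration.

```python
import math

def focal_loss(prob_true_class, gamma=2.0, alpha=1.0):
    """Per-pixel focal loss for the predicted probability of the true class.

    gamma is the focusing parameter; gamma = 0 reduces to (weighted)
    cross-entropy. alpha is an optional class weight.
    """
    p = min(max(prob_true_class, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)

easy_ce = focal_loss(0.9, gamma=0.0)  # plain cross-entropy on an easy pixel
easy_fl = focal_loss(0.9, gamma=2.0)  # same pixel, strongly down-weighted
hard_fl = focal_loss(0.1, gamma=2.0)  # hard pixel keeps most of its loss
```

A higher gamma shrinks the contribution of well-classified pixels, which is why it can help when one class dominates.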

Hey there,

On some datasets I observed the same behaviour: the depth training tends to be rather unstable in the beginning, so the output is just constant and the loss remains unchanged. The combined training of two networks (depth + pose) with just a single loss tends to be sensitive to the choice of the initial images, though I could not find a pattern here yet. If you train the depth without the segmentation part, does it converge then?

A first solution could then be to use network weights pretrained on KITTI and see whether the training/overfitting converges. My guess would be that if the initial output is already closer to a depth map, the training should be more stable.

Another possibility would be to use some kind of supervision for the pose network (would also solve the scale ambiguity), if that is an option in your case.

Hey there,

I tried it without the segmentation loss by setting loss = loss_depth instead of loss = loss_depth + loss_seg; the depth is still dead.

I tried starting from the checkpoint, excluding the segmentation blocks (we have 2 classes for segmentation instead of 20). Got the following result after overfitting for 10 epochs.

After overfitting even more, for 50 epochs, the quality of the depth map improved, while the segmentation became partially "killed".

However, the default network output has a smoother depth.

  1. Is there any other way of making the depth map smoother besides --depth-disparity-smoothness? I've tried values like 1, 0.1, 0.01, and they are not better than the default.

  2. What do you mean by pose supervision? Does it mean inserting some ground-truth egomotion for some of the sequences during training? Won't this lead to overfitting for egomotion?

Hey there,

Regarding 1: you could also try to train at a lower resolution, depending on what output resolution you need in the end. If 416/128 is enough for you, then this might work. Also, the multi-task training is sometimes sensitive to the weighting of the two tasks, so you could check whether varying this weighting gives you better results.

Regarding 2: I was indeed thinking of supervising the egomotion with some kind of ground truth. If you have the full ground truth, you could use that; if not, the velocity and time stamps of the images can be used to constrain the translation between two images, as in https://arxiv.org/pdf/1905.02693.pdf. This might lead to a little more overfitting, but it could also stabilize the depth/egomotion training in the beginning through this additional constraint.
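A rough sketch of what such a velocity constraint could look like, assuming the pose network predicts a translation vector between two frames and that the vehicle speed and the time difference between the frames are known. The function name, argument names, and units are assumptions for illustration; this is not code from this repository or from the linked paper.

```python
import math

def velocity_loss(pred_translation, speed, dt):
    """Penalize the gap between the predicted travel distance and speed * dt.

    pred_translation: (tx, ty, tz) predicted by the pose network (assumed metres)
    speed: measured vehicle speed (assumed metres per second)
    dt: time between the two frames (assumed seconds)
    """
    pred_dist = math.sqrt(sum(c * c for c in pred_translation))
    return abs(pred_dist - abs(speed) * dt)

consistent = velocity_loss((3.0, 4.0, 0.0), speed=10.0, dt=0.5)    # 5 m vs 5 m
inconsistent = velocity_loss((0.0, 0.0, 1.0), speed=10.0, dt=0.5)  # 1 m vs 5 m
```

Such a term would be added to the depth loss with some weight. It only constrains the magnitude of the translation, not its direction or the rotation.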