MixerUNet from TransUNet (Monocular Depth Estimation)

Using TransUNet and a modified MixerUNet for depth estimation on the KITTI dataset, and comparing ViT (TransUNet) with MLP-Mixer (MixerUNet) by their performance.

Reference

Depth estimation model "BTS" => reference for the data loading, training, and testing code.

Semantic segmentation model "TransUNet" => reference for the network code.

image

Preparation

1. Available pre-trained ViT models and MLP-Mixer models

2. Prepare KITTI dataset and Project folder

.  
├── dataset  
│     └── kitti_dataset  
│            ├── 2011_09_26  
│            ├── ...  
│            ├── 2011_10_03  
│            └── data_depth_annotated  
│                   ├── 2011_09_26_drive_0001_sync  
│                   └── ...  
└── MixerUNet  
     ├── checkpoints  
     ├── models 
     ├── outputs  
     ├── results  
     └── train_test_inputs 
      

3. Training

Set args.model_name to whatever you like.
args.vit_name: choose between R50-ViT-B_16 (for ViT) and R50-Mixer-My_16 (for MLP-Mixer).

  • For training TransUNet
python main.py arguments_train_TransUNet.py
  • For training MixerUNet
python main.py arguments_train_MixerUNet.py

4. Testing and saving results

Set args.checkpoint_path ("./outputs/<args.model_name>/") to the model trained above.

  • For Testing TransUNet
python main.py arguments_test_TransUNet.py
  • For Testing MixerUNet
python main.py arguments_test_MixerUNet.py

Implementation Details

TransUNet

image

Image Loader

The KITTI dataset consists of images of size [375, 1242] or [376, 1241].
When loading the data, dataloader.py crops the images to [352, 1216], which is divisible by 16.
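A minimal sketch of such a crop. The target size matches the text above; the exact offsets (bottom rows kept, horizontally centered) are an assumption, not necessarily what the repo's dataloader.py does:

```python
import numpy as np

def kitti_crop(img: np.ndarray, target_h: int = 352, target_w: int = 1216) -> np.ndarray:
    """Crop a KITTI frame to a size divisible by the patch size (16).
    Keeps the bottom rows and centers horizontally (assumed offsets)."""
    h, w = img.shape[:2]
    top = h - target_h          # drop rows from the top
    left = (w - target_w) // 2  # center the crop horizontally
    return img[top:top + target_h, left:left + target_w]

print(kitti_crop(np.zeros((375, 1242, 3))).shape)  # (352, 1216, 3)
```

Both KITTI sizes ([375, 1242] and [376, 1241]) map to the same [352, 1216] output.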
The TransUNet code only accepts square inputs ([224, 224]), but KITTI images have different widths and heights.
So I modified the TransUNet code, specifically the encoder, to accept inputs with differing width and height.
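The idea behind that encoder change can be sketched as a conv-based patch embedding that records the rectangular token grid instead of assuming a square one (a sketch under assumed names, not the repo's exact code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patch embedding that works for rectangular inputs (illustrative)."""
    def __init__(self, in_ch: int = 3, dim: int = 768, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        x = self.proj(x)        # (B, dim, H/16, W/16), no square assumption
        grid = x.shape[2:]      # remember the (h, w) token grid for the decoder
        return x.flatten(2).transpose(1, 2), grid  # tokens: (B, N, dim)

tokens, grid = PatchEmbed()(torch.randn(1, 3, 352, 704))
print(tokens.shape, grid)  # torch.Size([1, 968, 768]) torch.Size([22, 44])
```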

Training (without position embeddings)

During training, images are randomly cropped to [352, 704].
TransUNet's reshape step between the encoder and decoder relies on the input image size, so I modified the code for this issue.
Modified the class DecoderCup() in model.py.

Online Eval

During evaluation, images are not randomly cropped, so the input size is [352, 1216].
Because of this, the reshape dimensions between the encoder and decoder were an issue.
I modified DecoderCup()'s forward() in model.py by adding a reshape_size parameter that reshapes the decoder input according to the input image shape.
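The reshape_size idea can be sketched as a helper like the following (hypothetical signature; the repo's DecoderCup differs):

```python
import torch

def tokens_to_feature_map(hidden_states: torch.Tensor, reshape_size) -> torch.Tensor:
    """Turn encoder tokens (B, N, C) back into a feature map (B, C, h, w).
    reshape_size = (h, w) follows the input image: e.g. (22, 44) for
    352x704 training crops, (22, 76) for full 352x1216 evaluation images."""
    b, n, c = hidden_states.shape
    h, w = reshape_size
    assert n == h * w, "token count must match the grid"
    return hidden_states.permute(0, 2, 1).reshape(b, c, h, w)

feat = tokens_to_feature_map(torch.randn(2, 22 * 76, 768), (22, 76))
print(feat.shape)  # torch.Size([2, 768, 22, 76])
```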

Testing and saving the output image (estimated depth)

Because the model was trained without position embeddings, load the checkpoint with model.load_state_dict(checkpoint['model'], strict=False) so that the position-embedding weights (or any other missing weights) are not loaded.
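A toy demonstration of what strict=False does (a missing 'bias' key stands in for the missing position-embedding weights):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
ckpt = {'weight': torch.zeros(2, 4)}  # checkpoint without 'bias'
missing, unexpected = model.load_state_dict(ckpt, strict=False)
print(missing, unexpected)  # ['bias'] [] -- missing keys are skipped instead of raising
```

With strict=True (the default), the same call would raise a RuntimeError about the missing key.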

1st Trial (without pretraining and position embeddings)

after 16 epochs

| model | d1 | d2 | d3 | silog | rms | abs_rel | log_rms | log10 | sq_rel |
|---|---|---|---|---|---|---|---|---|---|
| TransUNet | 0.90746 | 0.98142 | 0.99575 | 12.19566 | 3.19173 | 0.08877 | 0.13404 | 0.03874 | 0.39217 |

<2011_09_26_drive_0009_sync_0000000128.png> image <2011_09_26_drive_0013_sync_0000000085.png> image

2nd Trial (pretrained weights without position embeddings)

after 11 epochs

| model | d1 | d2 | d3 | silog | rms | abs_rel | log_rms | log10 | sq_rel |
|---|---|---|---|---|---|---|---|---|---|
| TransUNet | 0.90746 | 0.98142 | 0.99575 | 12.19566 | 3.19173 | 0.08877 | 0.13404 | 0.03874 | 0.39217 |
| TransUNet_Pre | 0.91733 | 0.98775 | 0.99784 | 9.64381 | 2.75907 | 0.09390 | 0.12225 | 0.03917 | 0.32307 |

<2011_09_26_drive_0009_sync_0000000128.png>
image <2011_09_26_drive_0013_sync_0000000085.png>
image <2011_09_26_drive_0052_sync_0000000030.png>
image <2011_09_26_drive_0117_sync_0000000572.png>
image

Interim check

image
Not predicting well on bright and far-away regions (e.g. sky, high-contrast pixels)

MixerUNet

image

3rd Trial (ViT -> MLP-Mixer)

Limitations

  1. Because of the MLPs, the encoder input size is fixed, so the training and testing input sizes must be the same.
    Cannot random-crop the input from (352, 1216) to (352, 704) during training.
    a. Do not random-crop; train on the whole image. => Used method.
    b. RandomResizedCrop could be considered, but changing the image's aspect ratio might affect training and prediction.
  2. Cannot load the pretrained MLP-Mixer weights directly.
    The MLP-Mixer was pretrained on (224, 224) images, so the MLP block input is fixed at (196, 768), which does not match image sizes of (352, 704) or (352, 1216).
    a. Train from scratch.
    b. Weight initialization.
    c. Pretrained channel-mixing weights with freshly initialized token-mixing weights. => Used method.
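Option (c) can be sketched as shape-filtered weight loading: channel-mixing layers are token-count independent and can be copied, while token-mixing layers (196 vs 1672 tokens) keep their fresh initialization. The layer names below (token_mixing, channel_mixing) are illustrative, not the repo's actual names:

```python
import torch
import torch.nn as nn

class ToyMixerBlock(nn.Module):
    """Illustrative Mixer block; the real layer names may differ."""
    def __init__(self, n_tokens: int, dim: int):
        super().__init__()
        self.token_mixing = nn.Linear(n_tokens, n_tokens)  # shape depends on token count
        self.channel_mixing = nn.Linear(dim, dim)          # shape independent of token count

def load_channel_mixing_only(model, pretrained_state):
    # Keep only channel-mixing weights whose shapes match the target model;
    # everything else (token mixing) stays at its fresh initialization.
    own = model.state_dict()
    kept = {k: v for k, v in pretrained_state.items()
            if 'channel_mixing' in k and k in own and v.shape == own[k].shape}
    own.update(kept)
    model.load_state_dict(own)
    return sorted(kept)

pretrained = ToyMixerBlock(n_tokens=196, dim=768)   # 224x224 pretraining: (224/16)^2 tokens
model = ToyMixerBlock(n_tokens=1672, dim=768)       # 352x1216 KITTI: (352*1216)/16^2 tokens
loaded = load_channel_mixing_only(model, pretrained.state_dict())
print(loaded)  # ['channel_mixing.bias', 'channel_mixing.weight']
```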

after 18 epochs

| model | d1 | d2 | d3 | silog | rms | abs_rel | log_rms | log10 | sq_rel |
|---|---|---|---|---|---|---|---|---|---|
| TransUNet | 0.90746 | 0.98142 | 0.99575 | 12.19566 | 3.19173 | 0.08877 | 0.13404 | 0.03874 | 0.39217 |
| TransUNet_Pre | 0.91733 | 0.98775 | 0.99784 | 9.64381 | 2.75907 | 0.09390 | 0.12225 | 0.03917 | 0.32307 |
| MixerUNet | 0.90336 | 0.97935 | 0.99501 | 12.21110 | 3.33323 | 0.09647 | 0.13867 | 0.04194 | 0.42090 |

<2011_09_26_drive_0009_sync_0000000128.png>
image
<2011_09_26_drive_0013_sync_0000000085.png>
image <2011_09_26_drive_0052_sync_0000000030.png>
image <2011_09_26_drive_0117_sync_0000000572.png>
image

4th Trial (token-mixing MLP dim: 384 -> 384*8)

The input size had not been taken into account.
For the standard MLP-Mixer, the input size is 224, so the number of tokens is (224/16)^2 = 196.
However, the KITTI input size is 352x1216, so the number of tokens is (352*1216)/16^2 = 1672, which is about 8.5 times more than 196.
So the token-mixing layer's MLP dimension for the KITTI dataset should be about 8 times larger (384*8) than the standard token-mixing MLP dimension (384).
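The token-count arithmetic above as a quick check:

```python
patch = 16
std_tokens = (224 // patch) ** 2                  # 196 tokens for the 224x224 pretrained Mixer
kitti_tokens = (352 // patch) * (1216 // patch)   # 22 * 76 = 1672 tokens for full KITTI inputs
ratio = kitti_tokens / std_tokens
print(std_tokens, kitti_tokens, round(ratio, 2))  # 196 1672 8.53
```

Rounding the ~8.5x ratio down gives the 384*8 token-mixing width used in this trial.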

after 17 epochs, lr 1e-4 => 1e-3

| model | d1 | d2 | d3 | silog | rms | abs_rel | log_rms | log10 | sq_rel |
|---|---|---|---|---|---|---|---|---|---|
| TransUNet | 0.90746 | 0.98142 | 0.99575 | 12.19566 | 3.19173 | 0.08877 | 0.13404 | 0.03874 | 0.39217 |
| TransUNet_Pre | 0.91733 | 0.98775 | 0.99784 | 9.64381 | 2.75907 | 0.09390 | 0.12225 | 0.03917 | 0.32307 |
| MixerUNet | 0.90336 | 0.97935 | 0.99501 | 12.21110 | 3.33323 | 0.09647 | 0.13867 | 0.04194 | 0.42090 |
| MixerUNet_Pre | 0.92374 | 0.9858 | 0.99702 | 10.81143 | 3.01994 | 0.08174 | 0.12245 | 0.03585 | 0.34027 |

<2011_09_26_drive_0009_sync_0000000128.png>
image <2011_09_26_drive_0013_sync_0000000085.png>
image <2011_09_26_drive_0052_sync_0000000030.png>
image <2011_09_26_drive_0117_sync_0000000572.png>
image

Results

| RGB | TransUNet | TransUNet_pretrained | MixerUNet | MixerUNet_pretrained |
|---|---|---|---|---|
| image | image | image | image | image |
| image | image | image | image | image |
| image | image | image | image | image |
| image | image | image | image | image |
| model | testing_time (sec) | parameters | d1 | d2 | d3 | silog | rms | abs_rel | log_rms | log10 | sq_rel |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TransUNet_Pre | 126.24 | 105M | 0.91733 | 0.98775 | 0.99784 | 9.64381 | 2.75907 | 0.09390 | 0.12225 | 0.03917 | 0.32307 |
| MixerUNet_Pre | 93.67 | 200M | 0.92374 | 0.9858 | 0.99702 | 10.81143 | 3.01994 | 0.08174 | 0.12245 | 0.03585 | 0.34027 |
  • MLP-Mixer's biggest limitation is its fixed input dimension, which forces the training and testing image sizes to be the same. This is crucial because not being able to train on randomly cropped images limits the model's performance.
    • Curious whether random-cropping and then resizing back to the original input size would affect the model's performance, and how.
  • Despite this limitation of MLP-Mixer, its performance is quite good compared to ViT's.
  • Also, although MixerUNet (MLP-Mixer) has more parameters, its testing time is shorter than TransUNet's (ViT); TransUNet (ViT) requires more computation.