MixerUNet from TransUNet (Monocular Depth Estimation)

Using TransUNet and a modified MixerUNet for depth estimation on the KITTI dataset, and comparing ViT (TransUNet) with MLP-Mixer (MixerUNet) by their performance.

Reference

Depth estimation model "BTS" => reference for the data loading, training, and testing code.

Semantic segmentation model "TransUNet" => reference for the network code.

image

Preparation

1. Available pre-trained ViT models and MLP-Mixer models

2. Prepare KITTI dataset and Project folder

.  
├── dataset  
│     └── kitti_dataset  
│            ├── 2011_09_26  
│            ├── ...  
│            ├── 2011_10_03  
│            └── data_depth_annotated  
│                   ├── 2011_09_26_drive_0001_sync  
│                   └── ...  
└── MixerUNet  
     ├── checkpoints  
     ├── models 
     ├── outputs  
     ├── results  
     └── train_test_inputs 
      

3. Training

Set args.model_name to whatever you like.
args.vit_name: choose between R50-ViT-B_16 (for ViT) and R50-Mixer-My_16 (for MLP-Mixer).

  • For training TransUNet
python main.py arguments_train_TransUNet.py
  • For training MixerUNet
python main.py arguments_train_MixerUNet.py

4. Testing and saving results

Set args.checkpoint_path ("./outputs/<args.model_name>/") to the model trained above.

  • For Testing TransUNet
python main.py arguments_test_TransUNet.py
  • For Testing MixerUNet
python main.py arguments_test_MixerUNet.py

Implementation Details

TransUNet

image

Image Loader

The KITTI dataset consists of images of size [375, 1242] or [376, 1241].
When loading the data, dataloader.py crops the images to [352, 1216], which is divisible by 16.
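A minimal sketch of such a crop. The target size matches the text above; the exact offsets (bottom rows kept, horizontally centered) are an assumption, not necessarily what the repo's dataloader.py does:

```python
import numpy as np

def kitti_crop(img: np.ndarray, target_h: int = 352, target_w: int = 1216) -> np.ndarray:
    """Crop a KITTI frame to a size divisible by the patch size (16).
    Keeps the bottom rows and centers horizontally (assumed offsets)."""
    h, w = img.shape[:2]
    top = h - target_h          # drop rows from the top
    left = (w - target_w) // 2  # center the crop horizontally
    return img[top:top + target_h, left:left + target_w]

print(kitti_crop(np.zeros((375, 1242, 3))).shape)  # (352, 1216, 3)
```

Both KITTI sizes ([375, 1242] and [376, 1241]) map to the same [352, 1216] output.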
The TransUNet code only accepts square inputs ([224, 224]), but KITTI images have different widths and heights.
So I modified the TransUNet code, specifically the encoder, to accept inputs with differing width and height.
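The idea behind that encoder change can be sketched as a conv-based patch embedding that records the rectangular token grid instead of assuming a square one (a sketch under assumed names, not the repo's exact code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patch embedding that works for rectangular inputs (illustrative)."""
    def __init__(self, in_ch: int = 3, dim: int = 768, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        x = self.proj(x)        # (B, dim, H/16, W/16), no square assumption
        grid = x.shape[2:]      # remember the (h, w) token grid for the decoder
        return x.flatten(2).transpose(1, 2), grid  # tokens: (B, N, dim)

tokens, grid = PatchEmbed()(torch.randn(1, 3, 352, 704))
print(tokens.shape, grid)  # torch.Size([1, 968, 768]) torch.Size([22, 44])
```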

Training (without position embeddings)

During training, images are randomly cropped to [352, 704].
TransUNet's reshape step between the encoder and decoder relies on the input image size, so I modified the code for this issue.
Modified the class DecoderCup() in model.py.

Online Eval

During evaluation, images are not randomly cropped, so the input size is [352, 1216].
Because of this, the reshape dimensions between the encoder and decoder were an issue.
I modified DecoderCup()'s forward() in model.py by adding a reshape_size parameter that reshapes the decoder input according to the input image shape.
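The reshape_size idea can be sketched as a helper like the following (hypothetical signature; the repo's DecoderCup differs):

```python
import torch

def tokens_to_feature_map(hidden_states: torch.Tensor, reshape_size) -> torch.Tensor:
    """Turn encoder tokens (B, N, C) back into a feature map (B, C, h, w).
    reshape_size = (h, w) follows the input image: e.g. (22, 44) for
    352x704 training crops, (22, 76) for full 352x1216 evaluation images."""
    b, n, c = hidden_states.shape
    h, w = reshape_size
    assert n == h * w, "token count must match the grid"
    return hidden_states.permute(0, 2, 1).reshape(b, c, h, w)

feat = tokens_to_feature_map(torch.randn(2, 22 * 76, 768), (22, 76))
print(feat.shape)  # torch.Size([2, 768, 22, 76])
```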

Testing and saving the output image (estimated depth)

Because the model was trained without position embeddings, load the checkpoint with model.load_state_dict(checkpoint['model'], strict=False) so that the position-embedding weights (or any other missing weights) are not loaded.
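A toy demonstration of what strict=False does (a missing 'bias' key stands in for the missing position-embedding weights):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
ckpt = {'weight': torch.zeros(2, 4)}  # checkpoint without 'bias'
missing, unexpected = model.load_state_dict(ckpt, strict=False)
print(missing, unexpected)  # ['bias'] [] -- missing keys are skipped instead of raising
```

With strict=True (the default), the same call would raise a RuntimeError about the missing key.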

1st Trial (without pretraining and position embeddings)

after 16 epochs

| model | d1 | d2 | d3 | silog | rms | abs_rel | log_rms | log10 | sq_rel |
|---|---|---|---|---|---|---|---|---|---|
| TransUNet | 0.90746 | 0.98142 | 0.99575 | 12.19566 | 3.19173 | 0.08877 | 0.13404 | 0.03874 | 0.39217 |

<2011_09_26_drive_0009_sync_0000000128.png> image <2011_09_26_drive_0013_sync_0000000085.png> image

2nd Trial (pretrained weights without position embeddings)

after 11 epochs

| model | d1 | d2 | d3 | silog | rms | abs_rel | log_rms | log10 | sq_rel |
|---|---|---|---|---|---|---|---|---|---|
| TransUNet | 0.90746 | 0.98142 | 0.99575 | 12.19566 | 3.19173 | 0.08877 | 0.13404 | 0.03874 | 0.39217 |
| TransUNet_Pre | 0.91733 | 0.98775 | 0.99784 | 9.64381 | 2.75907 | 0.09390 | 0.12225 | 0.03917 | 0.32307 |

<2011_09_26_drive_0009_sync_0000000128.png>
image <2011_09_26_drive_0013_sync_0000000085.png>
image <2011_09_26_drive_0052_sync_0000000030.png>
image <2011_09_26_drive_0117_sync_0000000572.png>
image

Interim check

image
Not predicting well on bright and far-away regions (e.g. sky, high-contrast pixels)

MixerUNet

image

3rd Trial (ViT -> MLP-Mixer)

Limitations

  1. Because of the MLPs, the encoder input size is fixed, so the training and testing input sizes must be the same.
    Cannot random-crop the input from (352, 1216) to (352, 704) during training.
    a. Do not random-crop; train on the whole image. => Used method.
    b. RandomResizedCrop could be considered, but changing the image's aspect ratio might affect training and prediction.
  2. Cannot load the pretrained MLP-Mixer weights directly.
    The MLP-Mixer was pretrained on (224, 224) images, so the MLP block input is fixed at (196, 768), which does not match image sizes of (352, 704) or (352, 1216).
    a. Train from scratch.
    b. Weight initialization.
    c. Pretrained channel-mixing weights with freshly initialized token-mixing weights. => Used method.
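Option (c) can be sketched as shape-filtered weight loading: channel-mixing layers are token-count independent and can be copied, while token-mixing layers (196 vs 1672 tokens) keep their fresh initialization. The layer names below (token_mixing, channel_mixing) are illustrative, not the repo's actual names:

```python
import torch
import torch.nn as nn

class ToyMixerBlock(nn.Module):
    """Illustrative Mixer block; the real layer names may differ."""
    def __init__(self, n_tokens: int, dim: int):
        super().__init__()
        self.token_mixing = nn.Linear(n_tokens, n_tokens)  # shape depends on token count
        self.channel_mixing = nn.Linear(dim, dim)          # shape independent of token count

def load_channel_mixing_only(model, pretrained_state):
    # Keep only channel-mixing weights whose shapes match the target model;
    # everything else (token mixing) stays at its fresh initialization.
    own = model.state_dict()
    kept = {k: v for k, v in pretrained_state.items()
            if 'channel_mixing' in k and k in own and v.shape == own[k].shape}
    own.update(kept)
    model.load_state_dict(own)
    return sorted(kept)

pretrained = ToyMixerBlock(n_tokens=196, dim=768)   # 224x224 pretraining: (224/16)^2 tokens
model = ToyMixerBlock(n_tokens=1672, dim=768)       # 352x1216 KITTI: (352*1216)/16^2 tokens
loaded = load_channel_mixing_only(model, pretrained.state_dict())
print(loaded)  # ['channel_mixing.bias', 'channel_mixing.weight']
```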

after 18 epochs

| model | d1 | d2 | d3 | silog | rms | abs_rel | log_rms | log10 | sq_rel |
|---|---|---|---|---|---|---|---|---|---|
| TransUNet | 0.90746 | 0.98142 | 0.99575 | 12.19566 | 3.19173 | 0.08877 | 0.13404 | 0.03874 | 0.39217 |
| TransUNet_Pre | 0.91733 | 0.98775 | 0.99784 | 9.64381 | 2.75907 | 0.09390 | 0.12225 | 0.03917 | 0.32307 |
| MixerUNet | 0.90336 | 0.97935 | 0.99501 | 12.21110 | 3.33323 | 0.09647 | 0.13867 | 0.04194 | 0.42090 |

<2011_09_26_drive_0009_sync_0000000128.png>
image
<2011_09_26_drive_0013_sync_0000000085.png>
image <2011_09_26_drive_0052_sync_0000000030.png>
image <2011_09_26_drive_0117_sync_0000000572.png>
image

4th Trial (token-mixing MLP dim: 384 -> 384*8)

The input size had not been taken into account.
For the standard MLP-Mixer, the input size is 224, so the number of tokens is (224/16)^2 = 196.
However, the KITTI input size is 352x1216, so the number of tokens is (352*1216)/16^2 = 1672, which is about 8.5 times more than 196.
So the token-mixing layer's MLP dimension for the KITTI dataset should be about 8 times larger (384*8) than the standard token-mixing MLP dimension (384).
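The token-count arithmetic above as a quick check:

```python
patch = 16
std_tokens = (224 // patch) ** 2                  # 196 tokens for the 224x224 pretrained Mixer
kitti_tokens = (352 // patch) * (1216 // patch)   # 22 * 76 = 1672 tokens for full KITTI inputs
ratio = kitti_tokens / std_tokens
print(std_tokens, kitti_tokens, round(ratio, 2))  # 196 1672 8.53
```

Rounding the ~8.5x ratio down gives the 384*8 token-mixing width used in this trial.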

after 17 epochs, lr 1e-4 => 1e-3

| model | d1 | d2 | d3 | silog | rms | abs_rel | log_rms | log10 | sq_rel |
|---|---|---|---|---|---|---|---|---|---|
| TransUNet | 0.90746 | 0.98142 | 0.99575 | 12.19566 | 3.19173 | 0.08877 | 0.13404 | 0.03874 | 0.39217 |
| TransUNet_Pre | 0.91733 | 0.98775 | 0.99784 | 9.64381 | 2.75907 | 0.09390 | 0.12225 | 0.03917 | 0.32307 |
| MixerUNet | 0.90336 | 0.97935 | 0.99501 | 12.21110 | 3.33323 | 0.09647 | 0.13867 | 0.04194 | 0.42090 |
| MixerUNet_Pre | 0.92374 | 0.9858 | 0.99702 | 10.81143 | 3.01994 | 0.08174 | 0.12245 | 0.03585 | 0.34027 |

<2011_09_26_drive_0009_sync_0000000128.png>
image <2011_09_26_drive_0013_sync_0000000085.png>
image <2011_09_26_drive_0052_sync_0000000030.png>
image <2011_09_26_drive_0117_sync_0000000572.png>
image

Results

| RGB | TransUNet | TransUNet_pretrained | MixerUNet | MixerUNet_pretrained |
|---|---|---|---|---|
| image | image | image | image | image |
| image | image | image | image | image |
| image | image | image | image | image |
| image | image | image | image | image |
| model | testing_time (sec) | parameters | d1 | d2 | d3 | silog | rms | abs_rel | log_rms | log10 | sq_rel |
|---|---|---|---|---|---|---|---|---|---|---|---|
| TransUNet_Pre | 126.24 | 105M | 0.91733 | 0.98775 | 0.99784 | 9.64381 | 2.75907 | 0.09390 | 0.12225 | 0.03917 | 0.32307 |
| MixerUNet_Pre | 93.67 | 200M | 0.92374 | 0.9858 | 0.99702 | 10.81143 | 3.01994 | 0.08174 | 0.12245 | 0.03585 | 0.34027 |
  • MLP-Mixer's biggest limitation is its fixed input dimension, which forces the training and testing image sizes to be the same. This is crucial because not being able to train on randomly cropped images limits the model's performance.
    • Curious whether random-cropping and then resizing back to the original input size would affect the model's performance, and how.
  • Despite this limitation of MLP-Mixer, its performance is quite good compared to ViT's.
  • Also, although MixerUNet (MLP-Mixer) has more parameters, its testing time is shorter than TransUNet's (ViT); TransUNet (ViT) requires more computation.