Visual-Transformers

Unofficial implementation of the paper Visual Transformers: Token-based Image Representation and Processing for Computer Vision.

Usage:

python main.py task_mode learning_mode data --model --weights, where:

  • task_mode: classification or semantic_segmentation, selecting the corresponding task.
  • learning_mode: train to train --model from scratch, test to validate --model with --weights on validation data.
  • data: path to the dataset; for classification this should be a path to ImageNet, for semantic segmentation a path to COCO.
  • --model:
    ○ classification: ResNet18 or VT_ResNet18 (the default).
    ○ semantic segmentation: PanopticFPN or VT_FPN (the default).
  • --weights: must be provided when learning_mode is test; it is not used in train mode.
  • --from_pretrained: used to resume training from a saved point; should be a state_dict containing model_state_dict, optimizer_state_dict and epoch (see the sketch after this list).
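
For example (the dataset paths below are purely illustrative):

    python main.py classification train /path/to/imagenet --model VT_ResNet18
    python main.py semantic_segmentation test /path/to/coco --model VT_FPN --weights weights.pth

A checkpoint suitable for --from_pretrained can be produced with the usual PyTorch saving pattern. This is a minimal sketch that only illustrates the expected key names; the model, optimizer and epoch here are stand-ins for whatever was actually trained:

    import torch
    from torchvision.models import resnet18

    # Stand-in model and optimizer; any trained model/optimizer pair works.
    model = resnet18(num_classes=1000)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    epoch = 15

    # Save the three keys --from_pretrained expects:
    # model_state_dict, optimizer_state_dict and epoch.
    torch.save(
        {
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
            "epoch": epoch,
        },
        "checkpoint.pth",
    )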

Results:

  • final metrics and losses after 15 epochs of classification and 5 epochs of semantic segmentation, respectively:
                         ResNet18    VT-ResNet18
    Training accuracy    0.664675    0.672889
    Validation accuracy  0.691541    0.696929
    Training loss        1.312150    1.249382
    Validation loss      1.173559    1.114401
                         Panoptic FPN    VT-FPN
    Training mIoU        8.0968          7.0343
    Validation mIoU      4.3148          3.2351
    Training loss        2.044084        2.068598
    Validation loss      2.101253        2.120928
  • loss and metric curves for classification and semantic segmentation (figures: classification cross-entropy loss and accuracy; semantic segmentation pixel-wise cross-entropy loss and mIoU).
  • efficiency and parameter counts:

                      Params (M)    FLOPs (M)    Forward-backward pass (s)
    ResNet18          11.2          822          0.016
    VT-ResNet18       12.7          543          0.02
    Panoptic FPN      16.4          67412        0.08
    VT-FPN            40.3          110019       0.062
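
The parameter counts and timings above could be reproduced with a rough sketch like the following (assuming PyTorch and a torchvision ResNet18 as the example; the batch size and input resolution behind the table are not stated, and FLOPs counting would additionally require a profiler such as fvcore, which is omitted here):

    import time
    import torch
    from torchvision.models import resnet18

    model = resnet18()

    # Parameter count in millions.
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"Params (M): {params_m:.1f}")

    # Time one forward-backward pass on a single 224x224 image
    # (no warm-up or averaging, so treat the number as indicative only).
    x = torch.randn(1, 3, 224, 224)
    target = torch.tensor([0])
    start = time.perf_counter()
    loss = torch.nn.CrossEntropyLoss()(model(x), target)
    loss.backward()
    print(f"Forward-backward pass (s): {time.perf_counter() - start:.3f}")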

Weights: