Unofficial implementation of the paper Visual Transformers: Token-based Image Representation and Processing for Computer Vision.
```
python main.py task_mode learning_mode data --model --weights
```

where:

- `task_mode`: `classification` or `semantic_segmentation`, selecting the corresponding task.
- `learning_mode`: `train` to train `--model` from scratch, or `test` to validate `--model` with `--weights` on the validation data.
- `data`: path to the dataset; for classification this should point to ImageNet, for semantic segmentation to COCO.
- `--model`:
  - classification: `ResNet18` or `VT_ResNet18` (used by default).
  - semantic segmentation: `PanopticFPN` or `VT_FPN` (used by default).
- `--weights`: must be provided when `learning_mode` is `test`; not used in `train` mode.
- `--from_pretrained`: used to resume training from a checkpoint; expects a `state_dict` that contains `model_state_dict`, `optimizer_state_dict`, and `epoch` (see the sketch after the examples below).
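For example, two hypothetical invocations (the dataset paths and the weights file name are placeholders):

```
# Train VT_ResNet18 on ImageNet from scratch.
python main.py classification train /path/to/imagenet --model VT_ResNet18

# Validate PanopticFPN on COCO with saved weights.
python main.py semantic_segmentation test /path/to/coco --model PanopticFPN --weights weights.pth
```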
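A minimal sketch, assuming standard PyTorch conventions, of writing a checkpoint in the format `--from_pretrained` expects; the model, optimizer, and file name here are illustrative stand-ins, not the repo's actual code:

```python
import torch

# Stand-in model and optimizer; any nn.Module and optimizer fit this pattern.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters())

# The three keys --from_pretrained expects, per the description above.
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "epoch": 5,
    },
    "checkpoint.pth",
)
```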
- Final metrics and losses after 15 epochs (classification) and 5 epochs (semantic segmentation):

|                     | ResNet18 | VT-ResNet18 |
|---------------------|----------|-------------|
| Training accuracy   | 0.664675 | 0.672889    |
| Validation accuracy | 0.691541 | 0.696929    |
| Training loss       | 1.312150 | 1.249382    |
| Validation loss     | 1.173559 | 1.114401    |

|                 | Panoptic FPN | VT-FPN   |
|-----------------|--------------|----------|
| Training mIoU   | 8.0968       | 7.0343   |
| Validation mIoU | 4.3148       | 3.2351   |
| Training loss   | 2.044084     | 2.068598 |
| Validation loss | 2.101253     | 2.120928 |
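For reference, a minimal sketch of the standard per-class mIoU computation over predicted and ground-truth label maps; the repo's actual metric code may differ:

```python
import torch

def mean_iou(pred: torch.Tensor, target: torch.Tensor, num_classes: int) -> float:
    """Average IoU over classes present in the prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = (pred_c | target_c).sum().item()
        if union == 0:
            continue  # skip classes absent from both masks
        intersection = (pred_c & target_c).sum().item()
        ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0
```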
- Loss and metric curves for classification (cross-entropy loss, accuracy) and semantic segmentation (pixel-wise cross-entropy loss, mIoU): *(plots)*
- Efficiency and parameters:

|              | Params (M) | FLOPs (M) | Forward-backward pass (s) |
|--------------|------------|-----------|---------------------------|
| ResNet18     | 11.2       | 822       | 0.016                     |
| VT-ResNet18  | 12.7       | 543       | 0.02                      |
| Panoptic FPN | 16.4       | 67412     | 0.08                      |
| VT-FPN       | 40.3       | 110019    | 0.062                     |
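A minimal sketch of how the parameter count and forward-backward timing can be reproduced, using a torchvision ResNet-18 as a stand-in for the repo's models (an assumption; FLOP counts would additionally need a profiler such as fvcore or thop):

```python
import time
import torch
from torchvision.models import resnet18  # stand-in; the repo's models differ

model = resnet18()
x = torch.randn(1, 3, 224, 224)

# Parameter count in millions.
params_m = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Params (M): {params_m:.1f}")

# Time one forward-backward pass after a few warm-up iterations.
for _ in range(3):
    model(x).sum().backward()
start = time.perf_counter()
model(x).sum().backward()
print(f"Forward-backward pass (s): {time.perf_counter() - start:.3f}")
```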