Transformer in numpy

ViT (Vision Transformer)

This is a ViT network written entirely in numpy, including both the forward pass and backpropagation.
It includes these layers: multi-head attention, PatchEmbed, Position_add, convolution, Fullconnect, flatten, Relu, layer_norm, cross-entropy loss and MSE loss.
Training runs on the CPU and is slow, so I tried different settings.
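
As an illustration of what writing both passes by hand involves, here is a minimal sketch of layer normalization over the last axis with its backward pass. The function names and the cache layout are hypothetical and are not taken from this repository; its own layer_norm may be organized differently.

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize the last axis of x, then scale by gamma and shift by beta."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mean) * inv_std
    out = gamma * x_hat + beta
    cache = (x_hat, inv_std, gamma)  # values the backward pass needs
    return out, cache

def layer_norm_backward(dout, cache):
    """Return gradients with respect to x, gamma and beta."""
    x_hat, inv_std, gamma = cache
    # Parameter gradients: sum over every axis except the feature axis.
    reduce_axes = tuple(range(dout.ndim - 1))
    dgamma = (dout * x_hat).sum(axis=reduce_axes)
    dbeta = dout.sum(axis=reduce_axes)
    # Input gradient: closed form for normalization over the last axis.
    dx_hat = dout * gamma
    dx = inv_std * (
        dx_hat
        - dx_hat.mean(axis=-1, keepdims=True)
        - x_hat * (dx_hat * x_hat).mean(axis=-1, keepdims=True)
    )
    return dx, dgamma, dbeta

# Usage with a token tensor shaped (batch, tokens, embed_dim).
x = np.random.default_rng(0).normal(size=(2, 49, 96))
gamma, beta = np.ones(96), np.zeros(96)
out, cache = layer_norm_forward(x, gamma, beta)
dx, dgamma, dbeta = layer_norm_backward(np.ones_like(out), cache)
print(out.shape, dx.shape, dgamma.shape)  # (2, 49, 96) (2, 49, 96) (96,)
```

A hand-written backward pass like this is usually checked against a finite-difference gradient before it is trusted in training.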

Trained on the MNIST dataset, its precision reaches 97.2% with the following settings:

    epoch = 36
    batchsize = 100
    lr = 0.001
    embed_dim = 96
    images_shape = (batchsize, 1, 30-2, 30-2)
    n_patch = 7
    patchnorm = True
    # [0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0], [1, 1, 1]
    fixed     = 1 #False
    cls_token = 0 #True
    num_h = [2*2] * 6 #[3, 6, 12, 3, 6, 12]
    patch_convolu = 0 #False
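
To make the shapes in these settings concrete, here is a minimal sketch of how a 28x28 image could be cut into patch tokens and projected to embed_dim. It assumes n_patch = 7 means 7 patches per side (so 4x4-pixel patches and 49 tokens), which the settings do not state. split_into_patches and the random projection weights are hypothetical and are not taken from transformer_of_image.py.

```python
import numpy as np

def split_into_patches(images, n_patch):
    """Cut (B, C, H, W) images into n_patch x n_patch non-overlapping patches.

    Returns an array of shape (B, n_patch * n_patch, C * ph * pw).
    """
    b, c, h, w = images.shape
    if h % n_patch or w % n_patch:
        raise ValueError("image height and width must be divisible by n_patch")
    ph, pw = h // n_patch, w // n_patch
    x = images.reshape(b, c, n_patch, ph, n_patch, pw)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # (B, patch_row, patch_col, C, ph, pw)
    return x.reshape(b, n_patch * n_patch, c * ph * pw)

batchsize, n_patch, embed_dim = 100, 7, 96
images = np.zeros((batchsize, 1, 30 - 2, 30 - 2), dtype=np.float32)

patches = split_into_patches(images, n_patch)
print(patches.shape)  # (100, 49, 16)

# Assumed linear patch embedding: one weight matrix shared by all patches.
rng = np.random.default_rng(0)
w_embed = rng.normal(scale=0.02, size=(patches.shape[-1], embed_dim))
tokens = patches @ w_embed
print(tokens.shape)  # (100, 49, 96)
```

Under the same assumption, each of the 6 blocks with 2*2 = 4 heads would give every head 96 / 4 = 24 dimensions.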

This code provides functions to save a model and to restore it to continue training.
You can find the saved models in the model directory.
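
For readers who want the general idea of checkpointing a numpy model, here is a minimal sketch. It assumes the parameters are held in a dict of named arrays and uses np.savez; save_params, load_params, the parameter names and the file name are all hypothetical, and the format this repository actually uses may differ.

```python
import os
import tempfile

import numpy as np

def save_params(path, params, epoch):
    """Write a dict of named arrays plus the epoch counter to one .npz file."""
    # Assumes no parameter is itself named "epoch".
    np.savez(path, epoch=np.array(epoch), **params)

def load_params(path):
    """Read back the arrays and the epoch written by save_params."""
    with np.load(path) as data:
        epoch = int(data["epoch"])
        params = {k: data[k] for k in data.files if k != "epoch"}
    return params, epoch

# Usage: save after an epoch, then restore before resuming training.
params = {"w_embed": np.ones((16, 96)), "b_embed": np.zeros(96)}
path = os.path.join(tempfile.mkdtemp(), "vit_checkpoint.npz")
save_params(path, params, epoch=3)
restored, epoch = load_params(path)
print(epoch, sorted(restored))  # 3 ['b_embed', 'w_embed']
```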

Train with the command:

    python transformer_of_image.py

Predict with:

    python predict.py

precision

Trained on a 2020 Intel MacBook Pro.

class	precision
0	0.9908163265306122
1	0.9903083700440528
2	0.9748062015503876
3	0.9831683168316832
4	0.9674134419551935
5	0.9708520179372198
6	0.9739039665970772
7	0.9630350194552529
8	0.9517453798767967
9	0.9544103072348861
overall	0.972
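
A minimal sketch of how per-class scores like these could be computed from labels and predictions. It assumes each row is the number of correctly classified test images of that digit divided by the number of test images of that digit (the first row, for instance, equals 971/980). The function name is hypothetical and is not taken from predict.py.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, n_classes=10):
    """Fraction of samples of each true class that were predicted correctly."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    correct = np.bincount(y_true[y_true == y_pred], minlength=n_classes)
    total = np.bincount(y_true, minlength=n_classes)
    # Classes absent from y_true get NaN instead of a division by zero.
    return np.divide(correct, total,
                     out=np.full(n_classes, np.nan), where=total > 0)

# Toy example with 3 classes.
y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])
scores = per_class_accuracy(y_true, y_pred, n_classes=3)
overall = (y_true == y_pred).mean()
print(np.round(scores, 3).tolist())  # [0.5, 0.667, 1.0]
print(round(float(overall), 3))      # 0.667
```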