Was setting the spatial attribute to 0 in the BatchNormalization nodes of the ArcFace model intended? A user notes that setting spatial=1 returns the right result as well, so I'm trying to understand whether setting spatial=0 (the non-default value) for the opset 8 model was an accident.

The ArcFace model was prepared using MXNet and then converted to ONNX format using the MXNet to ONNX converter.

For BatchNorm, MXNet computes the mean and variance per feature, which is why we explicitly set spatial=0 when translating BatchNorm layers from MXNet to ONNX.
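For reference, here is a sketch of what the two modes compute in inference mode, paraphrasing the shape rules of the BatchNormalization-7 spec. The notation is ours: $d$ indexes the spatial positions $D_1 \times \dots \times D_n$ of an input $x$ of shape $(N, C, D_1, \dots, D_n)$.

$$
\text{spatial}=1:\quad y_{n,c,d} = \gamma_c\,\frac{x_{n,c,d}-\mu_c}{\sqrt{\sigma_c^2+\epsilon}} + \beta_c,\qquad \gamma,\beta,\mu,\sigma^2 \in \mathbb{R}^{C}
$$

$$
\text{spatial}=0:\quad y_{n,c,d} = \gamma_{c,d}\,\frac{x_{n,c,d}-\mu_{c,d}}{\sqrt{\sigma_{c,d}^2+\epsilon}} + \beta_{c,d},\qquad \gamma,\beta,\mu,\sigma^2 \in \mathbb{R}^{C\times D_1\times\cdots\times D_n}
$$

When every $\gamma_{c,d}$ equals $\gamma_c$ (and likewise for $\beta$, $\mu$ and $\sigma^2$), the two forms give identical outputs.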

@abhinavs95 can this model be updated to use spatial=1? The ONNX standard has dropped support for spatial=0 from opset10 onwards and onnxruntime doesn't plan to support this.

The spatial parameter is set to 0 in the MXNet to ONNX converter, probably due to the behavior of MXNet BatchNorm: https://github.com/apache/incubator-mxnet/blob/745a41ca1a6d74a645911de8af46dece03db93ea/python/mxnet/contrib/onnx/mx2onnx/_op_translations.py#L357

I'll try to see if this model can be converted with spatial=1.

@pranavsharma changing the spatial parameter can't be done through the MXNet to ONNX converter API as I had hoped; it requires modifying the source code. I'm currently focusing on another project and will provide an update when I get a chance to work on this.

@abhinavs95 any update on this? onnxruntime does not (and possibly will not) support spatial==0 on its CPU provider, making tensorrt-inference-server unable to load exported models (see here).

There are more models in the ONNX model zoo with this bug: Yolov3 and Duc are also unusable with ONNX Runtime for the same reason. When will this be fixed?

Yolov3 is not impacted by this and has been successfully tested as-is.

Duc and ArcFace models need to be updated to a newer ONNX version. Hopefully @abhinavs95 can make the necessary modifications soon.

Just to reiterate: even on the GPU backend with ONNX Runtime (v0.4 or v0.5), the current model in the repository produces wrong results; the feature vectors returned from the final fc layer are always NaN.
I strongly suggest retiring this model and perhaps replacing it with a PyTorch version of the same network until MXNet updates its ONNX exporter to the latest specification.

Working on it: I'm currently training a new version from scratch on the MS1M dataset using a PyTorch implementation (the model seems to export to ONNX in general). This is going to take a while, though, since I have it on low priority.

I'm afraid I'm still training; it's on low priority, which is why it takes time. Hopefully soon.
I will check whether an intermediate snapshot is exportable, but I don't see why it wouldn't be.

@Mut1nyJD does ArcFace exported from PyTorch to ONNX give the right results?

I think I finally cracked this issue. I added support for non-spatial mode in ORT in this PR (microsoft/onnxruntime#2092), but it still won't run the ArcFace model in the ONNX zoo.

This is because the ArcFace model is an invalid ONNX model: it violates the ONNX spec (https://github.com/onnx/onnx/blob/master/docs/Changelog.md#BatchNormalization-7). It has BatchNorm nodes with spatial == 0, but the input shapes don't adhere to the required shape.

The spec says that when spatial == 0, the scale, B, mean and var inputs must have shape (C, D1, D2, …, Dn).


But in the model they have shape [C], which is only allowed for spatial == 1.


So supporting non-spatial mode in ORT will not solve this problem. This is a bug in the MXNet exporter: it actually means spatial == 1 but still stamps the BatchNormalization nodes with spatial == 0. The output results are correct when we run the model assuming spatial == 1.
So the model doesn't need re-conversion; it only needs an update in the model proto to set spatial == 1 in all the BN nodes, and it will then run correctly in ORT.
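A small NumPy sketch (not ONNX Runtime's kernel) illustrating why this works: in inference mode, spatial=1 with (C,) parameters gives exactly the same output as spatial=0 with those parameters broadcast to (C, H, W). `batchnorm_inference` is a hypothetical helper; eps=1e-5 mirrors the ONNX default epsilon.

```python
import numpy as np


def batchnorm_inference(x, scale, bias, mean, var, spatial, eps=1e-5):
    """Inference-mode BatchNormalization for x of shape (N, C, H, W).

    spatial=1: parameters have shape (C,) and are shared over H and W.
    spatial=0: parameters have shape (C, H, W), one value per position.
    Passing (C,) parameters with spatial=0 makes NumPy align them with the
    last axis (W), which raises ValueError unless C happens to equal W.
    """
    if spatial:
        # (C,) -> (C, 1, 1) so each channel's value broadcasts over H and W.
        scale, bias, mean, var = (p.reshape(-1, 1, 1) for p in (scale, bias, mean, var))
    return scale * (x - mean) / np.sqrt(var + eps) + bias


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 4, 5))  # N=2, C=3, H=4, W=5
scale, bias, mean = rng.standard_normal((3, 3))  # three (C,) vectors
var = rng.random(3) + 0.5  # variances must be positive

y_spatial = batchnorm_inference(x, scale, bias, mean, var, spatial=1)
# Same parameters, explicitly repeated at every spatial position.
tiled = [np.broadcast_to(p.reshape(3, 1, 1), (3, 4, 5)) for p in (scale, bias, mean, var)]
y_nonspatial = batchnorm_inference(x, *tiled, spatial=0)
print(np.allclose(y_spatial, y_nonspatial))  # True
```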

@hariharans29 you can try this model here. It has spatial=0 and has been reshaped. I look forward to your results.

Hi @luan1412167 - I actually think it should be the opposite (spatial == 1).

Hi all,

I wrote a simple script to "correct" (not re-convert from base model) the ONNX model zoo ArcFace model from here - https://github.com/onnx/models/tree/master/vision/body_analysis/arcface.

This link contains the model (named resnet100.onnx) and test data. The script to correct the model is below (I can't attach the corrected model because its size exceeds the allowed limit):

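```python
import onnx

# Load the model zoo ArcFace model.
model = onnx.load(r'arcface_mxnet\resnet100.onnx')

# Stamp spatial=1 on every BatchNormalization node; the weights are left untouched.
for node in model.graph.node:
    if node.op_type == "BatchNormalization":
        for attr in node.attribute:
            if attr.name == "spatial":
                attr.i = 1

onnx.save(model, r'updated_resnet100.onnx')
```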

I checked the results in ONNXRuntime (using the test data provided in the same link) after correction and the result looks okay. Please use the corrected model if you have immediate inferencing needs.

@hariharans29 I created updated_resnet100.onnx following your instructions above. It runs, but the results seem to be wrong. Could you check the results again when running the model with ONNX Runtime from Python?

Did you use the official resnet100.onnx from the model zoo link, or your own converted model, to make the update?

I made the update on the official model and ran the test with all 3 test cases and the results are right.

As a double confirmation, another user made the same observation, that setting spatial == 1 in the same model makes it work: microsoft/onnxruntime#831.

Quoting him - "By now I figured out that the model works correctly if you change the "spatial" attribute of all BatchNormalization nodes from 0 to 1. However, I'm not really sure why that helps".

I just gave an explanation above as to why that helps.

I just downloaded the arcface model again from https://github.com/onnx/models/tree/master/vision/body_analysis/arcface, using the link called "248.9 MB" in the "Download" column, and ONNX Runtime still reports the same problem:

RuntimeError: [ONNXRuntimeError] : 1 : GENERAL ERROR : Exception during initialization: D:\3\s\onnxruntime\core/providers/cpu/nn/batch_norm.h:39 onnxruntime::BatchNorm::BatchNorm spatial == 1 was false. BatchNormalization kernel for CPU provider does not support non-spatial cases

The model doesn't require spatial == 0. Can you please make the update to the model as suggested above and try running it ?

Hi @hariharans29,
I downloaded the model from the model zoo and ran your script to change spatial from 0 to 1.
The model is here.
I tried with two different images, but I get a cosine distance of 0.96, so I think it is wrong (because two different images should give a small cosine distance between their embeddings). Can you share a script to evaluate the model?


I did not use a script. I used the onnx_test_runner tool in the ONNX Runtime repo. It can consume input and output tensor protobufs and compare the results after running the tests. I downloaded the 3 test cases from the ONNX model zoo link (download with test data) and used the onnx_test_runner tool to run each test case, and the output is correct.
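For readers without an ONNX Runtime source build, a rough Python counterpart of that check might look like this. It assumes the model zoo layout of a `test_data_set_N` directory holding `input_*.pb` and `output_*.pb` tensor protobufs; the tolerances are our own choice, not necessarily the ones `onnx_test_runner` uses, and `run_test_data_set` / `outputs_match` are hypothetical helper names.

```python
import glob
import os

import numpy as np


def outputs_match(actual, expected, rtol=1e-3, atol=1e-5):
    """True if both lists hold arrays of equal shape and close values."""
    return len(actual) == len(expected) and all(
        a.shape == e.shape and np.allclose(a, e, rtol=rtol, atol=atol)
        for a, e in zip(actual, expected)
    )


def run_test_data_set(model_path, data_dir):
    """Run one test_data_set_* directory through ONNX Runtime and compare."""
    # Imported here so outputs_match() works without these packages installed.
    import onnx
    import onnxruntime as ort
    from onnx import numpy_helper

    def load(pattern):
        files = sorted(glob.glob(os.path.join(data_dir, pattern)))
        return [numpy_helper.to_array(onnx.load_tensor(f)) for f in files]

    inputs, expected = load("input_*.pb"), load("output_*.pb")
    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    feeds = {meta.name: arr for meta, arr in zip(sess.get_inputs(), inputs)}
    actual = sess.run(None, feeds)
    return outputs_match(actual, expected)
```

Usage would be along the lines of `run_test_data_set("updated_resnet100.onnx", "resnet100/test_data_set_0")`, where the directory path is hypothetical and depends on where the archive was extracted.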

What is the exact numerical cosine distance value you expect ? The definition of "wrong" results seems ridden with some hidden assumptions.

@hariharans29 can you check my model? here

I compute the cosine distance between two embeddings. If the two embeddings belong to the same person, the cosine distance will be near 1; otherwise it will be small, near 0.


I think it is the exact opposite. It is cosine "distance" (not similarity). When the two people are different, the cosine distance will be near 1, and when they are the same, the value nears 0.
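To pin down the terminology, a small sketch (helper names are ours): cosine similarity is the normalized dot product, and cosine distance is taken here as 1 minus that, the same convention as `scipy.spatial.distance.cosine`.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings, in [-1, 1]."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def cosine_distance(a, b):
    """1 - cosine similarity, in [0, 2]; 0 means the same direction."""
    return 1.0 - cosine_similarity(a, b)
```

For example, two vectors pointing the same way have similarity 1.0 and distance 0.0, while orthogonal vectors such as `[1, 0]` and `[0, 1]` have similarity 0.0 and distance 1.0.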

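Hi @hariharans29, here is my script for cosine similarity:

```python
import cv2
import numpy as np
import onnxruntime as rt
from numpy import dot
from numpy.linalg import norm


def preprocess(input_data):
    # cv2.imread returns an HWC image in BGR order. Convert to RGB and
    # transpose to CHW; a plain reshape to (1, 3, 112, 112) scrambles pixels.
    img_data = cv2.cvtColor(input_data, cv2.COLOR_BGR2RGB).astype('float32')
    img_data = img_data.transpose(2, 0, 1)

    # Per-channel normalization with ImageNet mean/std (the poster's choice;
    # ArcFace models are usually fed aligned 112x112 face crops and may
    # expect different preprocessing).
    mean_vec = np.array([0.485, 0.456, 0.406])
    stddev_vec = np.array([0.229, 0.224, 0.225])
    norm_img_data = np.zeros(img_data.shape, dtype=np.float32)
    for i in range(img_data.shape[0]):  # loop over the 3 channels
        norm_img_data[i, :, :] = (img_data[i, :, :] / 255 - mean_vec[i]) / stddev_vec[i]

    return norm_img_data[np.newaxis, ...]  # add batch dim -> (1, 3, 112, 112)


sess = rt.InferenceSession("/home/luandd/CLionProjects/untitled/updated_resnet100.onnx",
                           providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
label_name = sess.get_outputs()[0].name

img = cv2.imread('/home/luandd/Downloads/trump-1.jpg')
img = cv2.resize(img, (112, 112))
a = sess.run([label_name], {input_name: preprocess(img)})[0]

img = cv2.imread('/home/luandd/Downloads/barack-obama.jpeg')
img1 = cv2.resize(img, (112, 112))
b = sess.run([label_name], {input_name: preprocess(img1)})[0]

cos_sim = dot(a[0], b[0]) / (norm(a[0]) * norm(b[0]))
print(cos_sim)
```

I tested the two images above and got cos_sim = 0.9901112. I don't know why!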

The root cause is as @hariharans29 described in this comment. I found this link, which changes the function that converts PReLU from MXNet to ONNX; it fixes this bug in the MXNet converter.
After that, a model exported from MXNet with BatchNorm might still not run because of "spatial=0" in BatchNormalization; see this link.

Alternatively, I wrote a script that post-processes the model exported from MXNet to ONNX: it adds a Reshape node that reshapes each PRelu slope to [1, -1, 1, 1] and sets spatial=1 on the BatchNormalization nodes. It works for me.

```python
import logging

import onnx
from onnx import TensorProto, checker, helper

logging.basicConfig(level=logging.INFO)

model = onnx.load(r"mxnet2onnx_exported_bug_model.onnx")
graph = model.graph

onnx_processed_nodes = []
for ind, node in enumerate(graph.node):
    # Work on a copy so it stays valid after graph.node is cleared below.
    new_node = onnx.NodeProto()
    new_node.CopyFrom(node)

    if node.op_type == "PRelu":
        input_relu_gamma = node.input[1]  # slope exported with shape (C,)
        input_reshape_name = "reshape{}".format(ind)  # target-shape tensor
        slope_number = "slope{}".format(ind)  # reshaped slope, (1, C, 1, 1)

        # Constant target shape [1, -1, 1, 1], stored as an initializer and also
        # listed as a graph input, as MXNet exports do (required for IR version 3).
        graph.initializer.append(helper.make_tensor(
            name=input_reshape_name, data_type=TensorProto.INT64,
            dims=[4], vals=[1, -1, 1, 1]))
        graph.input.append(helper.make_tensor_value_info(
            input_reshape_name, TensorProto.INT64, [4]))

        # Reshape the slope so it broadcasts per channel instead of along W.
        onnx_processed_nodes.append(helper.make_node(
            "Reshape", inputs=[input_relu_gamma, input_reshape_name],
            outputs=[slope_number], name=slope_number))
        new_node.input[1] = slope_number

    # If spatial=0 does not work for BatchNormalization, set spatial=1;
    # otherwise remove this branch.
    elif node.op_type == "BatchNormalization":
        for attr in new_node.attribute:
            if attr.name == "spatial":
                attr.i = 1

    onnx_processed_nodes.append(new_node)

# Swap in the new node list (each Reshape precedes the PRelu that consumes it).
del graph.node[:]
graph.node.extend(onnx_processed_nodes)

# Check the model; editing in place keeps the original opset imports.
checker.check_model(model)

# Write model
str_input = '3,112,112'
input_shape = (1,) + tuple(int(x) for x in str_input.split(','))
onnx_file_path = "mxnet2onnx_model_onnxruntime.onnx"

onnx.save(model, onnx_file_path)
logging.info("Input shape of the model %s ", input_shape)
logging.info("Exported ONNX file %s saved to disk", onnx_file_path)
```
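Why the [1, -1, 1, 1] reshape matters can be sketched in NumPy: ONNX PRelu broadcasts `slope` to the input unidirectionally with NumPy-style rules, which align trailing axes, so a (C,) slope lines up with W rather than with C. This is an illustrative re-implementation, not ONNX Runtime's kernel; `prelu` is a hypothetical helper, and C == W is chosen deliberately because that is the case where the misalignment yields wrong values instead of a shape error.

```python
import numpy as np


def prelu(x, slope):
    """PRelu as ONNX defines it: x where x >= 0, slope * x elsewhere,
    with slope broadcast to x's shape using NumPy rules."""
    return np.where(x >= 0, x, slope * x)


N, C, H, W = 1, 4, 4, 4  # C == W, so a flat slope broadcasts without error
x = -np.ones((N, C, H, W))  # all negative, so the slope is always applied
slope = np.arange(1, C + 1, dtype=float)  # one slope per channel, shape (C,)

per_channel = prelu(x, slope.reshape(1, -1, 1, 1))  # what the Reshape produces
flat = prelu(x, slope)  # (C,) is aligned with the last axis (W), not C

print(per_channel[0, :, 0, 0])  # [-1. -2. -3. -4.]: channel c uses slope[c]
print(flat[0, :, 0, 0])  # [-1. -1. -1. -1.]: every channel uses slope[0] at w=0
```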


Awesome work @duonglong289! I was able to convert the ONNX model zoo ArcFace to TensorRT 7 using onnx-simplifier and @hariharans29's script, but I was struggling for two days to convert the original InsightFace model, and your script helped with that conversion.

If anyone is interested, I have made a script to convert the original InsightFace model zoo ArcFace to ONNX and then to TensorRT, based on @duonglong289's script.

TensorRT outputs are the same as MXNet outputs, so it can be a drop-in replacement for the MXNet model.
The TRT inference code needs some cleanup and will be released later.

Please guide me on this. Thanks.

@NaeemKhan333, if you're not strict about a specific ArcFace version, you could try my converter, which builds a TRT engine from the original ArcFace model; it gives better accuracy than the model provided in the ONNX model zoo.
To use it you'll need Docker, nvidia-container-toolkit and NVIDIA 450.xx drivers.
To build the engine you'll need to:

  1. Clone the repo:
    git clone https://github.com/SthPhoenix/InsightFace-REST.git
  2. Deploy the conversion container:
    bash deploy_converter.sh
  3. Inside the container shell, execute the script:
    python build_insight_trt.py

As a result, you'll get a models folder in the repo's root containing the original MXNet model, the same model converted to ONNX, and finally a .plan file containing the serialized TensorRT engine.

The engine will be built using TensorRT 7.1.3. If you can't use 450.xx drivers, you can edit src/Dockerfile.converter to use the TensorRT:20.03 image instead of 20.09; you'll then get TensorRT 7.0, which is not recommended.

Alternatively, you can run build_insight_trt.py directly, but then you need to manually install mxnet==1.7.0, onnx==1.7.0, tensorrt>=7.0.0, and CUDA and cuDNN compatible with your graphics driver.

Or do I need to download the original MXNet model?
Can you guide me? Thanks.

The script will do everything for you, but it will use the LResNet100E-IR,ArcFace@ms1m-refine-v2 model from the InsightFace model zoo.


mxnet version: 1.6.0
onnx version: 1.7.0
Model file is not found. Downloading.
Downloading /models/mxnet/arcface_r100_v1.zip from http://insightface.ai/files/models/arcface_r100_v1.zip...
100%|██████████████████████████████| 237710/237710 [00:13<00:00, 17061.59KB/s]
Converting MXNet model to ONNX...
Creating intermediate copy of source model...
Applying RetinaFace specific fixes to input MXNet model before conversion...
Exporting to ONNX...
[08:27:16] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.0.0. Attempting to upgrade...
[08:27:16] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
Applying ArcFace specific fixes to output ONNX
Removing initializer from inputs in ONNX model...
Removing intermediate *.symbol and *.params
Building TensorRT engine...
[TensorRT] WARNING: onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
TensorRT model ready.

By default the model location is the /models dir; since I'm mounting it inside Docker, you can check this dir at the disk root, or just change this path in build_insight_trt.py.

BTW: I think this conversation is already out of the scope of the original issue; feel free to open an issue in my repo if you run into any problems.

