larq/compute-engine

Conversion with int8 input/output problems

GhaziXX opened this issue · 3 comments

Hello, I am trying to convert a pre-made model from the model zoo to TFLite and use int8 as the input and output type.
A couple of problems occurred:

  1. When converting without the experimental_default_int8_range argument, the input/output types do not change to int8.
  2. When adding the experimental_default_int8_range argument with a value of (0, 255), which is the range of pixel values in an image, I get the following output:
/usr/local/lib/python3.7/dist-packages/larq_compute_engine/mlir/python/converter.py:91: UserWarning: Using `experimental_default_int8_range` as fallback quantization stats. This should only be used for latency tests.
  "Using `experimental_default_int8_range` as fallback quantization stats. "
WARNING:absl:Found untraced functions such as ste_sign_50_layer_call_and_return_conditional_losses, ste_sign_50_layer_call_fn, ste_sign_51_layer_call_and_return_conditional_losses, ste_sign_51_layer_call_fn, ste_sign_52_layer_call_and_return_conditional_losses while saving (showing 5 of 75). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: /tmp/tmpnnjll44k/assets
INFO:tensorflow:Assets written to: /tmp/tmpnnjll44k/assets
/usr/local/lib/python3.7/dist-packages/larq_compute_engine/mlir/python/converter.py:91: UserWarning: Using `experimental_default_int8_range` as fallback quantization stats. This should only be used for latency tests.
  "Using `experimental_default_int8_range` as fallback quantization stats. "
  3. After doing this, everything works fine for the QuickNet family from the sota package: inference runs and the input and output types are int8. But if I try another model architecture from the literature package, I get the following error when running inference:
converted = lce.convert_keras_model(
    m, target="arm", inference_input_type=tf.int8,
    inference_output_type=tf.int8, experimental_default_int8_range=(0, 255),
)
with open("t2.tflite", "wb") as f:
    f.write(converted)

buf = open("t2.tflite", "rb").read()
interpreter = lce.tflite.python.interpreter.Interpreter(buf)
print(interpreter.input_types)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-122-2d98c98eb9a7> in <module>()
      1 buf = open('t2.tflite', 'rb').read()
----> 2 interpreter = lce.tflite.python.interpreter.Interpreter(buf)
      3 interpreter.input_types

/usr/local/lib/python3.7/dist-packages/larq_compute_engine/tflite/python/interpreter.py in __init__(self, flatbuffer_model, num_threads, use_reference_bconv)
     66     ):
     67         self.interpreter = interpreter_wrapper_lite.LiteInterpreter(
---> 68             flatbuffer_model, num_threads, use_reference_bconv
     69         )
     70 

RuntimeError: ERROR at larq_compute_engine/tflite/python/interpreter_wrapper_lite.cc:49 : the following was false: interpreter_->AllocateTensors() == kTfLiteOk

Hi @GhaziXX,

You are right, experimental_default_int8_range is needed because all these models have been trained without int8 quantization awareness.

As for the error, it seems some of the output from the interpreter is not printed. If I had to guess, it's because the literature models use binary convolutions with 0-padding, whereas the default compute-engine binary convolution kernels expect 1-padding (which QuickNet has). You could try passing use_reference_bconv=True to the interpreter, i.e. Interpreter(buf, use_reference_bconv=True), which should support both types of padding.
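
For reference, a minimal sketch of running the converted model with that flag could look like the following (I'm assuming the interpreter's predict method and input_shapes property here, and using random int8 data as a stand-in for a real image):

import numpy as np
from larq_compute_engine.tflite.python.interpreter import Interpreter

# Use the reference binary-convolution kernels, which handle both
# zero- and one-padded binary convolutions.
buf = open("t2.tflite", "rb").read()
interpreter = Interpreter(buf, use_reference_bconv=True)
print(interpreter.input_types)  # should now report int8

# A dummy inference to verify that tensor allocation succeeds.
dummy = np.random.randint(-128, 128, size=interpreter.input_shapes[0], dtype=np.int8)
output = interpreter.predict(dummy)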

Thank you @Tombana, that worked. Maybe that flag could be set to True by default so that others don't run into this error.
Anyway, thank you very much for the help and the explanation!

@GhaziXX beware that post-training quantization (which the experimental_default_int8_range flag triggers) and binarization don't go well together; you should assume there will be severe accuracy degradation, to the point that the converted model performs close to random. So this is only useful for latency benchmarking.
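
For what it's worth, since latency is the only thing these numbers are good for, a rough timing loop from Python could look like the sketch below (again assuming the interpreter's predict method and input_shapes property; real on-device figures should come from a benchmark binary on the target hardware):

import time
import numpy as np
from larq_compute_engine.tflite.python.interpreter import Interpreter

interpreter = Interpreter(buf, use_reference_bconv=True)
# Accuracy is irrelevant for a latency test, so random int8 input is fine.
dummy = np.random.randint(-128, 128, size=interpreter.input_shapes[0], dtype=np.int8)

interpreter.predict(dummy)  # warm-up run
start = time.perf_counter()
for _ in range(10):
    interpreter.predict(dummy)
print("mean latency: %.1f ms" % ((time.perf_counter() - start) / 10 * 1000))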