quic/ai-hub-models

[BUG] IOT BYOM Issue: <insert issue here>


Describe the issue
I trained a custom model using the Ultralytics repository at the following commit: 3208eb72ef277b0b825306a84df6c460a8406647. The model was trained on a custom dataset to detect persons and forklifts.

The output was a .pt file named forklifts.pt. I want to convert the model to .tflite and quantize it to run on the QCS6490 chipset. I used the AI Hub package, running the following:

python -m qai_hub_models.models.yolov8_det_quantized.export --device "QCS6490 (Proxy)" --ckpt-name "/home/mbenencase/Downloads/forklifts.pt" --output-dir forklifts_new

This is the job details link: https://app.aihub.qualcomm.com/jobs/jp14q9qlp/

The problem is that I am not able to delegate all of the nodes to the NPU. This is the output available at the job link above:

------------------------------------------------------------
Performance results on-device for Yolov8_Det_Quantized.
------------------------------------------------------------
Device                          : QCS6490 (Proxy) (12)                   
Runtime                         : TFLITE                                 
Estimated inference time (ms)   : 51.0                                   
Estimated peak memory usage (MB): [1, 37]                                
Total # Ops                     : 267                                    
Compute Unit(s)                 : GPU (227 ops) NPU (10 ops) CPU (30 ops)
------------------------------------------------------------
More details: https://app.aihub.qualcomm.com/jobs/jp14q9qlp/

tmp0eymkt7b.h5: 100%|████████████████████████████████████████████████████████| 154k/154k [00:00<00:00, 203kB/s]

Comparing on-device vs. local-cpu inference for Yolov8_Det_Quantized.
+---------------+--------------+--------+
| output_name   | shape        |   psnr |
+===============+==============+========+
| boxes         | (1, 8400, 4) |  73.76 |
+---------------+--------------+--------+
| scores        | (1, 8400)    |  62.87 |
+---------------+--------------+--------+

- psnr: Peak Signal-to-Noise Ratio (PSNR). >30 dB is typically considered good.

which is odd, because with the stock YOLOv8 quantized model available on AI Hub I can run all of the model's nodes on the DSP.

To Reproduce
job link: https://app.aihub.qualcomm.com/jobs/jp14q9qlp/

Host configuration:

  • Ubuntu 20.04
  • Chrome
kory commented

Hi Marcelo,

It looks like the model you profiled in the link is FP32. That's why it's not running on the HTP (NPU) on your device: your device does not support FP32 inference on the HTP.

Looking further, it looks like you uploaded a model with empty AIMET encodings:
https://app.aihub.qualcomm.com/models/mnjx3p91q

The problem is that we don't use existing AIMET encodings when you provide your own weights. This is a bug (we should quantize for you in this case) that will be resolved for YOLOv8 in our next release.

In the meantime, there are a few ways you can resolve this:

First, try to quantize with QuantizeJob. You can follow the linked example, starting at the step commented:

# 2. Converts the PyTorch model to ONNX and quantizes the ONNX model.
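
A minimal sketch of that approach, assuming the qai_hub client's submit_quantize_job API; the ONNX path, the "image" input name, the 640x640 shape, and the random calibration samples are placeholders for illustration (use real images from your dataset for good encodings):

import numpy as np
import qai_hub as hub

# Assumed ONNX export of the detector; (1, 3, 640, 640) is the usual YOLOv8 input shape.
onnx_model = "forklifts.onnx"

# Calibration data: a dict mapping the model's input name to a list of sample arrays.
calibration_data = {
    "image": [np.random.rand(1, 3, 640, 640).astype(np.float32) for _ in range(100)]
}

quantize_job = hub.submit_quantize_job(
    model=onnx_model,
    data=calibration_data,
    weights_dtype=hub.QuantizeDtype.INT8,
    activations_dtype=hub.QuantizeDtype.INT8,
)
quantized_onnx = quantize_job.get_target_model()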

Alternatively, use an example from this script to generate encodings:

  1. Generate encodings with https://github.com/quic/ai-hub-models/blob/2fc5329766811cd92c7e140259866fe78a8315a1/scripts/examples/quantize_detector_coco.py

  2. Trace and save the model:

import torch

# Model stands in for the qai_hub_models model class you trained against.
model = Model.from_pretrained(...)
ts = model.convert_to_torchscript()
torch.jit.save(ts, "model.pt")
  3. Copy the model and encodings into a directory and zip it:

mkdir model
cd model
mv <path>/model.pt model.pt
mv <path>/model.encodings model_torch.encodings
cd ..
zip -r model.zip model
  4. Upload the model to AI Hub and compile (see the sketch below).
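
A minimal sketch of step 4, assuming the qai_hub client's upload_model and submit_compile_job APIs; the "image" input name and 640x640 shape are placeholder assumptions for a standard YOLOv8 input:

import qai_hub as hub

# Upload the zipped TorchScript model plus AIMET encodings.
model = hub.upload_model("model.zip")

# Compile for the target device; TorchScript models need explicit input specs.
compile_job = hub.submit_compile_job(
    model=model,
    device=hub.Device("QCS6490 (Proxy)"),
    input_specs=dict(image=(1, 3, 640, 640)),
    options="--target_runtime tflite",
)
compiled_model = compile_job.get_target_model()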