quic/ai-hub-models

[BUG] fails to create context from binary for the 4 QNN Context Bin files (llama_v2_7b_chat_quantized_TokenGenerator_x_Quantized.bin)

Closed this issue · 3 comments

Hi,
Creating contexts from binary fails when I load all 4 QNN context binary files (llama_v2_7b_chat_quantized_TokenGenerator_1_Quantized.bin, llama_v2_7b_chat_quantized_TokenGenerator_2_Quantized.bin, llama_v2_7b_chat_quantized_TokenGenerator_3_Quantized.bin, llama_v2_7b_chat_quantized_TokenGenerator_4_Quantized.bin) on an Android phone (S24 Ultra).
However, it succeeds when I process the files one by one: I create the context from llama_v2_7b_chat_quantized_TokenGenerator_1_Quantized.bin, call executeGraphs, and call freeContext; then I do the same for llama_v2_7b_chat_quantized_TokenGenerator_2_Quantized.bin, and so on through llama_v2_7b_chat_quantized_TokenGenerator_4_Quantized.bin.
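
For readers, the one-by-one sequence described above can be sketched as follows. This is only an illustration of the load, execute, free ordering: `FakeContext`, `LiveCounter`, and `runOneByOne` are hypothetical stand-ins, not QNN SDK types or functions, and the sketch does not call into QNN at all.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical bookkeeping: how many contexts are alive now, and the peak.
struct LiveCounter {
  int live = 0;
  int peak = 0;
};

// Hypothetical stand-in for a context created from one binary (not a QNN type).
class FakeContext {
 public:
  FakeContext(const std::string& path, LiveCounter& counter)
      : path_(path), counter_(counter) {
    ++counter_.live;
    if (counter_.live > counter_.peak) counter_.peak = counter_.live;
  }
  // Stands in for freeContext: runs when the object goes out of scope.
  ~FakeContext() {
    assert(counter_.live > 0);
    --counter_.live;
  }
  FakeContext(const FakeContext&) = delete;
  FakeContext& operator=(const FakeContext&) = delete;

  void execute() const {}  // stands in for executeGraphs
  const std::string& path() const { return path_; }

 private:
  std::string path_;
  LiveCounter& counter_;
};

// Create, execute, and free each binary before moving on to the next one.
inline std::size_t runOneByOne(const std::vector<std::string>& paths,
                               LiveCounter& counter) {
  std::size_t executed = 0;
  for (const auto& path : paths) {
    FakeContext ctx(path, counter);  // "create context from binary"
    ctx.execute();
    ++executed;
    // ctx is destroyed at the end of this iteration, so at most one
    // context is alive at any time.
  }
  return executed;
}
```

With this ordering only one context exists at a time, which is the difference from the failing case where all four are kept alive together.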

How can I create contexts from binary for all 4 QNN context binary files (llama_v2_7b_chat_quantized_TokenGenerator_1_Quantized.bin, llama_v2_7b_chat_quantized_TokenGenerator_2_Quantized.bin, llama_v2_7b_chat_quantized_TokenGenerator_3_Quantized.bin, llama_v2_7b_chat_quantized_TokenGenerator_4_Quantized.bin) so that they are loaded together?

The failure log follows.

2024-08-07 17:34:36.233 9745-9772 LlamaNative com.test.llama I [taeyeon] QNN System function pointers are populated
2024-08-07 17:34:36.233 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to create system handle. m_qnnFunctionPointers.qnnSystemInterface.systemContextCreate
2024-08-07 17:34:36.233 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to allocate memory. bufferSize = 1083146928
2024-08-07 17:34:37.441 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to read binary data = /data/local/tmp/sample_app/llama_v2_7b_chat_quantized_TokenGenerator_1_Quantized.bin
2024-08-07 17:34:37.441 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to get context binary info. binaryInfoSize = 0
2024-08-07 17:34:37.441 9745-9772 LlamaNative com.test.llama I Extracting graphsInfo for graph Idx: 0
2024-08-07 17:34:37.441 9745-9772 LlamaNative com.test.llama I Extracting tensorInfo for tensor tensorsCount : 516
2024-08-07 17:34:37.442 9745-9772 LlamaNative com.test.llama I Extracting tensorInfo for tensor tensorsCount : 513
2024-08-07 17:34:37.442 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to copy metadata. m_graphsCount[4] = 1
2024-08-07 17:34:37.446 9745-9772 com.test.llama com.test.llama I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2331: remote_handle_control_domain: requested QOS 1, latency 100 for domain 3 handle 0x7cfa118450
2024-08-07 17:34:37.447 9745-9772 com.test.llama com.test.llama I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:1872: manage_poll_qos: poll mode updated to 3 for domain 3, handle 0x7cfa118450 for timeout 9999
2024-08-07 17:34:37.447 9745-9772 com.test.llama com.test.llama I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2331: remote_handle_control_domain: requested QOS 3, latency 9999 for domain 3 handle 0x7cfa118450
2024-08-07 17:34:37.757 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to create context from binary.
2024-08-07 17:34:37.757 9745-9772 LlamaNative com.test.llama I [taeyeon] graphRetrieve graphName: tmp0b0a0nm3
2024-08-07 17:34:37.757 9745-9772 LlamaNative com.test.llama I [taeyeon] graphRetrieve graphName: tmp0b0a0nm3
2024-08-07 17:34:37.832 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to allocate memory. bufferSize = 821006960
2024-08-07 17:34:38.711 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to read binary data = /data/local/tmp/sample_app/llama_v2_7b_chat_quantized_TokenGenerator_2_Quantized.bin
2024-08-07 17:34:38.711 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to get context binary info. binaryInfoSize = 0
2024-08-07 17:34:38.711 9745-9772 LlamaNative com.test.llama I Extracting graphsInfo for graph Idx: 0
2024-08-07 17:34:38.711 9745-9772 LlamaNative com.test.llama I Extracting tensorInfo for tensor tensorsCount : 516
2024-08-07 17:34:38.711 9745-9772 LlamaNative com.test.llama I Extracting tensorInfo for tensor tensorsCount : 513
2024-08-07 17:34:38.712 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to copy metadata. m_graphsCount[5] = 1
2024-08-07 17:34:38.716 9745-9772 com.test.llama com.test.llama I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2331: remote_handle_control_domain: requested QOS 1, latency 100 for domain 3 handle 0x7cfa118450
2024-08-07 17:34:38.716 9745-9772 com.test.llama com.test.llama I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:1872: manage_poll_qos: poll mode updated to 3 for domain 3, handle 0x7cfa118450 for timeout 9999
2024-08-07 17:34:38.716 9745-9772 com.test.llama com.test.llama I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2331: remote_handle_control_domain: requested QOS 3, latency 9999 for domain 3 handle 0x7cfa118450
2024-08-07 17:34:39.218 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to create context from binary.
2024-08-07 17:34:39.218 9745-9772 LlamaNative com.test.llama I [taeyeon] graphRetrieve graphName: tmp5a2tztgk
2024-08-07 17:34:39.218 9745-9772 LlamaNative com.test.llama I [taeyeon] graphRetrieve graphName: tmp5a2tztgk
2024-08-07 17:34:39.278 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to allocate memory. bufferSize = 821002864
2024-08-07 17:34:40.172 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to read binary data = /data/local/tmp/sample_app/llama_v2_7b_chat_quantized_TokenGenerator_3_Quantized.bin
2024-08-07 17:34:40.172 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to get context binary info. binaryInfoSize = 0
2024-08-07 17:34:40.172 9745-9772 LlamaNative com.test.llama I Extracting graphsInfo for graph Idx: 0
2024-08-07 17:34:40.173 9745-9772 LlamaNative com.test.llama I Extracting tensorInfo for tensor tensorsCount : 516
2024-08-07 17:34:40.173 9745-9772 LlamaNative com.test.llama I Extracting tensorInfo for tensor tensorsCount : 513
2024-08-07 17:34:40.173 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to copy metadata. m_graphsCount[6] = 1
2024-08-07 17:34:40.177 9745-9772 com.test.llama com.test.llama I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2331: remote_handle_control_domain: requested QOS 1, latency 100 for domain 3 handle 0x7cfa118450
2024-08-07 17:34:40.178 9745-9772 com.test.llama com.test.llama I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:1872: manage_poll_qos: poll mode updated to 3 for domain 3, handle 0x7cfa118450 for timeout 9999
2024-08-07 17:34:40.178 9745-9772 com.test.llama com.test.llama I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2331: remote_handle_control_domain: requested QOS 3, latency 9999 for domain 3 handle 0x7cfa118450
2024-08-07 17:34:40.722 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to create context from binary.
2024-08-07 17:34:40.722 9745-9772 LlamaNative com.test.llama I [taeyeon] graphRetrieve graphName: tmpxdxtl6kr
2024-08-07 17:34:40.722 9745-9772 LlamaNative com.test.llama I [taeyeon] graphRetrieve graphName: tmpxdxtl6kr
2024-08-07 17:34:40.783 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to allocate memory. bufferSize = 952636560
2024-08-07 17:34:41.809 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to read binary data = /data/local/tmp/sample_app/llama_v2_7b_chat_quantized_TokenGenerator_4_Quantized.bin
2024-08-07 17:34:41.810 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to get context binary info. binaryInfoSize = 0
2024-08-07 17:34:41.810 9745-9772 LlamaNative com.test.llama I Extracting graphsInfo for graph Idx: 0
2024-08-07 17:34:41.810 9745-9772 LlamaNative com.test.llama I Extracting tensorInfo for tensor tensorsCount : 516
2024-08-07 17:34:41.810 9745-9772 LlamaNative com.test.llama I Extracting tensorInfo for tensor tensorsCount : 513
2024-08-07 17:34:41.810 9745-9772 LlamaNative com.test.llama I [taeyeon] succeed to copy metadata. m_graphsCount[7] = 1
2024-08-07 17:34:41.814 9745-9772 com.test.llama com.test.llama I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2331: remote_handle_control_domain: requested QOS 1, latency 100 for domain 3 handle 0x7cfa118450
2024-08-07 17:34:41.815 9745-9772 com.test.llama com.test.llama I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:1872: manage_poll_qos: poll mode updated to 3 for domain 3, handle 0x7cfa118450 for timeout 9999
2024-08-07 17:34:41.815 9745-9772 com.test.llama com.test.llama I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2331: remote_handle_control_domain: requested QOS 3, latency 9999 for domain 3 handle 0x7cfa118450
2024-08-07 17:34:42.300 9745-9772 com.test.llama com.test.llama E vendor/qcom/proprietary/adsprpc/src/fastrpc_mem.c:511: Error 0x1: fastrpc_mmap failed to map buffer fd 178, addr 0x79851b2000, length 0x38a00000, domain 3, flags 0x3, ioctl ret 0xffffffff, errno Bad address
2024-08-07 17:34:42.304 9745-9772 com.test.llama com.test.llama I vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:3407: open_device_node: no access to default device of domain 3, open thru HAL, (sess_id 2)
2024-08-07 17:34:42.305 9745-9772 dsp-client com.test.llama E DspClient.cpp (127): Error: open_hal_session: invalid argument(s): client instance 0x7ea9f89530, domain 11
2024-08-07 17:34:42.305 9745-9772 com.test.llama com.test.llama E vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:3426: Error 0x0: open_device_node failed for domain ID 11, sess ID 2 (errno 13, Permission denied)
2024-08-07 17:34:42.305 9745-9772 com.test.llama com.test.llama E vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2088::Error: 0x200: 0 <= (dev = open_device_node((int)domain))
2024-08-07 17:34:42.305 9745-9772 com.test.llama com.test.llama W vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2145:Warning 0x200: remote_get_info failed to get attribute 1 for domain 11 (errno Permission denied)
2024-08-07 17:34:42.305 9745-9772 com.test.llama com.test.llama E vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2371: Error 0x200: remote_handle_control_domain failed for request ID 2 on domain 3 (errno Permission denied)
2024-08-07 17:34:42.305 9745-9772 com.test.llama com.test.llama E vendor/qcom/proprietary/adsprpc/src/fastrpc_apps_user.c:2382: Error 0x200: remote_handle_control failed for request ID 2 (errno Permission denied)
2024-08-07 17:34:42.308 9745-9772 LlamaNative com.test.llama E Could not create context from binary.
2024-08-07 17:34:42.308 9745-9772 LlamaNative com.test.llama E Cleaning up graph Info structures.
2024-08-07 17:34:42.382 9745-9772 LlamaNative com.test.llama E ERROR Create From Binary failure
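
For reference, the four `bufferSize` values reported in the log above can be added up to see how many bytes of binary data are held once all four files have been read. The sketch below only does that arithmetic on numbers copied from the log; `kBufferSizes` and `bytesHeldAfter` are names made up for this example, and whether the total has anything to do with the `fastrpc_mmap` error is an assumption that the log itself does not confirm.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>

// bufferSize values copied from the log above, in load order
// (TokenGenerator_1 .. TokenGenerator_4).
const std::array<std::uint64_t, 4> kBufferSizes = {
    1083146928ULL, 821006960ULL, 821002864ULL, 952636560ULL};

// Sum of the first `count` buffer sizes, i.e. the bytes of binary data
// read so far after `count` files have been loaded.
inline std::uint64_t bytesHeldAfter(std::size_t count) {
  assert(count <= kBufferSizes.size());
  return std::accumulate(kBufferSizes.begin(),
                         kBufferSizes.begin() + static_cast<std::ptrdiff_t>(count),
                         std::uint64_t{0});
}
```

`bytesHeldAfter(3)` gives 2725156752 bytes and `bytesHeldAfter(4)` gives 3677793312 bytes, so the failure appears while the fourth binary is being added on top of roughly 2.7 GB already read.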

Best regards,

@taeyeonlee What command/script are you running here?

@bhushan23
QnnSampleApp::createFromBinary() is called as shown below.

std::vector<size_t> models = {app->MODEL_TokenGenerator_1, app->MODEL_TokenGenerator_2, app->MODEL_TokenGenerator_3, app->MODEL_TokenGenerator_4};
if (sample_app::StatusCode::SUCCESS != app->createFromBinary(models)) {

QnnSampleApp.cpp is modified as follows.

            
sample_app::StatusCode sample_app::QnnSampleApp::createFromBinary(std::vector<size_t> models) {

  if (nullptr == m_qnnFunctionPointers.qnnSystemInterface.systemContextCreate ||
      nullptr == m_qnnFunctionPointers.qnnSystemInterface.systemContextGetBinaryInfo ||
      nullptr == m_qnnFunctionPointers.qnnSystemInterface.systemContextFree) {
    ALOGE("QNN System function pointers are not populated.");
    return StatusCode::FAILURE;
  }

  auto returnStatus = StatusCode::SUCCESS;
  QnnSystemContext_Handle_t sysCtxHandle{nullptr};
  if (QNN_SUCCESS != m_qnnFunctionPointers.qnnSystemInterface.systemContextCreate(&sysCtxHandle)) {
    ALOGE("Could not create system handle.");
    return StatusCode::FAILURE;
  }

  for (size_t i = 0; i < models.size(); i++) {
    size_t model_index = models[i];
    uint64_t bufferSize{0};
    std::shared_ptr<uint8_t> buffer{nullptr};
    // read serialized binary into a byte buffer
    tools::datautil::StatusCode status{tools::datautil::StatusCode::SUCCESS};
    std::tie(status, bufferSize) = tools::datautil::getFileSize(m_cachedBinaryPath[model_index]);
    if (0 == bufferSize) {
      ALOGE("Received path to an empty file. Nothing to deserialize.");
      return StatusCode::FAILURE;
    }

    buffer = std::shared_ptr<uint8_t>(new uint8_t[bufferSize], std::default_delete<uint8_t[]>());
    if (!buffer) {
      ALOGE("Failed to allocate memory.");
      return StatusCode::FAILURE;
    }

    status = tools::datautil::readBinaryFromFile(
        m_cachedBinaryPath[model_index], reinterpret_cast<uint8_t*>(buffer.get()), bufferSize);
    if (status != tools::datautil::StatusCode::SUCCESS) {
      ALOGE("Failed to read binary data.");
      return StatusCode::FAILURE;
    }

    const QnnSystemContext_BinaryInfo_t* binaryInfo{nullptr};
    Qnn_ContextBinarySize_t binaryInfoSize{0};
    if (StatusCode::SUCCESS == returnStatus &&
        QNN_SUCCESS != m_qnnFunctionPointers.qnnSystemInterface.systemContextGetBinaryInfo(
                          sysCtxHandle,
                          static_cast<void*>(buffer.get()),
                          bufferSize,
                          &binaryInfo,
                          &binaryInfoSize)) {
      ALOGE("Failed to get context binary info");
      returnStatus = StatusCode::FAILURE;
    }

    // fill GraphInfo_t based on binary info
    if (StatusCode::SUCCESS == returnStatus) {
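      // Initialize the out-parameters so they hold defined values even if
      // copyMetadataToGraphsInfo fails before writing to them.
      qnn_wrapper_api::GraphInfo_t** graphsInfo{nullptr};
      uint32_t graphsCount{0};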
      if (!copyMetadataToGraphsInfo(binaryInfo, graphsInfo, graphsCount)) {
        ALOGE("Failed to copy metadata.");
        returnStatus = StatusCode::FAILURE;
      } else {
        m_graphsInfo[model_index] = graphsInfo;
        m_graphsCount[model_index] = graphsCount;
      }
    }

    if (StatusCode::SUCCESS == returnStatus &&
        nullptr == m_qnnFunctionPointers.qnnInterface.contextCreateFromBinary) {
      ALOGE("contextCreateFromBinaryFnHandle is nullptr.");
      returnStatus = StatusCode::FAILURE;
    }

    // Only attempt context creation (and only log success) if every earlier
    // step succeeded. Note that every iteration writes the new handle into
    // the same m_context member.
    if (StatusCode::SUCCESS == returnStatus) {
      if (QNN_SUCCESS != m_qnnFunctionPointers.qnnInterface.contextCreateFromBinary(
                             m_backendHandle,
                             m_deviceHandle,
                             (const QnnContext_Config_t**)&m_contextConfig,
                             static_cast<void*>(buffer.get()),
                             bufferSize,
                             &m_context,
                             m_profileBackendHandle)) {
        ALOGE("Could not create context from binary.");
        returnStatus = StatusCode::FAILURE;
      } else {
        ALOGI("[taeyeon] succeed to create context from binary.");
      }
    }

    if (ProfilingLevel::OFF != m_profilingLevel) {
      extractBackendProfilingInfo(m_profileBackendHandle);
    }

    if (StatusCode::SUCCESS == returnStatus) {
      for (size_t graphIdx = 0; graphIdx < m_graphsCount.at(model_index); graphIdx++) {
        if (nullptr == m_qnnFunctionPointers.qnnInterface.graphRetrieve) {
          ALOGE("graphRetrieveFnHandle is nullptr.");
          returnStatus = StatusCode::FAILURE;
          break;
        }
        ALOGI("[taeyeon] graphRetrieve graphName: %s", (*m_graphsInfo.at(model_index))[graphIdx].graphName);
        if (QNN_SUCCESS !=
            m_qnnFunctionPointers.qnnInterface.graphRetrieve(
                m_context, (*m_graphsInfo.at(model_index))[graphIdx].graphName, &((*m_graphsInfo.at(model_index))[graphIdx].graph))) {
          // graphIdx is a size_t, so it is printed with %zu rather than %d
          ALOGE("Unable to retrieve graph handle for graph Idx: %zu", graphIdx);
          returnStatus = StatusCode::FAILURE;
        }
        ALOGI("[taeyeon] graphRetrieve graphName: %s", (*m_graphsInfo.at(model_index))[graphIdx].graphName);
      }
    }

    if (StatusCode::SUCCESS != returnStatus) {
      ALOGE("Cleaning up graph Info structures.");
      qnn_wrapper_api::freeGraphsInfo(&m_graphsInfo.at(model_index), m_graphsCount.at(model_index));
      break;  // do not attempt the remaining binaries after a failure
    }
  }

  m_qnnFunctionPointers.qnnSystemInterface.systemContextFree(sysCtxHandle);
  sysCtxHandle = nullptr;

  m_isContextCreated = true;

  return returnStatus;
}            
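
One detail of the function above is that every loop iteration passes `&m_context` to `contextCreateFromBinary`, so each new handle is written to the same member and only the last one is still referenced after the loop. A minimal sketch of keeping one handle per model index is shown below. `FakeContextHandle`, `fakeCreateContext`, and `createAll` are hypothetical stand-ins rather than QNN SDK names, and this sketch is not claimed to resolve the `fastrpc_mmap` error in the log.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical stand-in for a context handle; not the real QNN handle type.
using FakeContextHandle = int;

// Hypothetical stand-in for context creation: writes a new handle to *out.
inline bool fakeCreateContext(std::size_t modelIndex, FakeContextHandle* out) {
  assert(out != nullptr);
  *out = 100 + static_cast<int>(modelIndex);
  return true;
}

// Create one context per model and keep every handle, keyed by model index,
// so that each context can later be used and freed individually.
inline bool createAll(const std::vector<std::size_t>& models,
                      std::map<std::size_t, FakeContextHandle>& contexts) {
  for (std::size_t modelIndex : models) {
    FakeContextHandle handle = 0;
    if (!fakeCreateContext(modelIndex, &handle)) {
      return false;  // stop at the first failure
    }
    contexts[modelIndex] = handle;  // earlier handles are not overwritten
  }
  return true;
}
```

Keying the handles by model index mirrors how `m_graphsInfo` and `m_graphsCount` are already indexed in the function above.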

Hi @taeyeonlee, please follow https://github.com/quic/ai-hub-models/tree/main/qai_hub_models/models/llama_v2_7b_chat_quantized/gen_ondevice_llama to run llama2 models on device with Genie.

We will soon be releasing a C++ app that uses the Genie C++ APIs.