thin-edge/thin-edge.io

too many software_list commands are created on tedge-mapper c8y startup when using nanomq

reubenmiller opened this issue · 10 comments

Describe the bug

Update this behaviour is only observed when using the nanomq MQTT broker. The behaviour is not observed when using mosquitto.

In a highly customized setup (details of the setup are shown later), Starting the tedge-mapper c8y (for the first time) triggers multiple te/device/main///cmd/software_list commands on startup, 29 software_list commands were created within 1 second.

For example, the following topics were created:

te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.893595279Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.894731237Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.895986612Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.897403279Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.898662695Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.899841445Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.900913403Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.901975486Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.903152694Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.904377861Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.905543861Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.906903694Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.90813136Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.909227068Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.910274943Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.911272818Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.912283985Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.913332985Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.914694693Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.916071984Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.917349984Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.918492192Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.919662234Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.920877775Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.922001275Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.923084441Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.924247525Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.925334524Z
te/device/main///cmd/software_list/c8y-mapper-2024-06-19T18:06:24.926390358Z

The full logs are available here

Note the bug has been observed in a highly customized environment, and it is currently unclear what is the root cause of the error:

  • The rootfs is read-only except for, /data, /tmp and /run partitions
  • binaries are packed using upx to reduce disk size
    • tedge (11MB -> 3.5MB)
    • nanomq (~2.5MB -> 621KB)
  • tedge is installed under a custom location:
    • binaries are stored under /data/bin
    • tedge configuration is stored under /data/etc/tedge
  • tedge components are started manually and run under the root user (as the tedge user does not exist, and can't be added)
  • nanomq is used for the local MQTT broker
  • the tedge-mapper is used with the built-in mapper function enabled

To Reproduce

The following requires following repository, thin-edge/debug-tedge-readonly as there are a lot of customizations:

It also currently relies on a aarch64/arm64 host to run the example (though once the example is better understood, an system test case can be created).

  1. Start up the docker container

    just up
  2. Open a shell in the container, and configure the

    just shell
    ./entrypoint.sh

    You should follow the on-screen instructions to then upload the device certificate

  3. Start the custom MQTT broker

    ./bin/nanomq start
  4. Open a new shell

    just shell
  5. Start both the tedge-mapper and tedge-agent processes (not this is starting wrappers)

    source ./entrypoint.sh
    tedge_agent &
    tedge_mapper_c8y &

Expected behavior

On startup, only one software_list command should be created.

Screenshots

Environment (please complete the following information):

Property Value
OS [incl. version] Alpine Linux v3.18
Hardware [incl. revision] container
System-Architecture Linux 0e1925ccd1a0 6.5.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 22:58:57 UTC 2024 aarch64 GNU/Linux
thin-edge.io version tedge 1.1.2~127+g39857ac

Additional context

Having a look to the log

  1. This log is a mix of agent and mapper output.
    That's okay, but it would help to also have the MQTT messages exchange between the two processes.
    I.e. to gather the output of:
    $ tedge mqtt sub 'te/#' &
    $ tedge_mapper_c8y &
    $ tedge_agent &
  2. The mapper issued a single software list request c8y-mapper-2024-06-19T18
  3. But the agent processed this request 29 times!
  4. A weird but a systematic pattern. The request is
    • scheduled once
    • rejected 9 times (because being duplicated)
    • reported as successful (awaiting to be cleared by the mapper)
  5. The agent starts by reporting 29 successful requests waiting to be cleared.
    • These requests are resumed on restart
    • Did the mapper failed to acknowledge those during the previous run?
    • Is this number of 29 increasing after each run?
    • It would be good to see the persisted operations in /etc/tedge/.agent/workflows
      (in this specific setting: /data/etc/tedge/.agent/workflows).
  6. There are weird rumqttc errors:
    • INFO rumqttc::state: Collision on packet id = 7
    • It would be good to know if on the agent, the mapper or the bridge
  7. Harmless:
    • WARN tedge_config::sudo: sudo.enableset totrue, but sudo not found in $PATH

@didier-wenzek I've added logs to the repository for the individual components, and included the .agent/workflow file after the test was stopped

Though the nanomq.log has a lot of error messages, so maybe something is not correct in the nanomq build, as it does not seem to respect the QoS (for unknown reasons).

The agent is persisting (in .agent/workflows) plenty of successful software list requests.

  • The agent never received clearing messages for those.
  • All these requests have been generated by the mapper in less then a 1 second!

The MQTT log confirms that:

  • all the software list requests generated by the mapper are properly executed but never cleared:
    there is no empty retain message published on the te/device/main///cmd/software_list/+ topics.
  • All the retain messages are duplicated ad nauseam:
    the mapper reacts to each software update capability by generating a software list request.

Here is a first difference between mosquitto vs nanomq that explains message duplicates:

If a client subscribes to overlapping topic filters (say test/+/xxx and test/xxx/+),
then a message published on the intersection (say on test/xxx/xxx):

  • is delivered once to the subscriber by mosquitto
  • but is delivered as many as matching subscriptions by nanomq

And indeed the c8y mapper is subscribing to overlapping topic filters:

te/+/+/+/+/cmd/software_list/+
te/+/+/+/+/cmd/software_list
te/device/main///cmd/software_list/+

If a client subscribes to overlapping topic filters (say test/+/xxx and test/xxx/+),
then a message published on the intersection (say on test/xxx/xxx):

  • is delivered once to the subscriber by mosquitto
  • but is delivered as many as matching subscriptions by nanomq

According to the MQTT specs, both behavior are accepted:

it is possible for a Client’s subscriptions to overlap so that a published message might match multiple filters. In this case the Server MUST deliver the message to the Client respecting the maximum QoS of all the matching subscriptions [MQTT-3.3.5-1]. In addition, the Server MAY deliver further copies of the message, one for each additional matching subscription and respecting the subscription’s QoS in each case.

The latest version of NanoMQ has an incorrect behavior when a client has several subscriptions.
It incorrectly delivers a copy of the message for each subscription even if unrelated :

$ docker run -d --name nanomq -p 1883:1883 -p 8083:8083 -p 8883:8883 emqx/nanomq:latest
$ mosquitto_pub -r -t sport -m tennis
$ mosquitto_sub -t sport
tennis
^C
$ mosquitto_sub -t unrelated
^C
$ mosquitto_sub -t sport -t unrelated
tennis
tennis
^C

A ticket has been created in the nanomq, nanomq/nanomq#1817, where it seems there was a recent change that introduced this erroneous behaviour for retained messages

Sorry to break this. 0.22.0 is a beta release, hopefully I didn't make too much trouble for you

Sorry to break this. 0.22.0 is a beta release, hopefully we I didn't make too much trouble for you

All good. We actually stumbled across it as we were building a static binary from the main branch…it took us a while to find out the reason for the unexpected behaviour, but glad we’re able to help identify the issue in the end.

QA tested the bug and it is not reproducable, creating test case to cover this bug is in progress