NordicPlayground/nRF51-ble-bcast-mesh

Race condition in handle_storage_local_packet_push

twvd opened this issue · 3 comments

twvd commented

Hello,

While debugging the mesh code due to stability issues we are facing, I was getting error checks because a packet reference counter was 0.
The following code (in handle_storage_local_packet_push) caught my eye:

    uint32_t error_code = event_handler_push(&evt);
    if (error_code == NRF_SUCCESS)
    {
        mesh_packet_ref_count_inc(p_packet);
    }

The reference counter is incremented here after checking the error code from event_handler_push, however; event_handler_push can potentially trigger the QDEC event handler IRQ immediately due to the following line in event_handler_push:

    NVIC_SetPendingIRQ(EVENT_HANDLER_IRQ);

Looking further, the event callback added for the new packet (local_packet_push) decrements the reference counter once:

    mesh_packet_ref_count_dec(p_packet); /* for the event queue */

So, in my understanding; it is possible for the event handler IRQ to decrement the reference counter to 0 before mesh_packet_ref_count_inc(p_packet) is called in handle_storage_local_packet_push; causing the APP_ERROR-check in mesh_packet_ref_count_inc because the reference counter is 0 at that point.
The solution would be incrementing the reference counter BEFORE calling event_handler_push, like so:

    mesh_packet_ref_count_inc(p_packet);
    uint32_t error_code = event_handler_push(&evt);
    if (error_code != NRF_SUCCESS)
    {
        mesh_packet_ref_count_dec(p_packet);
    }

This solves some stability issues we are having, but not all. Can anyone confirm this issue?

Hi there,

That's a nice catch, and your fix is perfectly valid. We'll try and merge it in for the next release.

twvd commented

Good to know, I wasn't sure because it works fine 99,9% of the time. Not sure why, probably bad timing influenced by event queue length and timeslots.

I spotted 2 more of these, see this pull request: #151

Looks great, thanks! 👍