Race condition in handle_storage_local_packet_push
twvd opened this issue · 3 comments
Hello,
While debugging the mesh code due to stability issues we are facing, I was getting error checks because a packet reference counter was 0.
The following code (in handle_storage_local_packet_push) caught my eye:
uint32_t error_code = event_handler_push(&evt);
if (error_code == NRF_SUCCESS)
{
mesh_packet_ref_count_inc(p_packet);
}
The reference counter is incremented here after checking the error code from event_handler_push, however; event_handler_push can potentially trigger the QDEC event handler IRQ immediately due to the following line in event_handler_push:
NVIC_SetPendingIRQ(EVENT_HANDLER_IRQ);
Looking further, the event callback added for the new packet (local_packet_push) decrements the reference counter once:
mesh_packet_ref_count_dec(p_packet); /* for the event queue */
So, in my understanding; it is possible for the event handler IRQ to decrement the reference counter to 0 before mesh_packet_ref_count_inc(p_packet) is called in handle_storage_local_packet_push; causing the APP_ERROR-check in mesh_packet_ref_count_inc because the reference counter is 0 at that point.
The solution would be incrementing the reference counter BEFORE calling event_handler_push, like so:
mesh_packet_ref_count_inc(p_packet);
uint32_t error_code = event_handler_push(&evt);
if (error_code != NRF_SUCCESS)
{
mesh_packet_ref_count_dec(p_packet);
}
This solves some stability issues we are having, but not all. Can anyone confirm this issue?
Hi there,
That's a nice catch, and your fix is perfectly valid. We'll try and merge it in for the next release.
Good to know, I wasn't sure because it works fine 99,9% of the time. Not sure why, probably bad timing influenced by event queue length and timeslots.
I spotted 2 more of these, see this pull request: #151
Looks great, thanks! 👍