Unexpected token after long running block transfer
anyc opened this issue · 9 comments
Hello,
I am currently investigating an issue with my libcoap-based server and client. The client sends a firmware update to the server and waits for a response if the update was successful. After the file was received, the server will process the update and send the answer after the update finished. On the client side, after the update finished I first see a XMIT_BLOCK_FAIL event and then I receive a PDU with an unexpected token. To me it looks like the original token is not correctly restored. Please see the following example where you can see the last transmitted block and then the errors:
Jul 11 18:38:14.548 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: mid=0xdcf4: added to retransmit queue (2500ms)
Jul 11 18:38:14.548 DEBG * 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: netif: recv 46 bytes
v:1 t:ACK c:2.31 i:dcf4 {4c90000000000003} [ Block1:85135/M/1024 ]
Jul 11 18:38:14.548 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: mid=0xdcf4: removed (1)
Jul 11 18:38:14.548 DEBG found Block option, block size is 1024, block nr. 85135
Jul 11 18:38:14.548 DEBG * 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: netif: sent 599 bytes
Jul 11 18:38:14.548 DEBG * 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: dtls: sent 570 bytes
v:1 t:CON c:GET i:dcf5 {4c91000000000003} [ Uri-Path:v1, Uri-Path:system, Uri-Path:fw_update, Block1:85136/_/1024, Size1:87179784, Request-Tag:0x030c538f ] :: binary data length 520
<<00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000>>
<<. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . >>
Jul 11 18:38:14.548 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: mid=0xdcf5: added to retransmit queue (2813ms)
Jul 11 18:38:14.549 DEBG * 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: netif: recv 33 bytes
v:1 t:ACK c:0.00 i:dcf5 {} [ ]
Jul 11 18:38:14.549 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: mid=0xdcf5: removed (1)
Jul 11 18:38:51.449 DEBG ** 0.0.0.0:45416 <-> 224.0.1.187:5683 UDP : lg_crcv 0x135c840 released
Jul 11 18:38:51.450 DEBG ** 0.0.0.0:52842 <-> 224.0.1.187:5683 UDP : lg_crcv 0x132a850 released
Jul 11 18:38:51.450 DEBG ** 0.0.0.0:56228 <-> 224.0.1.187:5683 UDP : lg_crcv 0x132bb80 released
Jul 11 18:39:03.855 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: lg_crcv 0x13e3cb0 released
Jul 11 18:39:47.552 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: lg_xmit 0x13dfb10 released
Jul 11 18:39:47.552 DEBG ***EVENT: COAP_EVENT_XMIT_BLOCK_FAIL
Jul 11 18:39:54.906 DEBG * 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: netif: recv 54 bytes
v:1 t:CON c:2.05 i:aeb5 {4c91000000000003} [ Block1:85136/_/1024 ] :: 'success'
unexpected rxtoken 216172782113821004 b'L\x91\x00\x00\x00\x00\x00\x03'
Jul 11 18:39:54.907 DEBG * 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: netif: sent 33 bytes
Jul 11 18:39:54.907 DEBG * 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: dtls: sent 4 bytes
v:1 t:RST c:0.00 i:aeb5 {} [ ]
As far as I understood the process, libcoap only removed one byte from the token and not two bytes. Did I understand this correctly? If yes, do you have an idea what I might be doing wrong here? If I use coap-client I do not get an error regarding a wrong token but after looking at the coap-client code, I did not see something obvious.
On the server side I use block mode COAP_BLOCK_USE_LIBCOAP
and the client uses COAP_BLOCK_USE_LIBCOAP | COAP_BLOCK_SINGLE_BODY
.
Thank you!
Ah, I missed the 'L' character in my output. So the token still contains both bytes that are used for the block transfer.
If it helps, the output of coap-client:
Jul 12 11:02:19.859 DEBG ** [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: mid=0xcd14: removed (1)
Jul 12 11:02:19.859 DEBG found Block option, block size is 1024, block nr. 85135
Jul 12 11:02:19.859 DEBG * [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: netif: sent 599 bytes
Jul 12 11:02:19.859 DEBG * [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: dtls: sent 570 bytes
v:1 t:CON c:GET i:cd15 {4c91000000000003} [ Uri-Path:v1, Uri-Path:system, Uri-Path:fw_update, Block1:85136/_/1024, Size1:87179784, Request-Tag:0xddef4589 ] :: binary data length 520
<<00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000>>
<<. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . >>
Jul 12 11:02:19.859 DEBG ** [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: mid=0xcd15: added to retransmit queue (2719ms)
Jul 12 11:02:19.859 DEBG * [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: netif: recv 33 bytes
v:1 t:ACK c:0.00 i:cd15 {} [ ]
Jul 12 11:02:19.859 DEBG ** [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: mid=0xcd15: removed (1)
Jul 12 11:03:52.864 DEBG ** [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: lg_xmit 0x63b9b277ce60 released
Jul 12 11:03:52.864 DEBG ***EVENT: COAP_EVENT_XMIT_BLOCK_FAIL
Jul 12 11:04:02.047 DEBG * [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: netif: recv 54 bytes
v:1 t:CON c:2.05 i:1d68 {4c91000000000003} [ Block1:85136/_/1024 ] :: 'success'
Jul 12 11:04:02.047 DEBG PDU presented to app.
v:1 t:CON c:2.05 i:1d68 {01} [ Block1:85136/_/1024 ] :: 'success'
Jul 12 11:04:02.051 DEBG ** [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: lg_crcv 0x63b9b277d440 released
Jul 12 11:04:02.051 DEBG ** process incoming 2.05 response:
successJul 12 11:04:02.051 DEBG * [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: netif: sent 33 bytes
Jul 12 11:04:02.051 DEBG * [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: dtls: sent 4 bytes
v:1 t:ACK c:0.00 i:1d68 {} [ ]
Jul 12 11:04:02.051 DEBG * [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: netif: sent 31 bytes
Jul 12 11:04:02.051 INFO * [fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: SSL3 alert write:warning:close notify
Jul 12 11:04:02.051 DEBG ***EVENT: COAP_EVENT_DTLS_CLOSED
Jul 12 11:04:02.051 DEBG ***[fe80::4dd8:5d0a:e344:f549]:60644 <-> [fe80::20f:69ff:feff:d6c4]:5684 DTLS: session 0x63b9b2766e70: closed
The XMIT_BLOCK_FAIL
seems to be "normal".
Several things here.
- We are not tracking all the blocks in the state token - so there will be a wrap around in the top 2 bytes in your case of a large file. Could be fixed, but does not help you here (85135 is 0x1c48f)
- I would expect you to be using PUT or POST for your file upload, not GET with a payload.
- It is the responsibility of the underlying libcoap client code to map the state token being used back to the original client token used at the start of the request - which is what has (correctly) happened here for coap-client (but not your client)
- In the client, the
lg_xmit
tracking is released after MAX_TRANSMIT_WAIT (93 seconds) following the empty ACK response for the last transmitted block (No 2.xx code received). Hence COAP_EVENT_XMIT_BLOCK_FAIL.
I think the clue is (which does not happen with coap-client)
Jul 11 18:39:03.855 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: lg_crcv 0x13e3cb0 released
Which has timed out only 49 secs after
Jul 11 18:38:14.549 DEBG * 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: netif: recv 33 bytes
v:1 t:ACK c:0.00 i:dcf5 {} [ ]
Once lg_crcv
goes away, the logic has lost how to put back the correct token. However,
Jul 11 18:39:47.552 DEBG ** 192.168.0.13:54986 <-> 192.168.1.161:5684 DTLS: lg_xmit 0x13dfb10 released
which happened 93 seconds later indicates no change to MAX_RETRANSMIT which is used by both the lg_crcv
and lg_xmit
timeout logic. Looking into why lg_crcv
timed out in your client code.
See #1469 which fixes the early timeout of lg_crcv
after 49 seconds (my guess is that the file transfer started 44 secs previously). Please confirm if this helps.
However, If the time before Success is received after receipt of the empty ACK is greater than MAX_RETRANSMIT (93 secs), then there are still likely to be issues.
It may help here to send something like "Installing" after receipt of the firmware, and then send an unsolicited "Success" (or "Failure" or whatever) when the operation has completed (without any Block1 options).
The Server will need to track the initial Token used for block 0 and then use this Token for the unsolicited response. The client will need to keep track of the initial Token to accept the unsolicited response.
Alternatively the server can maintain a status of the installation (perhaps on the same URI, but with ?status
appended) which can then be "Observed" with the server sending the "Success" when ready (with Observe option removed to signal no more).
Thank you for looking into this!
Unfortunately, it looks the same to me:
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . >>
Jul 12 18:31:02.604 DEBG ** 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: mid=0xf0c2: added to retransmit queue (2406ms)
Jul 12 18:31:02.604 DEBG * 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: netif: recv 33 bytes
v:1 t:ACK c:0.00 i:f0c2 {} [ ]
Jul 12 18:31:02.604 DEBG ** 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: mid=0xf0c2: removed (1)
Jul 12 18:31:33.273 DEBG ** 0.0.0.0:45388 <-> 224.0.1.187:5683 UDP : lg_crcv 0x13a4c00 released
Jul 12 18:31:33.274 DEBG ** 0.0.0.0:35935 <-> 224.0.1.187:5683 UDP : lg_crcv 0x19ae0c0 released
Jul 12 18:31:33.274 DEBG ** 0.0.0.0:45326 <-> 224.0.1.187:5683 UDP : lg_crcv 0x19ae970 released
Jul 12 18:32:35.608 DEBG ** 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: lg_crcv 0x1a333f0 released
Jul 12 18:32:35.611 DEBG ** 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: lg_xmit 0x1557340 released
Jul 12 18:32:35.611 DEBG ***EVENT: COAP_EVENT_XMIT_BLOCK_FAIL
Jul 12 18:32:49.701 DEBG * 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: netif: recv 54 bytes
v:1 t:CON c:2.05 i:7f2e {4c89000000000005} [ Block1:85128/_/1024 ] :: 'success'
asd
unexpected rxtoken 360287970189674828 b'L\x89\x00\x00\x00\x00\x00\x05'
Jul 12 18:32:49.701 DEBG * 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: netif: sent 33 bytes
Jul 12 18:32:49.701 DEBG * 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: dtls: sent 4 bytes
v:1 t:RST c:0.00 i:7f2e {} [ ]
Regarding:
It is the responsibility of the underlying libcoap client code to map the state token being used back to the original client token used at the start of the request - which is what has (correctly) happened here for coap-client (but not your client)
In the client code:
libcoap/examples/coap-client.c
Line 409 in bfdbdc3
the response handler immediately checks if the token is in its list of tokens without applying a mask or something to ignore the most significant two bytes. Hence, I thought libcoap would prepare received_pdu->token
for the application.
In the coap-client output, there seems to be another steps involved before the application callback is executed:
v:1 t:CON c:2.05 i:1d68 {4c91000000000003} [ Block1:85136/_/1024 ] :: 'success'
Jul 12 11:04:02.047 DEBG PDU presented to app.
v:1 t:CON c:2.05 i:1d68 {01} [ Block1:85136/_/1024 ] :: 'success'
I tried to follow the code flow around the DEBG line but I my knowledge about libcoap/CoAP is still too small to understand what happens why here.
I will try this again on Monday, thank you!
Now the lg_crcv
is not being removed after 49 secs, but after 93 secs (MAX_TIMEOUT) so that bit is working.
So
Jul 12 18:31:02.604 DEBG * 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: netif: recv 33 bytes
v:1 t:ACK c:0.00 i:f0c2 {} [ ]
... MAX_TIMEOUT (93 secs) later
Jul 12 18:32:35.608 DEBG ** 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: lg_crcv 0x1a333f0 released
Jul 12 18:32:35.611 DEBG ** 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: lg_xmit 0x1557340 released
Jul 12 18:32:35.611 DEBG ***EVENT: COAP_EVENT_XMIT_BLOCK_FAIL
... 107 seconds after empty ACK
Jul 12 18:32:49.701 DEBG * 192.168.178.130:40209 <-> 192.168.178.36:5684 DTLS: netif: recv 54 bytes
v:1 t:CON c:2.05 i:7f2e {4c89000000000005} [ Block1:85128/_/1024 ] :: 'success'
(107 secs > 93 secs) is why we are running into an issue.
In the coap-client output, there seems to be another steps involved before the application callback is executed:
If lg_crcv
or lg_xmit
are still available, then the original token is available and is used by the libcoap lib code to replace the token as indicated by DEBG PDU presented to app.
. Is your client and coap-client running on the same server, or is coap-client running faster and so is able to do the transfer in less than 93 secs?
It appears that your client is also sending other (multicast) traffic as well, shown by the additional lg_crvc being removed.
Jul 12 18:31:33.273 DEBG ** 0.0.0.0:45388 <-> 224.0.1.187:5683 UDP : lg_crcv 0x13a4c00 released
Jul 12 18:31:33.274 DEBG ** 0.0.0.0:35935 <-> 224.0.1.187:5683 UDP : lg_crcv 0x19ae0c0 released
Jul 12 18:31:33.274 DEBG ** 0.0.0.0:45326 <-> 224.0.1.187:5683 UDP : lg_crcv 0x19ae970 released
This may be adding in additional delays which is taking the client over 93 secs.
You may want to consider changing MAX_RETRANSMIT (from which MAX_TIMEOUT is derived) from the default of 4 to 5 by using coap_session_set_max_retransmit()
. Then MAX_TIMEOUT will become 186 secs.
The clients are running on the same machine, yes. My client also shows a simple GUI and initially sends a multicast PDU to discover servers in the network. I think there shouldn't be a performance difference between my client and coap-client.
So, am I understanding this correctly: if I use the async functions from libcoap, this just enables my client to postpone sending the response to a request and does not send some sort of "keepalive" message from time to time after the initial empty ACK response? So I could increase the timeout you mentioned but if the update procedure on the server could possibly take an arbitrary long time this is never truly safe and I should switch to a different signaling procedure like you suggested?
Thank you!
My suggestion is that you go for something like
POST /v1/system/fw_update
...
2.31
...
2.01 (or 2.04) Uploaded (Sent back as soon as upload has finished)
GET /v1/system/fw_update Set Observe option
2.05 Installing Observe count set
...
2.05 Rebooting Observe count set
...
2.05 Success Observe count set
GET /v1/system/fw_update Cancel Observe option
2.05 Success
The Observe can then report back whenever there is a status change for the F/W update process for however long that process takes.
Sounds good, thank you!