candle-usb/candleLight_fw

canable device hangs after the host computer reboots

xdaco opened this issue · 21 comments

xdaco commented

Hi All,

I have a very strange issue. I am using a canable clone with candleLight firmware. The clone is part of another PCB and the STM32F0 is powered separately instead of from USB 5 V. While the device is communicating with another CAN device, if the host computer reboots, the canable device hangs. The device still enumerates and gets configured correctly on the host computer, but I cannot send or receive any data until I power-cycle the STM32F0.

The obvious workaround would be to power the STM32F0 from USB 5 V, which is not possible here since the STM32F0 is part of another board. My question is: is there a way to reset the STM32F0 over USB from the host computer? Or is this the expected behavior?
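For what it's worth, the only host-side "reset" I know of is a plain USB port reset through usbfs, roughly like the sketch below (standard Linux usbdevice_fs ioctl, untested on my setup); it only resets the USB link, not the MCU, so I am not sure it would un-wedge the CAN side:

/* Sketch: issue a USB port reset from the host via the Linux usbdevice_fs
 * ioctl. This resets the USB device state only, not the STM32F0 itself. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/usbdevice_fs.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s /dev/bus/usb/BBB/DDD (bus/device numbers from lsusb)\n", argv[0]);
		return 1;
	}
	int fd = open(argv[1], O_WRONLY);
	if (fd < 0) { perror("open"); return 1; }
	if (ioctl(fd, USBDEVFS_RESET, 0) < 0) { perror("USBDEVFS_RESET"); close(fd); return 1; }
	close(fd);
	return 0;
}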

Hm, interesting edge case. I'm not well-versed in USB reset / suspend behaviour, but I think the firmware currently does nothing to detect and react to either USB Suspend or reset. In the case of a host reboot I don't think suspend is applicable...

Some of the relevant code is at https://github.com/candle-usb/candleLight_fw/blob/master/src/usbd_conf.c#L91/
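Thinking out loud: one place a "USB reset seen" notification could be raised is the PCD reset callback in that file. Rough sketch below, assuming the usual STM32 HAL / USBD glue; the usb_reset_seen flag and the main-loop recovery are hypothetical additions, not existing firmware code:

/* Sketch only -- assumes the standard HAL_PCD_ResetCallback() glue in usbd_conf.c.
 * The flag and its consumer are hypothetical, not current firmware code. */
#include <stdbool.h>

volatile bool usb_reset_seen = false;

void HAL_PCD_ResetCallback(PCD_HandleTypeDef *hpcd)
{
	USBD_LL_SetSpeed(hpcd->pData, USBD_SPEED_FULL);
	USBD_LL_Reset(hpcd->pData);   /* existing reset handling */
	usb_reset_seen = true;        /* new: tell the main loop the host reset the bus */
}

/* ...and somewhere in the main loop: */
void poll_usb_reset(void)
{
	if (usb_reset_seen) {
		usb_reset_seen = false;
		/* flush / re-arm the CAN peripheral and frame queues here so a
		 * freshly rebooted host finds the device in a clean state */
	}
}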

After rebooting, is the device still accessible ? i.e. shows up in lsusb, and responds to e.g. "ip link set can0 down"

xdaco commented

Hi @fenugrec Thanks for replying. The device is still available after reboot but at down state.
The device is visible with $ ifconfig -a and comes back when we do $ ip link set can0 up.
We cannot communicate with the bus: candump does not show any packets, even though there is an active slave on the bus sending packets continuously.

I will also have a look at the code you pointed out. If we can detect the USB reset / suspend, then we can do something to solve this problem.

That sounds a bit like some issues I had when the device is alone on the CAN bus (hence no ACK, and the peripheral gets stuck retransmitting forever). I assume it doesn't help if you bring the interface down and then up again? (while specifying the bitrate, just in case)

xdaco commented

Bringing the interface down and up again did not help.

xdaco commented

#46 seems to be solved with commit d13b6db

I'd really like to reproduce the issue here. I have two identical candleLight devices on 2 separate machines; I bring them up with "ip link set can0 up ..." on both, then do a quick "canfdtest" run to make sure they talk, then

  • stop canfdtest
  • reboot machine B
  • I can run canfdtest again without problems ?

Also tried to shut down machine B, in which case the device was turned completely off. I did need to "ip link set up ..." again, but it worked.

Can you explain again how you're getting the problem ?

xdaco commented

Hi @fenugrec
My use case was as follows:
"A CANopen CiA-402 slave is always on the bus with pre-configured TPDOs (meaning that whenever the slave boots up, it starts sending these predefined PDOs without waiting for a master). The host computer is connected to the bus through the candleLight adapter. If the host computer reboots without rebooting the slave, the adapter goes into a stale state where it neither receives nor sends any CAN packets, but from the kernel side the adapter still shows up as a healthy network interface."

Hmm. While your candleLight device was off, there was no other device to generate the ACK for the frames sent by your CANopen slave - it worked OK with that? Without an ACK, it should normally switch to a "bus off" state and stop sending frames, then probably go back to pre-operational state?

That's one thing I wasn't able to reproduce yet - currently I have 2 candlelights , and if one of them is off, the other one cannot continue sending frames since it switches to "bus off" due to no-ACK.

xdaco commented

@fenugrec
" Without ACK , it should normally switch to a "bus off" state and stop sending frames , then probably go back to pre-operational state ?"

This is not true when the PDOs are mapped as periodic, which was the case for me. The slave does not care whether any other device or master is present on the bus; it will keep sending the frames.

I know what you mean, but I was thinking of the slave's low-level CAN implementation, which should revert to the "bus off" state as defined in CAN / ISO 11898. But I guess once the other device powers up, the slave eventually returns to active state and continues sending frames?
I think I have a device here that is more... 'perseverant' than candlelight and continues to send CAN traffic even alone on the bus.

@brian-brt have you ever reproduced this issue ? I tried again here with three devices on the same bus :

  • machine 1, cangen can0
  • machine 1, candump can1
  • machine 2, candump can2 (well it's actually "can0" on that machine, but to make things unambiguous)

then I suspend machine 2, resume, bring can2 up again, and everything is back to normal.

I want to duplicate this issue, especially since it seems to be intermittent - there may be something else going on that needs to be looked at.

I am running into this (same I think) issue, even on d13b6d. I can reproduce it by:

  1. Automotive device powered on, begins sending on the can bus
  2. Power on PC with canable device, can0 up
  3. Verify that the rx packets/bytes are increasing steadily
  4. reboot -now
  5. can0 up
  6. rx packets/bytes stay at 0
  7. Power off PC and power it back on
  8. can0 up
  9. Same as 3, rx packets/bytes are increasing steadily

Willing to test, let me know how I can help!

@KeithBoden thanks for the reminder, I was forgetting to try a reboot. Suspend + resume was working fine. So I've finally managed to reproduce this... After a reboot, I broke in with the debugger, by chance in queue_pop_front:

(gdb) i s
#0  0x08001214 in disable_irq () at /home/q/d/can/candleLight_fw/src/util.c:31
#1  0x08001154 in queue_pop_front (q=0x20000a10) at /home/q/d/can/candleLight_fw/src/queue.c:92
#2  0x0800142c in main () at /home/q/d/can/candleLight_fw/src/main.c:109

Ok, single-stepped out of queue_pop_front: it returned 0? Weird... Looked around a bit at the hCAN flags:

(gdb) p *hCAN.instance 
$16 = {MCR = 0x44, MSR = 0xc08, TSR = 0x1c000000, RF0R = 0x1b, RF1R = 0x0, IER = 0x0, ESR = 0x0, 
  BTR = 0x1c0005, RESERVED0 = {0x40d61921 <repeats 88 times>}, sTxMailBox = {{TIR = 
.... trimmed boring part

So FIFO0 is full and overflowed according to RF0R, but otherwise probably behaving fine (FIFO1 is not used because no hardware filtering is configured, which is the only way to direct frames to FIFO1)
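For reference, my decode of that RF0R value, going by the bxCAN register map in RM0091 (the bit names below are from the manual, not from project code):

#include <stdbool.h>
#include <stdint.h>

/* RF0R = 0x1B = 0b1_1011:
 *   FMP0[1:0] = 3      -> 3 messages pending, i.e. FIFO0 at its maximum depth
 *   FULL0 (bit 3) = 1  -> FIFO0 full
 *   FOVR0 (bit 4) = 1  -> FIFO0 overrun: at least one frame has been dropped */
static inline void decode_rf0r(uint32_t rf0r, uint32_t *pending, bool *full, bool *ovr)
{
	*pending = rf0r & 0x3u;       /* FMP0 */
	*full    = rf0r & (1u << 3);  /* FULL0 */
	*ovr     = rf0r & (1u << 4);  /* FOVR0 */
}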

Then, a quick look at the queues:

(gdb) p *q_frame_pool 
$22 = {max_elements = 0x40, first = 0x12, size = 0x0, buf = 0x20000908}
(gdb) p *q_from_host
$23 = {max_elements = 0x40, first = 0x0, size = 0x0, buf = 0x20000a28}
(gdb) p *q_to_host
$24 = {max_elements = 0x40, first = 0x0, size = 0x3e, buf = 0x20000b48}
(gdb) 

[EDIT - I originally misunderstood part of the queue mechanism, and changed the following comments]
As I understand it, q_frame_pool is "empty" because all the frame buffers are in q_to_host. Not sure why size = 0x3E; some frames got lost somewhere - I'd expect to see 0x40, i.e. CAN_QUEUE_SIZE.
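To make that dump easier to read, here is my mental model of the queue plumbing; the struct layout is inferred from the gdb printout and the flow is how I read main.c, so treat it as a sketch rather than the authoritative declaration:

#include <stdint.h>

/* Shape of the queue as it appears in the gdb dump (field names taken from the
 * printout; the real declaration in queue.h may differ in types / order). */
typedef struct {
	uint32_t  max_elements;  /* capacity; 0x40 == CAN_QUEUE_SIZE */
	uint32_t  first;         /* index of the oldest element */
	uint32_t  size;          /* number of elements currently queued */
	void    **buf;           /* ring buffer of frame pointers */
} queue_sketch_t;

/* Frame buffers circulate roughly like this:
 *   q_frame_pool --(CAN RX fills a frame)--> q_to_host --(USB IN transfer completes)--> back to q_frame_pool
 * (q_from_host holds frames coming from the host for transmission on the bus.)
 * In the dump, q_frame_pool.size == 0 and q_to_host.size == 0x3e: nearly every
 * buffer is parked in q_to_host, waiting for USB IN transfers that never complete. */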

xdaco commented

We are affected by this issue again, but for our use case it is now happening much less often.

I'm digging into this some more. It appears that after a reboot, USBD_GS_CAN_DataIn() is no longer called, which is the only place where TxState is cleared. With TxState stuck at 1, no packets can make it to the host, of course.

Not sure why USBD_GS_CAN_DataIn() stops being called though, even though the device re-enumerates properly (my breakpoint was not to blame, it triggered fine before the reboot).
I thought maybe the EP got stuck in a stalled state, but I wasn't able to verify this - a conditional breakpoint on USBD_LL_StallEP causes enumeration to fail (probably too much delay).
To be continued...
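To spell out the deadlock as I currently understand it, here is an illustration of the symptom (identifiers are paraphrased from the gs_usb class code, not copied verbatim; this is not a proposed fix):

#include <stdint.h>

/* Minimal stand-in for the USB class handle; the real one lives in the gs_usb
 * class driver and has many more fields. */
struct gs_can_handle_sketch {
	volatile uint8_t TxState;   /* 0 = IN endpoint idle, 1 = transfer in flight */
};

/* Called from the main loop when a frame is waiting in q_to_host: */
static void try_send_to_host(struct gs_can_handle_sketch *h)
{
	if (h->TxState == 0) {
		h->TxState = 1;
		/* ...start the USB IN transfer (USBD_LL_Transmit)...
		 * TxState is cleared only in USBD_GS_CAN_DataIn(), the
		 * transfer-complete callback. If the host reboots mid-transfer,
		 * that callback never fires, TxState stays at 1 forever, and no
		 * further frames reach the host -- matching the full q_to_host
		 * seen in the dump above. */
	}
}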

Using the firmware here:
https://canable.io/builds/candlelight-firmware/gsusb_canable_68df7d5.bin

The issue reported above is present.

A "sudo reboot" of the host device as opposed to a hardware power cycle results in a "hung" canable.

The only remedy appears to be a physical detachment and reattachment of the canable followed by the ip link set up command.

I am also facing this issue. The fix here seems to have reduced the frequency of the issue but not completely eliminated it. Has anyone found a solution yet?
In my application the host computer has to be restarted (sudo reboot) after a remote update, but my canable device hangs after the reboot because of this :(

@GaryWSmith , @smalik007 , I feel your pain, but you need to help me here.

@GaryWSmith : that 68df7d5 build is old (anything before March 2021 has the old USB stack). We don't have a functional CI setup to provide builds here, so you need to compile it yourself.

@smalik007 thanks for testing PR #51. But did you try it as-is, or applied to current master? The PR is also pre-USB-update; I have rebased it on a temporary branch on my fork: https://github.com/fenugrec/candleLight_fw/tree/rebootfix

@fenugrec I just compiled/flashed/tested 08ab6d2 on the rebootfix branch, but am seeing the same results:

  1. Automotive device powered on, begins sending on the can bus
  2. Power on PC with canable device, can0 up
  3. Verify that the rx packets/bytes are increasing steadily
  4. reboot -now
  5. can0 up
  6. rx packets/bytes stay at 0
  7. Disconnect canable from USB, plug it back in
  8. can0 up
  9. Same as 3, rx packets/bytes are increasing steadily

Happy to test any time and many times!

@fenugrec , yes, I tested PR #51 applied on top of the latest master branch and compiled the binaries locally. Since the issue comes up randomly when rebooting the host system, I have written a python script and added it to my cron job on reboot. The script keeps rebooting the host computer as long as CAN messages are being received, and as soon as the CAN device hangs it sends me an email. That way I can usually reproduce the issue within an hour or so.

Closed by PR #94 !