candle-usb/candleLight_fw

No Can Frames received after device reset without a power cycle. candleLight Firmware.

Closed this issue · 7 comments

ieb commented

Setup is a candleLight on a f07 chip.
I am interacting with the device using libusb on OSX (no kernel drivers are available). The client program uses nodeJS and so is single threaded down to the libusb layer, perhaps below that.
Everything works perfectly on a CAN bus with an ESP32/TJ1050 sending simulated frames and the candleLight in node mode (so the ESP32 gets an ACK) at 250K with no errors for > 24h, busload about 10%, with the LEDs flashing as expected.

Receive is using polling in libusb with a 200ms timeout, which eventually calls lib_usb_transfer.

When I exit the client program normally, I stop the transfers by waiting for the 1 transfer in flight to time out, then send a device reset, after which I release the USB interface and finally close the USB device before exiting.

If I start the client program up again, all the control commands return ok and I can interact normally with the device (eg send identify acts normally), but no CAN frames are received and the receive lib_usb_transfer reports timeouts. The LEDs are both on and steady with no flashing, indicating that nothing is being received from the hardware CAN interface in can.c (if I read the code correctly).

If I power cycle the device, and try again, it works perfectly again.

I have looked at the gs_usb kernel driver and as far as I can tell I am doing exactly the same reset procedure. Is there something more I need to do during shutdown or startup to get the device into its power-up state before starting to receive? If the LEDs were flashing I would suspect OSX or libusb, but since they are fixed and steady I suspect that something in the candleLight fw has not reset, or that a fifo has overflowed during the shutdown sequence. (I will try to stop the ESP32 sending before shutdown to see if a busload of 0 eliminates the behaviour.)

Reset sequence is here https://github.com/ieb/candleLightJS/blob/main/gsusb.js#L382

The FW I am using is https://github.com/ieb/candleLight_fw, which is currently at your HEAD.
I can load modified fw if you have any suggestions ?

ieb commented

Confirmed that if I stop the ESP32 from putting any frames onto the bus, and drain all messages from the candleLight device before I shutdown, then I can run the client again with no power cycle. This also works if I start the ESP32 sending frames before I re-enable the candleLight device.

Will try changing the stop sequence to reset the CAN device and then drain, in case it's possible to stop receiving from the bus first and then drain the buffers.

ieb commented

The correct shutdown sequence is:

  1. Disable the HW CAN interface in the STM (ie https://github.com/candle-usb/candleLight_fw/blob/master/src/can.c#L137)
  2. Drain any pending frames by polling the USB interface (lib_usb_transfer) to ensure that all messages are transferred off the device. A timeout on lib_usb_transfer indicates that nothing is left.
  3. Release the USB interface.
  4. Close the USB device.
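
The ordering above can be sketched in JavaScript roughly as follows. This is only an illustration under assumptions: `dev` and its methods (`stopCan`, `readFrame`, `releaseInterface`, `close`) are hypothetical names, not the candleLightJS or libusb API, and `readFrame` is assumed to resolve to `null` when the 200ms poll times out. A fake device is included so the ordering can be exercised without hardware.

```javascript
// Sketch only: `dev` is a hypothetical wrapper, not a real library API.
// Assumed methods: stopCan(), readFrame(timeoutMs) -> frame, or null on timeout,
// releaseInterface(), close().
async function shutdown(dev, timeoutMs = 200) {
  // 1. Stop the CAN peripheral first so no new frames enter the device queues.
  await dev.stopCan();

  // 2. Drain: keep reading until a read times out, i.e. nothing is pending.
  let drained = 0;
  while ((await dev.readFrame(timeoutMs)) !== null) {
    drained += 1;
  }

  // 3. and 4. Only now release the interface and close the device.
  await dev.releaseInterface();
  await dev.close();
  return drained;
}

// Fake device that records the order of calls and hands out queued frames.
function makeFakeDevice(pending) {
  const calls = [];
  return {
    calls,
    async stopCan() { calls.push("stopCan"); },
    async readFrame() {
      calls.push("readFrame");
      return pending.length > 0 ? pending.shift() : null;
    },
    async releaseInterface() { calls.push("releaseInterface"); },
    async close() { calls.push("close"); },
  };
}
```

Calling `shutdown(makeFakeDevice([...]))` resolves to the number of frames drained. The point of the sketch is the order: the drain loop runs after the CAN hardware is stopped and before the interface is released.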

fixed for me in ieb/candleLightJS@c36eb30
The kernel gs_usb driver probably disables the CAN HW [1] and then disables transfers [2], which polls till there is nothing pending.

HTH others,
Sorry for the noise, nothing wrong with the candleLight FW.

[1] https://github.com/torvalds/linux/blob/master/drivers/net/can/usb/gs_usb.c#L1073
[2] https://github.com/torvalds/linux/blob/master/drivers/net/can/usb/gs_usb.c#L1081

Hey @ieb,

the kernel driver first stops sending, kills all RX URBs, then kills all TX related URBs and finally sends a reset command to the device.

I think we don't handle the reset in the firmware as good as we can or should :)

ieb commented

With the reset last, isn't there a risk on a busy bus that the firmware gets into the same bad state as I was seeing ?

I've not seen the firmware get into a bad state with the kernel driver, but then I suspect the kernel driver doesn't disconnect from the usb device until it's unplugged, which would power cycle it anyway.

Or is the delay between killing all URBs and the reset short enough to stop frames from the CAN hw getting to the fw before the reset? I am sure my user space code is far slower than the kernel driver.

Even though I have a fix, I would be happy to experiment with a better fw reset if there are pointers. Is the datasheet a good place to start ?

> With the reset last, isn't there a risk on a busy bus that the firmware gets into the same bad state as I was seeing ?

Yes, maybe. I thought the kernel driver did a "can_disable" first , but I haven't looked closely at that part in a while. At a first glance it would make sense to stop the CAN peripheral as early as possible, and then take care of the queues on the device, and the URBs on the host side. Depending on anything being "fast enough" to not have race conditions is a bad idea obviously !

> I think we don't handle the reset in the firmware as good as we can or should :)

In any case, I agree with Marc ^

> I've not seen the firmware get into a bad state with the kernel driver

There are IIRC some edge cases related to full queues on the device (e.g. after a bus disconnect), I saw some weirdness last time I tried but didn't investigate.

> Even though I have a fix, I would be happy to experiment with a better fw reset if there are pointers. Is the datasheet a good place to start ?

For the full picture, the gs_usb kernel driver that you've already found, this repo, and a reference manual for any of the supported devices should be sufficient. The F0x2 targets have the same bxCAN peripheral, so I suggest e.g. RM0091. The more modern chips with the CANFD peripheral may be more complex, with slightly different semantics.

ieb commented

I have modified my firmware to add metrics which can be retrieved with a control message. When the received messages are not drained after disabling the CAN HW, the call at https://github.com/ieb/candleLight_fw/blob/withFilters/src/main.c#L150 always returns null, so the counter here https://github.com/ieb/candleLight_fw/blob/withFilters/src/main.c#L181 equals the counter here https://github.com/ieb/candleLight_fw/blob/withFilters/src/main.c#L108.

I think the pool of unused frames between the device and usb host gets exhausted, rather than there being any problem receiving from the CAN HW.
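
To make that hypothesis concrete, here is a toy model in JavaScript. It is not the firmware's code: the class, its fields and the pool size are all assumptions made for illustration. It only shows how a fixed pool of frame buffers runs dry when frames keep arriving from the bus while the host reads nothing.

```javascript
// Toy model, not the firmware: a fixed pool of frame buffers shared between
// the CAN receive path and the USB send path.
class FramePool {
  constructor(size) {
    this.free = size;      // buffers available to the CAN receive path
    this.queued = 0;       // buffers holding frames that wait for the host
    this.noPoolFrame = 0;  // times a buffer was requested but none was free
  }

  // A frame arrives from the bus and needs a free buffer.
  canReceive() {
    if (this.free === 0) {
      this.noPoolFrame += 1;
      return false;
    }
    this.free -= 1;
    this.queued += 1;
    return true;
  }

  // The host reads one frame, so its buffer goes back to the pool.
  hostRead() {
    if (this.queued === 0) return false;
    this.queued -= 1;
    this.free += 1;
    return true;
  }
}

// Feed `frames` bus frames into the pool; the host reads one frame after
// each arrival only when `hostIsReading` is true.
function runBurst(pool, frames, hostIsReading) {
  for (let i = 0; i < frames; i++) {
    pool.canReceive();
    if (hostIsReading) pool.hostRead();
  }
  return pool;
}
```

In this model, `runBurst(new FramePool(4), 10, false)` ends with no free buffers and `noPoolFrame` at 6, while the same burst with `hostIsReading` set to `true` never fails to get a buffer.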

This is an example of the difference between counters at 500ms intervals, reported by a client polling when the device is stalled:

{"main_loop":1247,"send_to_host":0,"recv":0,"no_recv":0,"no_pool_frame":1246,"error":0,"no_error":0,"spare":0}

and this is the difference in counters over the same 500ms when operating normally:

{"main_loop":4161,"send_to_host":2508,"recv":10,"no_recv":0,"no_pool_frame":0,"error":0,"no_error":4152,"spare":0}

Metric names are the same as the members of struct USBD_GS_CAN_Metrics, all in main.c
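
A client could compute such differences and flag the stalled pattern along these lines. This is a sketch under assumptions: it supposes the control message returns cumulative counters as a plain object keyed by the metric names above, and the 90% threshold in `looksStalled` is an arbitrary heuristic chosen for illustration, not something taken from the firmware.

```javascript
// Difference between two cumulative counter snapshots, key by key.
// Keys missing from the previous snapshot are treated as 0.
function counterDelta(prev, curr) {
  const delta = {};
  for (const key of Object.keys(curr)) {
    delta[key] = curr[key] - (prev[key] ?? 0);
  }
  return delta;
}

// Heuristic (an assumption): the main loop is running, nothing went to the
// host, and nearly every loop iteration failed to get a frame from the pool.
function looksStalled(delta) {
  return (
    delta.main_loop > 0 &&
    delta.send_to_host === 0 &&
    delta.no_pool_frame >= 0.9 * delta.main_loop
  );
}
```

Applied to the two differences quoted above, `looksStalled` returns `true` for the stalled sample (1246 of 1247 loop iterations found no pool frame and `send_to_host` is 0) and `false` for the normal one.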

Will try and find a way of making the reset safe so it doesn't exhaust the pool.

But send_to_host is also 0, indicating that the USB host is not advertising that it is ready to receive data. That should happen regardless of the state of the FW, so the problem may have nothing to do with the candleLight FW and may instead be something on the USB host. This is Apple hardware, so no gs_usb kernel support, but the device works normally running in a Linux VM. Could be libusb, or the WebUSB library implementation.

ieb commented

Confirmed, my problem is with libusb on darwin and the way in which the usb layer inside darwin works.

Nothing to do with the candleLight firmware, which behaves perfectly.

There are some notes in the libusb documentation about clearing halts without a device reset putting the darwin kernel into a bad state. Unplugging the USB device clears that state, and draining messages before closing ensures it never gets into a halt state in the first place.