EnhancedRadioDevices/915MHzEdisonExplorer_SW

production explorer seems to timeout after one read

Opened this issue · 16 comments

I initialize my code with an mmtune which is successful - say db=-47

at the start of my loop I run a pump status request via decocare ie tweak ReadPumpStatus
on the first try this returns no problem and I continue to read history temp basal etc etc all no problem
I then sleep for about 4 min until the next loop

since switching to the production board on my first attempted ReadPumpStatus (for clarity I dont do an mmtune again unless needed so in this case no mmtune) im getting the following error
"ERROR:mmeowlink.handlers.stick:Timed out or other comms error - Received an error response Timeout - retrying: 1 of 3\nERROR:mmeowlink.handlers.stick:Timed out or other comms error - Received an error response Timeout - retrying: 2 of 3\nERROR:mmeowlink.handlers.stick:Timed out or other comms error - Received an error response Timeout - retrying: 3 of 3\nERROR:mmeowlink.handlers.stick:1482372001.91 - Response not received - retrying\n"

its quite repeatable and doesnt look like a freq issue rather I cant reach the board
I then tried just a board reset - doesnt seem to work
I do an mmtune and it works again, and subsequent commands are fine - until after my next sleep then the problem occurs again

so now Im basically mmtuning each loop which Id prefer not too jus to save battery and time - any advice? Thanks!

mogar commented

@kenstack Did this happen with the prototype board, or just the new production board? Are you still using the default firmware, or did you switch to the firmware with sleep enabled?

@mogar with the Prototype this would happen occasionally - but this is repeatable with the production board. I'm using the firmware that came with the board - @bustavo mentioned that the first command he runs fails and then it works - I'll try to see if that is my issue as well

@mogar @bustavo I did some further testing by just running mmtune and mmeowlink-send commands to the board without my loop on. Here is what I can determine.

  1. It seems I get these errors after a sleep or no interaction with the edison, ie just sitting in terminal for more then 60-90 sec after a successful execution of mmtune or mmeowlink-send - if I keep sending commands it seems it stays awake

  2. its not completely repeatable. if for example I run an mmtune followed by a mmeowlink-send for say something simple like ReadPumpStatus, sometimes it runs fine after a sleep and sometimes not

  3. per @bustavos similar observation - and here is the key - after I get a fail on mmeowlink-send, I send the same command again. I always get another failure. However, when I send it a 3rd time, it always works. so the act of sending the command twice after a fail works.

so is mmeowlink somehow waking the board? strange it would need me to send it twice - Im wondering is there some delay parameter or something being used for power saving?

any insights much appreciated

here is an example - Ive tried it about 7 or 8 times and whenever I get the fail, if I send the same command 2 more times on the last try it works

root@nhex2:/home/edison/nodehawk# /home/edison/mmeowlink-source/bin/mmeowlink-send.py --init --serial 111111 --port spi_serial tweak ReadPumpStatus
ERROR:mmeowlink.handlers.stick:Timed out or other comms error - Received an error response Timeout - retrying: 1 of 3
ERROR:mmeowlink.handlers.stick:Timed out or other comms error - Received an error response Timeout - retrying: 2 of 3
ERROR:mmeowlink.handlers.stick:Timed out or other comms error - Received an error response Timeout - retrying: 3 of 3
ERROR:mmeowlink.handlers.stick:1483662313.14 - Response not received - retrying
Traceback (most recent call last):
File "/home/edison/mmeowlink-source/bin/mmeowlink-send.py", line 7, in
SendMsgApp().run(None)
File "/usr/local/lib/python2.7/dist-packages/decocare/helpers/cli.py", line 113, in run
self.prelude(args)
File "/home/edison/mmeowlink-source/mmeowlink/cli/base_mmeowlink_app.py", line 39, in prelude
self.pump.power_control(minutes=args.session_life)
File "/home/edison/mmeowlink-source/mmeowlink/handlers/stick.py", line 231, in power_control
raise CommsException("No acknowledgement from pump on wakeup. Is it out of range or is the battery too low?")
mmeowlink.exceptions.CommsException: No acknowledgement from pump on wakeup. Is it out of range or is the battery too low?
root@nhex2:/home/edison/nodehawk# /home/edison/mmeowlink-source/bin/mmeowlink-send.py --init --serial 111111 --port spi_serial tweak ReadPumpStatus
Traceback (most recent call last):
File "/home/edison/mmeowlink-source/bin/mmeowlink-send.py", line 7, in
SendMsgApp().run(None)
File "/usr/local/lib/python2.7/dist-packages/decocare/helpers/cli.py", line 113, in run
self.prelude(args)
File "/home/edison/mmeowlink-source/mmeowlink/cli/base_mmeowlink_app.py", line 26, in prelude
self.link = link = LinkBuilder().build(args.radio_type, port)
File "/home/edison/mmeowlink-source/mmeowlink/link_builder.py", line 16, in build
return SubgRfspyLink(port)
File "/home/edison/mmeowlink-source/mmeowlink/vendors/subg_rfspy_link.py", line 55, in init
self.open()
File "/home/edison/mmeowlink-source/mmeowlink/vendors/serial_interface.py", line 28, in open
self.check_setup()
File "/home/edison/mmeowlink-source/mmeowlink/vendors/subg_rfspy_link.py", line 72, in check_setup
self.serial_rf_spy.sync()
File "/home/edison/mmeowlink-source/mmeowlink/vendors/serial_rf_spy.py", line 121, in sync
raise CommsException("Could not get subg_rfspy state or version. Have you got the right port/device and radio_type?")
mmeowlink.exceptions.CommsException: Could not get subg_rfspy state or version. Have you got the right port/device and radio_type?
root@nhex2:/home/edison/nodehawk# /home/edison/mmeowlink-source/bin/mmeowlink-send.py --init --serial 111111 --port spi_serial tweak ReadPumpStatus
GOT MODEL ReadPumpModel:size[64]:data:'723'
<class 'decocare.commands.ReadPumpStatus'> {}
response: ReadPumpStatus:size[64]:data:{'status': 'normal', 'bolusing': False, 'suspended': False}
hexdump:

0000   0x03 0x00 0x00 0x00 0x00 0x00 0x00 0x00    ........
0008   0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00    ........
0010   0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00    ........
0018   0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00    ........
0020   0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00    ........
0028   0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00    ........
0030   0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00    ........
0038   0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00    ........

decoded:

{'status': 'normal', 'bolusing': False, 'suspended': False}
cjo20 commented

This is the issue that I'm currently trying to debug too. Definitely something weird going on. It seems to be repeatable using the prototype board. If there has been a gap since the last communications, I get the No acknowledgement message, followed by the Could not get subg_rfspy state or version, then a successful run.

I got as far as hooking up a logic analyser and writing a (partial) decoder for communications over the spi link.

First attempt at pump model:

>>> GET_STATUS
0.002802 seconds since last message
<<< ['0x4f', '0x4b', '0x0']
<<< OK
0.004588 seconds since last message
>>> GET_VERSION
0.001791 seconds since last message
<<< ['0x73', '0x75', '0x62', '0x67', '0x5f', '0x72', '0x66', '0x73', '0x70', '0x79', '0x20', '0x30', '0x2e', '0x38', '0x0']
<<< subg_rfspy 0.8
0.009818 seconds since last message
>>> GET_VERSION
0.001918 seconds since last message
<<< ['0x73', '0x75', '0x62', '0x67', '0x5f', '0x72', '0x66', '0x73', '0x70', '0x79', '0x20', '0x30', '0x2e', '0x38', '0x0']
<<< subg_rfspy 0.8
0.014673 seconds since last message
>>> SEND_AND_LISTEN Send Channel: 0 Send Repeat: 0 Send Retry Delay: 0 Read Channel: 0 Read Timeout: 500 Read Retry: 0
	Send Data: []
0.531553 seconds since last message
<<< ['0xaa', '0x0']
<<< ª
0.004374 seconds since last message
>>> SEND_AND_LISTEN Send Channel: 0 Send Repeat: 0 Send Retry Delay: 0 Read Channel: 0 Read Timeout: 500 Read Retry: 0
	Send Data: []
0.536367 seconds since last message
<<< ['0xaa', '0x0']
<<< ª
1.011105 seconds since last message
>>> SEND_AND_LISTEN Send Channel: 0 Send Repeat: 0 Send Retry Delay: 0 Read Channel: 0 Read Timeout: 500 Read Retry: 0
	Send Data: []
0.539493 seconds since last message
<<< ['0xaa', '0x0']
<<< ª
2.011227 seconds since last message
>>> SEND_PACKET Send Channel: 0 Send Retry: 249 Send Retry Delay: 0
	Send Data: []
1.012831 seconds since last message
>>> SEND_PACKET Send Channel: 0 Send Retry: 249 Send Retry Delay: 0
	Send Data: []
6.826939 seconds since last message
<<< ['0x0']
<<< 
>>> GET_PACKET Channel: 0 Timeout: 500

At this point I get the message about pump not acknowledging things.

Second attempt at pump model:

<<< ['0x0']
<<< 
>>> GET_STATUS
0.003876 seconds since last message
>>> GET_VERSION
0.001933 seconds since last message
<<< ['0x73', '0x75', '0x62', '0x67', '0x5f', '0x72', '0x66', '0x73', '0x70', '0x79', '0x20', '0x30', '0x2e', '0x38', '0x0']
<<< subg_rfspy 0.8

This is the "Could not get status" run

The final, successful run:

>>> GET_STATUS
0.002722 seconds since last message
<<< ['0x4f', '0x4b', '0x0']
<<< OK
0.006509 seconds since last message
>>> GET_VERSION
0.001843 seconds since last message
<<< ['0x73', '0x75', '0x62', '0x67', '0x5f', '0x72', '0x66', '0x73', '0x70', '0x79', '0x20', '0x30', '0x2e', '0x38', '0x0']
<<< subg_rfspy 0.8
0.009506 seconds since last message
>>> GET_VERSION
0.001569 seconds since last message
<<< ['0x73', '0x75', '0x62', '0x67', '0x5f', '0x72', '0x66', '0x73', '0x70', '0x79', '0x20', '0x30', '0x2e', '0x38', '0x0']
<<< subg_rfspy 0.8
0.014021 seconds since last message
>>> SEND_AND_LISTEN Send Channel: 0 Send Repeat: 0 Send Retry Delay: 0 Read Channel: 0 Read Timeout: 500 Read Retry: 0
	Send Data: []
0.147935 seconds since last message
<<< ['0x53', '0x25', '0xa9', '0x68', '0xd9', '0xd1', '0x65', '0x9a', '0x68', '0xd5', '0x59', '0x56', '0x38', '0xe5', '0x8e', '0x58', '0xf4', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x95', '0x55', '0x0']
<<< S%©hÙÑe�hÕYV8å�XôUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU�U
0.064423 seconds since last message
>>> SEND_AND_LISTEN Send Channel: 0 Send Repeat: 0 Send Retry Delay: 0 Read Channel: 0 Read Timeout: 500 Read Retry: 0
	Send Data: []
0.152358 seconds since last message
<<< ['0x53', '0x26', '0xa9', '0x68', '0xd9', '0xd1', '0x65', '0x9a', '0x68', '0xd5', '0x59', '0x56', '0x38', '0xe5', '0x8e', '0x58', '0xf4', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x95', '0x55', '0x0']
<<< S&©hÙÑe�hÕYV8å�XôUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU�U
0.064638 seconds since last message
>>> SEND_AND_LISTEN Send Channel: 0 Send Repeat: 0 Send Retry Delay: 0 Read Channel: 0 Read Timeout: 500 Read Retry: 0
	Send Data: []
0.149744 seconds since last message
<<< ['0x53', '0x27', '0xa9', '0x68', '0xd9', '0xd1', '0x65', '0x9a', '0x68', '0xd5', '0x59', '0x56', '0x38', '0xe5', '0x8e', '0x58', '0xf4', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x55', '0x95', '0x55', '0x0']
<<< S'©hÙÑe�hÕYV8å�XôUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU�U

That's probably as clear as mud, I'm working on tidying things up.

cjo20 commented

I suspect "double failure" is the first failure leaving the cc1110 in a bad state of some sort.

>>> SEND_PACKET Send Channel: 0 Send Retry: 249 Send Retry Delay: 0
	Send Data: ['0xa9', '0x68', '0x00', '0x00', '0x00', '0x00', '0x94', '0xd5', '0x55', '0x69', '0x60', '0x0']
	Decoded: ['0xa7', '0x00', '0x00', '0x00', '0x5d', '0x0', '0x87']

Here a 0x5d command is being sent (250 times). I believe that is a PowerControl command

6.821310 seconds since last message. 152 empty packets. RW

Over 6 seconds later, after both the edison and the cc1110 have said "I've got no data to send" 152 times, both happen to decide they need to send something at the same time.

EDIT: It looks like there is a sleep() of almost 6 seconds before the app tries to read a response in order to give the chip time to send the 500 requests. Not sure if the sleep is required.

<<< ['0x0']
>>> GET_PACKET Channel: 0 Timeout: 500

The CC1110 sends back 1 null byte, and the Edison requests a packet with a 500ms timeout

2.080431 seconds since last message. 1 empty packets. RW

2 seconds later, the packet request clearly timed out. I think this caused the "no acknowledgement" error message and the previous pump model request to die. After sending another "get the model" command via the command line, The CC1110 decides it has something to send (another NULL byte) at the same time as the Edison requests the subg status.

<<< ['0x0']
>>> GET_STATUS
0.005256 seconds since last message. 1 empty packets.  W
>>> GET_VERSION

After sending the GET_STATUS request, the edison asks if the cc1110 has any data to send back and it declines. The Edison then tries to get the version. The lack of a status response here is what gives the Could not get subg_rfspy state or version error.

cjo20 commented

I think this might be caused by https://github.com/oskarpearson/mmeowlink/blob/dev/mmeowlink/handlers/stick.py#L182

I've tried changing the delay from 0.016 seconds to 0.02 seconds and it seems to have changed the behaviour slightly. Needs more testing.

cjo20 commented

Been running for a while with 0.02 seconds and it seems far better behaved. Still get the occasional 'no ack' errors, but the 'Could not get state' errors seem to have disappeared. Changed the value back to 0.016 and could immediately reproduce the 'double error'

I would guess that issuing a "GET_PACKET" while the cc1110 is still processing the previous SEND_PACKET is causing it to misbehave. If that is the case, it could either be fixed in this case by changing the wait period in mmeowlink, or causing GET_PACKET to return a specific status message if the cc1110 is still busy sending.

I'm still unsure why the intermittent 'no ack' messages appear.

@cjo20 very cool - where did you make the change? I can try it tomorrow out of pocket today

ehh never mind I see the link above thanks

mogar commented

@cjo20 Currently the subg_rfspy firmware will return an 'OK' status as long as it's on and in a good state. We could change that to respond with one of ['OK', 'TX', 'RX', etc.] and you could check status before trying to issue a new command. In that case you'd have to do a bit more to handle state in mmeowlink, but it would fail more gracefully. Thoughts?

cjo20 commented

@mogar Having had a chat about how mmeowlink works, I'm not sure this will be necessary to solve the problem. I'll try to look in to it more.

@ps2 - I was wondering if you had any idea why this is happening? It doesn't seem consistent, unfortunately. Chris has a trace of the packets above.

In the mmeowlink code I have a 'sleep factor' that tries to avoid talking to subg_rfspy while it's still transmitting packets. It seems that there's some problem with the logic there. There's been some discussion on oskarpearson/mmeowlink about it that provides background and links etc.

@cjo20 I tried the change last night for a bit - I think it helped a little bit, but was hard to say it was a lot better - still got lots of issues

cjo20 commented

@kenstack I found that it stopped the double-error problem, which let the loop run some of the time. I'm still using a custom loop, which runs subg-ww-radio-parameters and mmtune every time I think pump comms failed. Typically, I get one "good" loop just after running that, then a "bad" one, which triggers a retune. I think I might have discovered the cause of mmeowlink reporting no acknowledgement:

@mogar
Something I just put on mmeowlink gitter:
https://github.com/EnhancedRadioDevices/subg_rfspy/blob/master/commands.c#L70 That might be what is causing some of the problems. That'll be that 0x0 I'm seeing. If that's doing what I think it's doing, SEND_PACKET isn't async, it sends the packets, then sends back a 0x0 to say it's done

Reading a packet just looks at what comes back over serial. I suspect if you issue a GET_PACKET before SEND_PACKET finishes, the GET_PACKET gets the first thing that comes back over serial. Which is 0x0, which means "end of packet". So it thinks it got an empty packet, which isn't a valid response to a power control command.

I don't think this needs more complexity in the firmware to fix the issue; it should be possible to change mmeowlink to wait for the 0x0 response at the end of SEND_PACKET before issuing the GET_PACKET. I'll see what @oskarpearson suggets.

ps2 commented

Looking at the logs that @cjo20 posted, it looks like the client is not waiting long enough after the first SEND_PACKET with 259 repeat; I don't see a response there, and the next SEND_PACKET gets sent out 1s later. The client should wait for the '\0' to come back over the connection, and have a longer timeout. It takes a while to send that many packets.

cjo20 commented

I've currently got a PR for mmeowlink which seems to mitigate some of the errors seen here: https://github.com/oskarpearson/mmeowlink/pull/57/files

I now get errors where the software is getting back NULL data (firmware returns 0xaa, timeout) or a bad CRC. I've not yet been able to track down what is causing those responses.

Im seeing a similar error on production explorer with older edison board
"Mar 03 03:27:43 smartpump2 Timed: out or other comms error - Received an error response Timeout - retrying: 1 of 3
Mar 03 03:27:43 smartpump2 Timed: out or other comms error - Received an error response Timeout - retrying: 2 of 3
Mar 03 03:27:45 smartpump2 Timed: out or other comms error - Received an error response Timeout - retrying: 3 of 3"