sinara-hw/Booster

VCP power measurements not updating correctly

Closed this issue · 33 comments

On the v1.4 hw we have I just noticed something interesting.

  • monitoring the power via the VCP continuous parameter dump (start)
  • when the interlock tripped for one channel, the reported TX power didn't update
  • I used the SCPI interface to query the power. The first time, it still reported a high power
  • after running it again, the reported power dropped on both the SCPI and VCP interfaces

After clearing the interlock, I saw a similar thing:

  • the VCP still reported a low power
  • the SCPI interface also reported a low power the first time I queried it
  • after querying the SCPI interface a second time, it updated correctly and the VCP interface started showing a high power again

hmm...no this isn't just the interlock. It seems that sometimes the power detector just stops updating in SW. e.g. if I change the RF power the VCP still reports the same power even after 10s or more. The SCPI also shows the previous power the first time I call it, and only updates when I call it a second time. Not sure if a power cycle clears this state.

hmm...even updating the interlock threshold seems to cause Booster to start updating correctly again

Yes, the effect is strongly dependent on the length of cable at Booster's output so most likely interference from the double reflection.

Hmm, channel measurements are taken when channel power is enabled. That's the only limiting factor which can stop measurement task from running on selected channel. But again, if channel is interlocked, only RF switch is disconnected, channel has still enabled power

I'll write some code to fuzz Booster soon. That should help to identify any remaining issues like this.

But again, if channel is interlocked, only RF switch is disconnected, channel has still enabled power

NB this even happened after the interlock was disconnected (see top post). Not sure if Booster stoped updating the power, or if the updates didn't make it to the VCP/SCPI interface. But, as I said, there is definitely a bug here somewhere; my power meter told me the power had changed, but the VCP interface lied and said it was the same. Asking the SCPI interface gave me a lie the first time, after which it started telling the truth...

ok, I'll check it

any thoughts about this one? Otherwise I'll see if I can reproduce it with a HITL test (will take a day or two)

@wizath could this be related to the I2C issues? If so, why would asking the power via the SCPI interface fix the issue? Does the SCPI interface trigger an I2C transaction, or just read from a previously stored value in memory?

@wizath did you think of any way that this could occur? Why does a scpi read affect the vcp issues? Could this be linked to i2c errors somehow? Does an scpi power query trigger an i2c transaction?

Why would the power measurements stop updating? And why did an scpi power query make them start again? I don’t understand this

All I can think of is some task blocking the measurement task - that would stop variables from updating. I'll run task tracing next week to check

@wizath i get that a task hanging could block the measurement cycle.

The thing I don’t understand is why a power query over scpi would clear the blockage. That’s odd isn’t it?

What I didn’t check was whether other scpi commands also unblocked the measurement loop. So maybe it’s possible that the issue here is the Ethernet task hanging and blocking measurements

But if that’s the case does it mean that the software interlocks etc were also blocked? If so that’s serious!

Isn’t there a watchdog that reboots the fw if a task hangs?

@wizath so...this is the last known issue then (!!!) I still can't understand it. How could an SCPI command cause a task to unfreeze??

I'm beginning my search for the cause of the problem

In order to help understand this, do you think we should add more diagnostics? If so, what can we do? Maybe a command that dumps the status of all tasks (including when they last updated) so we can see what has crashed/frozen/dead-locked/etc

hmm...looking over the code again, I really don't see how querying the output power could do anything, since that command doesn't interact with the hardware. Odd

@wizath did you have any more thoughts about what could have caused this, or any other diagnostics we can add to try to track it down if it reoccurs?

I'll try to track this once I've complete v.1.4's for you

cool, thanks :)

Anyway, ATM no idea what's going on.

Well, the firmware version I saw this on was quite old, so it's possible it was a symptom of some other bug. I'd suggest that you have a look at the code and see if you can find any source of the issue, or think of any more diagnostics we can implement. If not, we can close and re-open if it occurs again.

For me though, the smoking gun is why the VCP interface only started reporting correctly after an SCPI query, and why the first SCPI query gave the wrong value, but after that it worked. It's really odd because as far as I can tell the SCPI query doesn't interact with the hw directly, so should have no impact. Sounds like a deadlock between the threads or something.

Maybe that was related to that weird bug with double to string conversion that caused HardFaults

Anyway need to confirm that this still occurs in recent firmware

maybe. Anyway, please have a think about potential causes before we close. If there are any other diagnostics we can add (e.g. to look at last update time of each thread to check for deadlocks) then it might be worth doing that -- otherwise if I do reproduce it, we still won't have any new info...

@wizath did you have any thoughts about potential causes of this, or any extra diagnostics we should add?

@wizath I can't reproduce this. So, unless you can think of any potential cause or any additional diagnostics to add, please can you close?

I'll check task runtime statistics and close

Cool, thanks. A VCP command to print task statistics would be awesome. Maybe we could even dump that on watchdog resets to see what task hung?

We've got task-stats already. BTW - if we look at output

Task            State   Priority        Stack   #
**************************************************
UDPServer       B       3               2310    4
CH MEAS         B       2               681     9

That would suggest if we spam SCPI hard enought there may be not enough time to run measurement task. Maybe that was the case? I'll adjust task priorities

Maybe. Note that from my description of the issue above:

  • I had the VCP in start mode
  • I had an SCPI connection open, but no traffic
  • I was getting data from the VCP, but the data wasn't updating (e.g. I could change the output power but always saw the same power and current reported)
  • When I ran a few SCPI commands I started getting fresh data

So it seems that some task had stalled. So statistics on when each task last started/stopped would be really useful. If that were available after a watchdog it might also help to debug that (since we could see which task stalled)

Did the scpi log show anything?

I haven't seen this in ages and the FW has changed quite a bit since then. I think we will end up with a major firmware refactoring at some point in any case, so I don't think it's worth keeping this open any longer.