gdgib/span

Version 202342 seemingly broke integration this morning?

sargonas opened this issue · 23 comments

EDIT: for a quick jump to what you need to change to manually fix this, see below at :#21 (comment)

This morning around 9:50am PST my panel briefly went offline and then returned, which is generally indicative of a firmware update. The web UI now shows it running spanos2/r202342/04 which is not listed on their website of software revisions, which I assume to mean is a new version that just came out today. However, after this update my integration is no longer able to talk to the panel, despite several efforts to restart the integration and HA itself after unlocking the panel.

All of my logs are full of the following summarized error: ERROR (MainThread) [custom_components.span_panel] Unexpected error fetching span panel SN-TODO data: 'remainingAuthUnlockButtonPresses'
and
KeyError: 'remainingAuthUnlockButtonPresses'

I first did the 3 button press proximity verification, just to be safe, but that had no effect. I also verified that my UptimeKuma monitoring of the panel's web state continued to work as expected, using an auth token in the header, without issue to return back the panel status... which between that, my manual web interactions, and another custom interaction I wrote would indicate that the panel itself is working as (generally) expected but something unique to the Integration's communications with the panel is potentially wonky as of this update due to a presumed change in the panel firmware.

Additionally, I tried to add the panel fresh from scratch using my test-instance of HA that I use for general update validation, debugging, and development, however this also failed, resulting in a generic "Unknown error occurred" in the UI and a KeyError: 'remainingAuthUnlockButtonPresses' in the logs.

For full context, the entire error block of my logs, with traceback, when I restart HA or reload the integration is as follows:

2023-11-13 10:25:03.333 ERROR (MainThread) [custom_components.span_panel] Unexpected error fetching span panel SN-TODO data: 'remainingAuthUnlockButtonPresses'
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 290, in _async_refresh
    self.data = await self._async_update_data()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/update_coordinator.py", line 246, in _async_update_data
    return await self.update_method()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/config/custom_components/span_panel/__init__.py", line 50, in async_update_data
    await span_panel.update()
  File "/config/custom_components/span_panel/span_panel.py", line 51, in update
    self.status = await self.api.get_status_data()
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/config/custom_components/span_panel/span_panel_api.py", line 59, in get_status_data
    status_data = SpanPanelStatus.from_dict(response.json())
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/config/custom_components/span_panel/span_panel_status.py", line 37, in from_dict
    remaining_auth_unlock_button_presses=data["system"][
                                         ^^^^^^^^^^^^^^^
KeyError: 'remainingAuthUnlockButtonPresses'
mbbush commented

I'm running the same firmware version, and my HA integration is also broken. It looks like there was a change in the output format of the api/v1/status endpoint, which no longer has a remainingAuthUnlockButtonPresses integer value, but instead a proximityProven boolean.

This should be a reasonably straightforward change in the integration to no longer deserialize that key, or at least to fail gracefully when it's missing.

The rest of the integration should be working fine, it's just throwing an exception during setup and then halting.

I was just poking through the code as well and beginning to suspect the same: that a value reference simply changed.

I might be mildly out of my depth to try to attempt a proper fix to this worthy of a PR myself, but I'll give it a shot and see if I can't do either a rough edged replacement of the handling of that key or a graceful fail on my own fork and get that in place in the short term for now.

Doing a little stumbling around in the dark using limited python knowledge and common sense, but about to start testing on my local setup some changes currently living here, if anyone wants to critique or fix my efforts: https://github.com/sargonas/span/tree/v202342-fix

The issue I'm currently poking around at is after changing reliance on remainingAuthUnlockButtonPresses to a reliance on a proximityProven boolan to be True, is that when adding a Panel on a test environment I now get log errors of homeassistant.data_entry_flow.UnknownStep: Handler ConfigEntriesFlowManager doesn't support step auth_menu, but I didn't touch async def async_step_choose_auth_type where auth_menu lives within config_flow.py, only async def async_step_auth_proximity.

So, I have to assume that either something in my changes in that section have a downstream effect that I need to track down OR there is something wrong with the code in my custom branch of the integration that is not loading correctly when the code from the main branch DOES load fine (where the original bug above reappears as expected).

EDIT: Update - interestingly... I just reverted the code to before my changes in a new branch, then I simply commented out the remainingAuthUnlockButtonPresses key and the variable it is assigned to, as well as commented out the check on that variable's value to just.. skip right over it and assume the panel is unlocked, for testing, and yet the same error about auth_menu being invalid returns. At this point I think something is misconfigured with my test env for HA that is resulting in the Integration perhaps not loading correctly and therefore not making its config flow steps available as expected.

Update 2:
If I revert all of my changes to the current broken release code, and simply make two changes: remove the line that serializes that now non-existent value from the API and then change the line that assigns that serialized value to a variable, to instead set that variable to the value of 0, then the same auth_menu error returns. This tells me one of two things:

1: there is an underlying code issue in the core integration that is no longer valid and now broken at conveniently the same time (highly unlikely)
or
2: my test integration install is not installed right, and therefore creating a broken test environment for me (most likely).

In any case, looks like I need to dive into that before I can test my fix thoroughly, but my current proposed fix still lives in the branch linked above if anyone wishes to sanity check it while I figure out how to properly test it.

Okay, after further troubleshooting it turns out the code in main is actually broken after all. While doing some comparisons, I realized that what we are running as 0.0.7 is slightly different from what is in main, and if I manually make my proposed changes by hand to my existing install code from 0.0.7, it comes right back up right as rain with the changes.

I am going to go ahead and PR my current fixes to the repo. Then I'll go back and start a new deep dive to find out what is wrong with the code currently in main, unless @gdgib beats me to it first. I have now also updated the broken config_flow.py references in main as well, so now PR #22 will fix everything if a new release is cut from main once merged in!

In the meantime while we await a merge, @mbbush (and others) if you take the 5 or so changes I have in https://github.com/sargonas/span/tree/v202342-fix and make them by hand to the code in your existing install, it should get you back up and running after a quick HA restart. You also have the option of just manually loading my fork's main into HACS as a custom integration temporarily as well.

(Note that if you make manual edits then you will need to either rename the related value in strings.json to auth_proximity as well from it's old value, OR leave the old value in config_flow.py and not use that name I reference in my edits.)

@sargonas Do you know the effect on a panel WITHOUT the new FW? (I have two and apparrently only one has taken the new firmware yet.)

@sargonas Do you know the effect on a panel WITHOUT the new FW? (I have two and apparrently only one has taken the new firmware yet.)

You bring up an EXCELLENT point that I didn't think about. My basic understanding was that updates were forced, and rolled out install-base wide over roughly a 24 hour period so I never stopped to think about a scenario where someone was running an older version, as I just automatically assumed everyone would have it within a day or so of the update.

The PR with my updates will only work if you are running 202342 or higher, and will presumably error with the exact same error the old version has due to the missing key on an older version.

I'll go back tonight and figure out how I can make it gracefully fail if that key is missing from the API... that'll be uncharted territory for me but so far I've been able to crash course/learn things with this pretty quick, lol

I'm thinking the easy way is to just find a way to make it skip over requiring that API value, and allow things to continue without it and sort of just go into a "edge case" config mode where you will need to know to push the button three times manually in order for it to continue from a stalled state. Alternatively the over engineered solution is insert and if-else that enumerates the API values and if the version number is below a certain point it uses the old value and references old code in config_flow.PY and if it's newer then it uses the new values and new code instead... then one day a year from now that old code and logic can be removed out when it's safe to assume no one will have an old version for sure.

Well *&%^ that's what I thought - I can either get ONE panel working or the OTHER,

Not your fault - but yes, the ability to do one then if it breaks gracefully fail back is NECESSARY. A lot of us have two or I've even heard of three panels and there's absolutely no guarantee when they get updates. I've had my panels update weeks apart.

Maybe pull your fork in as a custom SPAN#2 for the time being (yes it's that urgent here) Is that possible? Two customs? they wont interfere?

I'm gonna try to figure out tonight… (because bear with me my python skills are quite rudimentary, and I'm learning on the fly here)… And see if I can't insert some if logic, reinstate the old code alongside the new code, and have it pull the right API value and leverage the right code based on the version number... however I think versions are actually referenced as a slightly obscured string with slashes, and not just a number… So I'm also gonna have to break out the ole regex toolkit while I'm at it lol

Maybe pull your fork in as a custom SPAN#2 for the time being (yes it's that urgent here) Is that possible? Two customs? they wont interfere?

Two customs won't interfere, just give mine a unique folder name (like span_panel_temp) and then edit the manifest.json entry for the domain to match that. However you would have to either re-add a panel to it as a new device (with new device IDs for it) OR dig deep into the config files and get funky with assignments between entities and integrations. It's complicated to preserve an entity and move it to a different integration, but it's doable if within your skillset.

However you would have to either re-add a panel to it as a new device

Unfortunately, I've become very adept at this and I'm currently waiting for a partial restore to complete because I broke something trying to 'fix' my panel and it's just easier to restore to last nights backup.

then edit the manifest.json entry for the domain to match that

This part does not seem interesting to me at all for a temp job. so YES they do conflict without manual intervention - I was afraid of that too.

Thanks for looking. I'm going to see what I can do to get my 2nd panel to update then swap to your fork unless @gdgib pulls in the merge ASAP.

(but while you're at it... When you get done fixing the auth - your next auth challenge: you'd make a LOT of happy people with this one - its the durable auth PR on Main.)
https://github.com/gdgib/span/pull/4/commits

@sargonas, I'm reading that PR above, it also has the info about how to query the software version - they present it as a sensor for the panel.

@nathan-curtis I think I have you set!

I added logic that will load either/or variable depending on what firmware you have for the given panel in question during setup, and then run the right setup flow logic as well based on that. (However, it does not count down the number of presses anymore and just tells you to pop the button 3 times no matter what, I might go back and add this but its a lot of string translation logic to do that needlessly IMHO)

If you are comfortable grabbing code straight from a branch and loading it manually, please try the code here: https://github.com/sargonas/span/tree/dual-logic

It works for me on my new firmware unit, and so if it works on your older one as well, then I consider it good enough to add to my pending PR, but I'll wait till I hear from you first.

Edit:

its the durable auth PR on Main.)

I think this is included in mine? I based my PR off of Main, and not release 0.0.7, so it has everything in Main since then as well as my changes.

(Thanks for the version query tip. That was actually my original plan, but they return the version is an annoying format, spanos2/r202342/04. I didn't feel like breaking out the regex muscles and stripping that down to a simple version number I can >< against, so instead the logic just checks to see if proximityProven exists, and falls back to the old method if it doesn't.)

Rock on! I'm waiting on the restore to complete - Good Ol HA restore process! ...so I figure another hour or so before I can swap branches. I'll uninstall both put in your code then reinstall both. You'll get a test on two different firmware through the onboarding on both.

(I've built my automations around the default entity names the panels bring in, so assuming you didn't change anything in that logic, I only have to update 4 entities after I install to get back in business)

Nope, didn't touch any other logic at all. Going to step away for about an hour myself, so looking forward to hearing back from your testing. If your results match up with mine, then I'll go ahead and merge this branch into both my main, and also my branch driving PR #22 in hopes that it can be merged in soon.

@sargonas - ran out of time tonight I'll get it as soon as I can - Have to do a real work thing first then I'll circle back. If you merge it into your branch before tomorrow LMK, I have not attempted the edits yet.

@nathan-curtis my 202342-fix branch has everything discussed above originally (based of gdgibs main and not 0.0.7) but not the old/new firmware logic, that is in the dual-logic branch. My Main is currently a mirror of the 202342-fix branch as well, but since the open PR is tied to my branch and not Main, I can safely update Main further if that will help you to make testing easier. If it will, then I can merge dual-logic into my main and then you can just drop my repo URL into HACS directly as a custom source... just let me know!

It will absolutely make it easier. I got bit by the node red thing this morning and am just now recovering. I have a few meetings this afternoon and should be back on to test late this afternoon (us Central)

Sounds good. Just updated main for you, so feel free to pull from that!

finally got my install back in a place to do this, sorry for the delay- I'm pulling the code now. Givc me a bit to install / test
Good start - swapped integrations and both panels detected. Have to go outside to prox unlock them...

Install Notes:
Starting New FW panel
Durable auth part is enabled! I WANT to do this but I dont know how to get a token yet I clicked to investigate the UI and it's one way - no return path. Once I go 'token' it doesnt want to do anything else.

  • and We may have our first bug (relevant to this issue though - separate) Reload HA, moving on.
    Reboot of HA - no more detection of that panel - interesting.
    Manual Integration for panel 107 started by IP address.
    Panel mounted no issue. Sensors imported as expected on FW: spanos2/r202342/04
    Have some normal cleanup to make it perfect but panel 1 looks good.

Click Auth token to Configure the panel that way and look around and now I cant go back to manual...

Install Notes:
Starting New FW panel (138)
prox unlock complete
Configure (from detection panel)
Manual
Complete
Presented with Select Area Panel
Integration success - Expected entities found
Panel firmware reports: Firmware: spanos2/r202333/12

@sargonas Well done. Two panels imported both firmware accounted for. - entities look good. Ship It. Bug the config flow issue above as a separate issue.

Wait - something is up with the door sensors (false alarm - needed to config something - ship it)

Thanks for all that! I'm a little lost following the flow of the issue you are reporting, but I'll follow up with you separately on that in a new thread for the bug to get clarity on that.

As to the question about getting an auth token, mbbush, in his ever awesome support of this project, explained that yonks ago in an issue in a different repo: galak#11

(I think am going to add his instructions to the readme I think in my PR as well.)

As of now though, the dual-logic is now merged into my v202342-fix branch and by extension, into PR #22 as well. This issue is now, in theory, pending to be resolved and is configured to auto-close when the PR is merged.

mbbush commented

I'm gonna try to figure out tonight… (because bear with me my python skills are quite rudimentary, and I'm learning on the fly here)… And see if I can't insert some if logic, reinstate the old code alongside the new code, and have it pull the right API value and leverage the right code based on the version number... however I think versions are actually referenced as a slightly obscured string with slashes, and not just a number… So I'm also gonna have to break out the ole regex toolkit while I'm at it lol

Well done.

Accessibility to newcomers is one of the great strengths of the python language, and I think the fix you implemented is perfectly reasonable. I bet you learned a bunch while you did it too, which usually feels good. For what it's worth, what you implemented (conditional based on the keys you're actually looking for) is more robust and a better solution than what you initially proposed (conditional based on parsing the version string).

Thanks, your kind words are much appreciated!