Resubscribe real-time and cumulative usage after connection outage
JacobWasFramed opened this issue · 4 comments
Currently running the beta with real-time energy usage. I notice that after an internet connection timeout, the real-time usage doesn’t resubscribe when the connection is restored. The cumulative usage only seems to fail if the outage happens during a reading, due to an internet or Duke API outage longer than the 15-minute interval. It seems like any outage requires a restart of HA to restore. I’m unsure if this can be addressed here or should be addressed in the Duke Python repository.
I think a good solution would be to poll every 1-5 minutes after a connection is out until a successful response is received and at that point resubscribe or reinitialize the integration.
Potentially related: I am unable to restart the integration without restarting HA.
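The polling idea suggested above could be sketched roughly as follows. This is only an illustration, not code from the integration or from pyduke-energy: `poll_until_restored`, `check_connection`, and `resubscribe` are hypothetical names, and treating `OSError` as "still down" is an assumption.

```python
import asyncio
from typing import Awaitable, Callable


async def poll_until_restored(
    check_connection: Callable[[], Awaitable[bool]],
    resubscribe: Callable[[], Awaitable[None]],
    interval_seconds: float = 60.0,
) -> int:
    """Poll at a fixed interval until the check succeeds, then resubscribe.

    Both callables are hypothetical placeholders supplied by the caller.
    Returns the number of connection checks that were made.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            if await check_connection():
                break
        except OSError:
            # Assumption: network-level failures mean the connection is still down.
            pass
        await asyncio.sleep(interval_seconds)
    # The connection is back, so re-establish the real-time subscription once.
    await resubscribe()
    return attempts
```

An `interval_seconds` between 60 and 300 would match the 1-5 minute range suggested above.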
Thanks for bringing this up. When this happened, did you see the integration show a failure on the Integrations page? (It would show with a red outline and an error message.)
Honestly can’t recall. I “introduced” a network failure just now and the plug-in did not crash. The cumulative usage was able to resume, but real-time MQTT did not resubscribe.
The errors that I have seen are different than the self-introduced network failure.
The original errors I saw when posting this were as follows:
Error: (update_coordinator.py)
Error fetching homeassistant data: Error communicating with Duke Energy Usage API: Request failed with unexpected error [https://cust-api.duke-energy.com/gep/v2/auth/oauth2/token]: 400, message='Bad Request', url=URL('https://cust-api.duke-energy.com/gep/v2/auth/oauth2/token')
Warning: (realtime.py)
Error requesting smartmeter auth, will retry after 5 seconds.
Warning: (realtime.py)
Unexpected message: <paho.mqtt.client.MQTTMessage object at 0x7fe757d69c80>
Unexpected message: <paho.mqtt.client.MQTTMessage object at 0x7fe74f6274a0>
What I’ve just experienced from a self-introduced network failure is as follows:
Error: (update_coordinator.py)
Error fetching homeassistant data: Error communicating with Duke Energy Usage API: Request failed with unexpected error [https://app-core1.de-iot.io/rest/cloud/smartmeter/usageByHour]: 504, message='Gateway Time-out', url=URL('https://app-core1.de-iot.io/rest/cloud/smartmeter/usageByHour?startHourDt=2021-11-20T05:00&endHourDt=2021-11-21T05:00')
Error fetching homeassistant data: Error communicating with Duke Energy Usage API: Request failed with unexpected error [https://app-core1.de-iot.io/rest/cloud/smartmeter/usageByHour]: Cannot connect to host app-core1.de-iot.io:443 ssl:default [Network unreachable]
Error: (realtime.py)
MQTT disconnect error, result code: Out of memory. (This may not be accurate)
Either way, real-time MQTT doesn’t resubscribe, whether the cause is an API or an internet outage, and I have had cases where the cumulative usage was not updating either until I restarted HA. I know Duke had a 1-2 week outage at the end of October/beginning of November, but this was after that time. Thanks for looking into this; I appreciate your time. If I see an outage again, I’ll make sure to update here with logs.
I had a network outage over the weekend, so I did experience this myself. The MQTT stuff does have reconnect logic, but we have to make a separate API request to Duke to initiate the MQTT stream, and I think that API request is the one causing the issues, as it is not re-attempted after some period of time. I'll look into it when I have some time.
I believe this should be fixed by the updated pyduke-energy version in v0.1.0b9 (#53). In the event of a connection loss, the error will be caught and we will attempt to reconnect indefinitely with an exponential back-off strategy (starts at 1 minute, maxes out at 60 minutes). The times can be adjusted later if wanted.
A warning will be logged with the back-off time when this happens.
I tested this by cutting network access to the Docker container, and it worked for that case. However, depending on where and when the internet drops, a different exception may be thrown. I believe I've caught all of them, but there is a chance I missed some. Any that I missed will be logged as an error and we will not retry. Please open an issue if that happens.
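For readers curious what such a back-off loop looks like, here is a minimal sketch under stated assumptions. It is not the actual pyduke-energy code: the names `backoff_delays` and `connect_with_backoff`, the doubling factor, and the choice of `ConnectionError`/`TimeoutError` as the retryable exceptions are all hypothetical. Only the 1 minute start and 60 minute cap come from the comment above.

```python
import logging
import time
from typing import Callable, Iterator, Tuple, Type, TypeVar

_LOGGER = logging.getLogger(__name__)
T = TypeVar("T")

INITIAL_BACKOFF_SEC = 60.0      # 1 minute, as stated above
MAX_BACKOFF_SEC = 60.0 * 60.0   # 60 minutes, as stated above


def backoff_delays(
    initial: float = INITIAL_BACKOFF_SEC,
    maximum: float = MAX_BACKOFF_SEC,
    factor: float = 2.0,  # assumption: the delay doubles on each failure
) -> Iterator[float]:
    """Yield an endless sequence of delays that grows until it hits the cap."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)


def connect_with_backoff(
    connect: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, waiting longer after each known failure."""
    delays = backoff_delays()
    while True:
        try:
            return connect()
        except retry_on as err:
            # Known connection failure: warn with the back-off time and retry.
            delay = next(delays)
            _LOGGER.warning("Connection failed (%s); retrying in %.0f seconds", err, delay)
            sleep(delay)
        except Exception:
            # Anything not anticipated is logged as an error and not retried.
            _LOGGER.exception("Unexpected error while connecting; not retrying")
            raise
```

With these assumed values the waits would be 1, 2, 4, 8, 16, and 32 minutes, then 60 minutes for every later attempt.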