unfoldedcircle/integration-home-assistant

Cannot connect to HASS "Protocol error: I/O error: unexpected end of file"

Closed this issue · 11 comments

ijc commented

When I attempt to configure the HASS integration it spins for quite a while at the "Select entities" page (with nothing to select) before eventually failing with error.ECONNABORTED. After that at the integration config page both the "Configured Entities" and "Available Entities" columns say "Not Available". The left hand column has a disconnected toggle which I cannot change.

I've been seeing this behaviour for several updates.

Screenshot from 2024-01-20 17-07-36
Screenshot from 2024-01-20 17-07-47

After restarting the remote its a bit different but still no entities:

Screenshot from 2024-01-20 17-17-55
(after a bit the red "error.ECONNABORTED" appeared at the bottom of the "Available Entities" column here too)

I'm using wss://«my internal domain»/api/websocket as the endpoint and with tcpdump on the server I can see traffic on the websocket: auth_ok, subscribe_events and get_states with success results and a big list of entity states. TLS certs are from letsencrypt and lighttpd is providing the forwarding -- there are no errors in the lighttpd logs or the packet capture.

Firmware is v1.6.4 and HASS integration says it is v0.6.1-dirty. HASS itself is 2024.01.4.

Logs:
UCR2_logs_core_scrubbed.txt
UCR2_logs_hass_scrubbed.txt

Is your HA server in your internal network?
If so, have you already tried with ws:// instead of wss://? Do you still get the same error?

This might also be related to a high number of entities in Home Assistant. Unfortunately the HA WebSocket API is very limited and doesn't allow filtering for retrieving (get_states) or gettting notified (subscribe_events ) of only certain entities, and we have to get all entity states after the connection is established. I'd be happy to be wrong here though. I can dig into the HA WS source code to see if there are any hidden features...
The Configure enhanced options allows increasing the WebSocket frame size (maximum message size). Try doubling the value to see if you get any entities back.

ijc commented

My HA server and the remote are on the same vlan. However my HA only exposes an SSL endpoint so ws:// won't work.

I'm relatively new to HA so I expect I've got a relatively low number of entities.

I wasn't able to setup advanced options on the existing instance, trying to go through the "Start Integreation Flow" just timed out. Even deleting it and starting again didn't work. I had to delete it and then reboot the remote to be able to go through the flow again. That seems like an issue worth addressing apart from the issues with websocket etc.

With that done I was able to restart the setup flow and set the max websocket frame to 16384 which is the max it allows. Unfortunately I got the error.ECONNABORTED again.

I tried restarting the remote again, since sometimes that seems to let it make progress, but no joy.

I also tried upping the connection timeout to 15, no luck with that either, `error.ECONNABORTED

ijc commented

I managed to build and run the integration externally which makes poking around and debugging a lot easier.

The "hang" appears to be happening when we get a protocol error in the StreamHandler resulting in the

Protocol error: I/O error: unexpected end of file

Which then seems to live lock, I wonder if the ctx.notify tries to send further traffic over the ws and so we end up looping? If I replace the ctx.notify with ctx.finished then that seems to avoid the issue. I've no idea if that is correct though.

The other issue I see is "Error parsing json message" from json_object_from_text_msg. It's quite a large message: 207,474 byes. If I dump it then it looks like maybe a chunk is missing from the middle. jq also says it is invalid and it points to a section like:

{"entity_id":"sensor.octopus_:{"id":"01HN

Where it looks very much like some content has gone missing after octopus_. This is 180,228 bytes into the message, so it's not a multiple of the 16384kb max frame size that I've configured or any other interesting seeming offset that I can think of.

I'll keep digging and see if I can figure out where the data might be going missing.

ijc commented

Running tcpdump on the HA side of things I can see that the message is well formed at that point.

@ijc thanks for looking into it!
You were faster, I'll start writing a small cli tool this weekend for testing HA server connections on a PC. It uses the exact same code and is a simple wrapper to provide an URL & access token. This will allow using the full debug log and investigate what's going on.

Which ctx.notify are you refering to?

ijc commented

I realise I didn't say (although maybe it's obvious) but the big payload with the missing chunk is the reply to the get_states.

Also while the request fails the vast majority of the time it's not not every time -- I have seen it succeed (i.e. just now on the 14th go) and then the list of entities is properly displayed in the web configurator.

I also tried upping the connection timeout to 15, no luck with that either, `error.ECONNABORTED

The connection timeout setting in the integration only set the overall request timeout, but not the TCP connection timeout. That was always using the framework's default 5 seconds. This is fixed now and can be configured in addition to the request timeout.

When I updated Home Assistant to 2024.1.6 last weekend, I've noticed connection timeouts for the first time as well, but only from my development machine and not the remote. Maybe just a coincidence... But increasing the connection timeout solved it.

I wonder if the ctx.notify tries to send further traffic over the ws and so we end up looping?

error!("[{}] Protocol error: {e}", self.id);
ctx.notify(Close {
code: CloseCode::Protocol,
description: Some(e.to_string()),
});

This Close message notify initiates a graceful WebSocket disconnect by sending a close message to the server (https://github.com/unfoldedcircle/integration-home-assistant/blob/v0.6.1/src/client/close_handler.rs#L16). For a protocol error it's probably best to immediately disconnect and to no longer use the connection.

Looping should not happen. The WS connection is force-closed if the close frame could be sent but no response is received, or if message sending failed. But I've never seen a protocol error or corrupt messages from HA. This is indeed very strange.
I'm also close to your response payload with a bit over 190 KB. The only difference is, that I'm not using wss for my internal HA server.

To simplify testing I've added the ha-test tool. Binaries for Windows, Linux, macOS are attached to GitHub releases. It's a simple connection test to the server and requesting the entity states. Usage is shown with --help or in #49.

ijc commented

Thanks, I'll try the new tool ASAP (might not be until the weekend though).

ijc commented

I've tried the tool, very useful for isolating things.

I've injected an socat between the tool and the lighttpd proxy on the HA system (which terminates tls) so I can use ws:// rather than wss:// and that let me capture a tcpdump at the client end (before socat) and at the server end (between lighttpd and HA).

I seem to have managed to catch using tcpdump a case where I see different bytes captured at those two ends for the corrupt JSON object.

I think that would point towards a problem on my end (probably lightpd config or bug) and not with this integration.

ijc commented

Apologies for the lack of updates.

I never did figure out what was going on here, however I ended up deciding to run the integration externally as a container alongside my hass instance rather than on the remote and that seems to have avoid the problem -- probably because it can use HTTP to localhost instead of HTTPS.

There was a new issue though:

I was using wss://<DNS NAME>:9443/ws to refer to the external integration (where <DNS NAME> has a valid letsencrypt cert) but was seeing quite frequent "session not available" in the logs -- particularly after waking the remote at the end of a TV show to turn everything off (the amp is controller via HASS). After 5-10s it kicked into life and I could turn the amp off.

I think this corresponds to Error connecting to driver wss://<IP ADDR>:9443/ws: ServiceUnavailable("Failed to connect to host: presented server name type wasn't supported")in the logs from the remote (where <IP ADDR>is the correct one for <DNS NAME>).

I've switched to ws:// URL, will monitor that for a bit.

I'll open a new issue if things persist since I'm sure it's something different to this one.