sfeakes/AqualinkD

Hayward AquaRite utterly confused by AqualinkD

christiaanbrand opened this issue · 30 comments

So here's a fun story:

Earlier this year I upgraded my Jandy RS16 from "Q" to "T.2". I wanted to control some new IntelliBrite lights so had to do this upgrade (they're not supported by the older CPU revision). Round about the same time I changed my aqualinkd.conf to enable:
rssa_device_id=0x48
and
extended_device_id=0x31

Here's where the fun started: I also have a Hayward AquaRite SWG (latest version = 1.59) hooked up to RS485. It used to work great. However, I started to notice that it was no longer producing chlorine reliably. Upon further investigation I saw the system go to "Generating Chlorine", then stop, with the "No Flow" light blinking. About 1 minute later, the cycle would restart.

I suspected the flow switch. To figure out whether it was actually the issue, I plugged an RJ11 into the flow switch plug and shorted pins 3/4 (simulating a "closed" flow switch). Problem still remained. So, I thought: it must be the SWG mainboard. Bought a new one from Hayward. Same problem. Next I replaced the flow switch for good measure. Problem remained.

Then I had the bright idea to unplug the RS485 cable. Problem went away. I then plugged it back in and shut down aqualinkd. Problem stayed away. Started aqualinkd back up. Problem came back.

Next I'm going to disable rssa_device_id and extended_device_id to see if that makes any difference. Please let me know if you have any ideas or things I could do to troubleshoot.

(In the midst of this I also locked out all SWGs on my Jandy system - I forgot to set the emulation of the new board to "Jandy" before connecting it up, but found someone online who helped me fix that with a hex editor 😊)

It's funny you should mention this.. I have had similar strange behavior with AqualinkD + Jandy Aquapure 1400.

In my case, more often than not, if AqualinkD was running when I turned on my system in the morning, the main control panel screen would not register the SWG as even connected.

To resolve, I'd have to go and hard reset the Jandy Aqualink at the circuit breaker box.

I was able to figure out a workaround by setting up a cron job to stop the AqualinkD at ~ 2am PST and then start it back up ~ 10 minutes after my first program of the day would start running.

Ever since doing that, I haven't seen it pop up again.

May be worth a shot.

Same... It seems to happen when you drop or lose power on the Aqualink control panel. I just stop and restart AqualinkD and everything works fine again. (I also have a cron to do it twice a day in case of a power hit.) I use a Chlorinator Translator which allows a Pool Pilot SWG to look like an Aquapure and prevents the lockouts.

I've confirmed it. extended_device_id is the problem. Doesn't matter which IDs I use. The moment I enable it, the AquaRite goes crazy. When I disable it, everything's fine. It's very sad, because I have two VSPs and Intellibrite lights that need extended_device_id, but I need chlorine in my pool more than I need those, so I guess I'll have to live with it for now.

Let me know if there's a way to debug this - would be happy to try and get to the bottom of it.

@christiaanbrand can you please list everything on the RS485 bus, including all keypads.
You were using an Aqualink Touch ID (0x31) for extended programming. You said you tried other IDs; did you also try all the One Touch IDs?

Valid IDs for One Touch are 0x40, 0x41, 0x42 & 0x43.
Valid IDs for Aqualink Touch are 0x30, 0x31, 0x32 & 0x33.
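
As a quick sanity check before editing aqualinkd.conf, the two ID ranges listed above can be encoded in a few lines of Python. This is only a sketch: `classify_extended_id` is a hypothetical helper, not part of AqualinkD, and it does not check whether the ID is already taken by a physical keypad on the bus.

```python
# Valid extended_device_id values, as listed in the comment above.
ONETOUCH_IDS = {0x40, 0x41, 0x42, 0x43}
AQUALINK_TOUCH_IDS = {0x30, 0x31, 0x32, 0x33}


def classify_extended_id(value: str) -> str:
    """Return which keypad type a config value such as '0x31' emulates.

    Hypothetical helper for illustration; not an AqualinkD function.
    """
    device_id = int(value, 16)  # base 16 accepts the '0x' prefix
    if device_id in ONETOUCH_IDS:
        return "OneTouch"
    if device_id in AQUALINK_TOUCH_IDS:
        return "Aqualink Touch"
    raise ValueError(f"{value} is not a valid extended_device_id")


print(classify_extended_id("0x31"))  # Aqualink Touch
print(classify_extended_id("0x40"))  # OneTouch
```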

A few other details that would help:
What else do you have running on your Pi?
Can you run the script ‘pi_health.sh’ in the extras directory and post the output?
Can you also detail what you did to the control panel to clear the SWG lockout?

SWGs are the most finicky devices on the bus. The only way I can see this kind of problem happening is if AqualinkD is replying to the wrong messages, or replying late so the sequence is out of order. Late replies would most likely be a Pi performance issue.

I tried 0x41, 0x42, 0x30 and 0x31 - they all had the same problem with the AquaRite.
(0x40 crashed aqualinkd - but I have a OneTouch panel so maybe it's because that's already using that address - didn't dig in too much).

On the RS485 bus I have:

  1. 1x OneTouch panel
  2. RS16 (which is split into two 8-switch units, connected to each other using RS485)
  3. 2x Pentair VSF pumps
  4. Raspberry Pi with RS485 adapter
  5. AquaRite SWG

On the Pi I'm running NodeRed (which connects to aqualinkd using MQTT).

Output of pi_health:
root@raspberrypi:/home/pi/software/AqualinkD/extras# ./pi_health.sh
CPU temperature: 51.1
CPU Voltage: 0.8600
Undervolt : OK
Undervolt history : OK
Throttled : OK
Throttled history : OK
Frequency Capped : OK
Frequency Capped history : OK
root@raspberrypi:/home/pi/software/AqualinkD/extras#

To clear the SWG lockout (but this issue I'm reporting existed long before I did that):

  1. Use Jandy's Windows software to read all the memory from the RS16
  2. Use a hex editor to change the value where they "remember" they saw an incompatible SWG
  3. Write back the changed image to RS16 using their utility

I can reproduce the issue rather easily - would it help if I logged everything happening on the bus?

Any idea what I could try? Would love to get lights and speeds working again.

Thinking about this some more: Communication with the SWG should be solely between the SWG and the Jandy RS16, right? Aqualinkd should never try to talk to it? Which means that technically, even if aqualinkd sent out-of-order packets or something, the SWG should ignore them because they should never be addressed to it?

That makes all this feel very much like a buffer overflow issue where the SWG literally crashes and reboots. Is there any way to look at the size of packets going out? And would these necessarily be larger if extended_device_id is configured?

The Jandy protocol has no "from" field in the packet, so if something is sent late (i.e. out of order), the receiver has no way to know.
It’s not the SWG that’s confused; it’s more likely that the control panel is confused and not sending the right signals. You are right that AqualinkD does not talk to the SWG directly, but if it sends a message to the control panel when the panel is expecting a message from the SWG, things can get messed up. But this is just a theory at the moment. A packet can be late either because AqualinkD is not responding quickly enough, or because the Linux kernel is not handling (or is buffering) the USB2RS485 communication. The latter is where I’ve seen some issues.
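
To make the "no from field" point concrete, here is a minimal Python sketch of splitting one frame into fields. The layout used (DLE STX, destination, command, data, checksum, DLE ETX, with the checksum being the low byte of the sum of everything before it) is an assumption based on how this protocol is commonly described, not something stated in this thread. The sketch also ignores byte escaping inside the data, and `parse_frame` is a hypothetical name.

```python
DLE, STX, ETX = 0x10, 0x02, 0x03


def parse_frame(frame: bytes) -> dict:
    """Split one frame into its fields (assumed layout, see the note above)."""
    if len(frame) < 7:
        raise ValueError("frame too short")
    if frame[:2] != bytes([DLE, STX]) or frame[-2:] != bytes([DLE, ETX]):
        raise ValueError("missing DLE STX / DLE ETX framing")
    # Assumed checksum: low byte of the sum of every byte before it.
    if sum(frame[:-3]) & 0xFF != frame[-3]:
        raise ValueError("bad checksum")
    return {
        "destination": frame[2],
        "command": frame[3],
        "data": frame[4:-3],
        # No "source" key: nothing in the frame says who sent it.
    }


# Example frame addressed to ID 0x31 with command 0x00 and no data.
frame = bytes([0x10, 0x02, 0x31, 0x00, 0x43, 0x10, 0x03])
print(parse_frame(frame))
```

Because only the destination is present, a receiver can infer the sender only from timing and sequence, which is why a late reply can be mistaken for a reply from a different device.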

The only real way to test this is by adding another device on the bus to monitor the traffic.

The first thing I’d try (without having another device to monitor) would be to compile AqualinkD on the pi you are using. Instructions are in the wiki.

But won't the SWG still ignore anything not addressed to it?

Anyhoo - I just compiled aqualinkd on the Pi, but it still does exactly the same.

I do have a second Pi and RS485 adapter. Let me know how we can put those to use to figure this out.

If the control panel doesn’t send a “heartbeat” to the SWG, then the SWG will go into manual mode, and I think this may be what’s happening. The control panel getting messages out of order could make it stop sending the heartbeat.

With a second Pi and RS485 adapter, run serial logger (in this repo) in debugging mode, then configure and start AqualinkD on a separate Pi so that the problem occurs. If you post the output from serial-logger, that should show whether AqualinkD is creating the problem. Please also post your AqualinkD config, so I’ll know what IDs it’s using.

I tried one more thing: While the SWG is working normally, I tried removing the RS485 cable. It does not have the same effect. It simply continues chlorinating and switches over to manual mode. The "no flow" light does NOT start blinking like what I've been experiencing.

Please find some files here:

My current config: https://drive.google.com/file/d/1K4J5vktVrqK9gvQtUn1yr5kgy6cXevx5/view?usp=drive_link
(Note that extended_device_id is currently commented out. When I uncomment it, the problems start).

Here's a capture I took yesterday of everything working normally. Extended_device_id commented out:
RS485: https://drive.google.com/file/d/183pahKk3UZqaHNzhqX-aUY4E7EqXZOcf/view?usp=drive_link
RS485raw: https://drive.google.com/file/d/1-E95LM3gaSe00Y1MwyrTyQH9c4sy04n7/view?usp=drive_link

Yesterday, I could NOT reproduce the issue, even when uncommenting extended_device_id. But when I got to the SWG today, it would just continuously blink "No flow". When I shut down aqualinkd or comment out extended_device_id, it returns to normal. Here's a capture I took while it was doing the "no flow" thing:
RS485: https://drive.google.com/file/d/1p8zJiT74-tVihaB8cCLJAonvBgcrd9ie/view?usp=drive_link
RS485raw: https://drive.google.com/file/d/1VJ6ZSOCjqc6peBzjfZsa-yzkcC4yOl1P/view?usp=drive_link

Please let me know if you can see anything. I'm at a loss here.

(Separately, if you can use a working RS16 P&S version Q CPU board let me know. Happy to ship to you).

Any idea from the logs?

One additional thing I noted: when aqualinkd acts up like this, on the Aqualink OneTouch control panel inside the house the SWG % goes to 0%, and my IntelliFlo VS pumps either report as "offline" (even though they're on) or report some ridiculous number of watts (like 58995). Stopping aqualinkd immediately makes the panel return to normal. Also, as mentioned earlier, when not running in "extended_device_id" mode, this issue doesn't occur.

In your config, you have some IDs commented out. You should uncomment them and set them appropriately.

# device_id=0x0a
# rssa_device_id=0x48

Can you try disconnecting your one touch, see if that makes a difference.

I went over the logs, and I did not see any messages being out of order, so it doesn’t look like it’s that. But I do need to go over and decipher the commands the control panel is sending / requesting next. That will take some time.

yeah, I saw that I accidentally had them commented out, but from the logs it looks like 0x0a was used even though I didn't set it explicitly. I did add them back in, but same problem persisted.

I'll try to disconnect the OneTouch to see if it makes a difference.

Unplugging the panel didn't fix things :(

Unplugging the OneTouch, leaving it unplugged, and starting AqualinkD didn't work? I was looking over your logs and thought maybe I had an idea of what was going on, but if that didn't work it's probably not what I was thinking.

In your logs, I can see that the control panel is actually telling the SWG to go to 0. That's what causes the problems. Why the control panel is doing that is what we need to solve. It "looks" like if the panel doesn't talk to the Pentair pump within ~500 messages, it assumes the pump is offline and therefore sets the SWG to 0. If it gets a pump message within ~300 messages, all seems fine.
So finding any way to reduce messages is what I was thinking of as a test of that theory.
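
One way to check this theory against a capture is to count how many messages pass between consecutive pump messages. The Python sketch below assumes the log has already been reduced to one destination ID per message. `pump_gaps`, the pump ID 0x60, and the sample trace are placeholders for illustration, not values taken from the posted logs.

```python
from typing import Callable, Iterable, List


def pump_gaps(destinations: Iterable[int],
              is_pump: Callable[[int], bool]) -> List[int]:
    """Count the non-pump messages seen between consecutive pump messages."""
    gaps = []
    since_last = None  # stays None until the first pump message is seen
    for dest in destinations:
        if is_pump(dest):
            if since_last is not None:
                gaps.append(since_last)
            since_last = 0
        elif since_last is not None:
            since_last += 1
    return gaps


PUMP_ID = 0x60  # placeholder ID for the pump
# Made-up trace: a quiet stretch, then a noisy stretch of keypad traffic.
trace = [PUMP_ID] + [0x40] * 250 + [PUMP_ID] + [0x31] * 520 + [PUMP_ID]

gaps = pump_gaps(trace, lambda dest: dest == PUMP_ID)
print(gaps)                           # [250, 520]
print([g for g in gaps if g > 300])   # [520]
```

Gaps above the ~300 mark would be the ones to line up against the moments the panel sets the SWG to 0.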

I just unplugged the panel while the issue was occurring. I didn’t stop and start aqualinkd. Left it for ~5 mins but the issue remained.
Usually when I stop aqualinkd the issue is resolved straight away.

Should I try it again, but this time restart aqualinkd?

So, what you said made me think:

I reproduced the weird behavior if I unplugged the rs485 cable of the intelliflo pump while it was running. The swg then gets set to zero.

So the issue is definitely that somehow running aqualinkd with extended mode disrupts the comms with the pump, which is what causes all of this.

This is 100% exactly what I have experienced over the last couple of years. Pentair VSF, Jandy, Hayward Goldline SWG. I chased every other possibility I could find (new wiring, different USB adapters, rewiring the connectors, etc.). It really does seem like it's a strange issue between the pump and the Jandy control board ONLY when the AqualinkD service is running with extended ID options. Doing the same tests as christiaanbrand, I am getting the same results.

My previous solution was basically to shut down the service for a day or so, then start it back up. For some reason it works after that. I have also just started testing what happens with more combinations, such as priority mode set, while also disabling the various extended programming settings as christiaanbrand has said. So far, keeping "#extended_device_id_programming = yes" commented out may be helping, but I will need to test for a while.

Responded while updating my comment. :-). Yea.. that is working, but without VSF, however the 3 tests so far tonight with "#extended_device_id_programming = yes" commented out, but "extended_device_id=0x30" active seems to be working to program VSF while also not having the pump/swg issue.. so far..

Well.. never mind. I'm not getting the Watt/RPM craziness, but the SWG does continue to bounce with only "extended_device_id=0x30" enabled. Guess it is back to more testing, or shutting down the AqualinkD service for a while and then restarting.

I think I have tracked this down: it's actually a bug in some versions of the Jandy firmware. I'm just not sure how to fix it yet. Well, I can't fix it, but I can potentially modify AqualinkD so it doesn't trigger the bug in the firmware.
This ONLY happens when you have a SWG, Pentair VSP, and multiple One touch or iAqualink connections.
You don't even need AqualinkD to cause the problem, but since AqualinkD uses OneTouch or iAqualink connections for VSP information, that's why you are seeing it, and why turning off extended device programming & extended device id will fix the issue. (You could remove your Jandy OneTouch or iAqualink remotes, and that should also fix the issue.)

It's actually the Jandy control panel firmware's communication with the Pentair VSP that causes the problem. The control panel expects to see information from the VSP every ~300 messages; if it doesn't, it thinks the VSP is offline and turns off the SWG (to protect it). The problem is that with multiple OneTouch / iAqualink connections, which are very noisy, it's sometimes ~500 messages between VSP messages.
What's even more interesting is it's the control panel that sends the heartbeat message to the VSP, so it's actually a bug in the firmware as it could send that more frequently, and actually does if it's a Jandy VSP. That's why it's limited to a specific setup of SWG & Pentair VSP.

Sounds like a potential solution is getting closer. One thing to add, the ONLY control device that I have active on the bus is the AqualinkD. I keep my iAqualink powered down (toggle switch on the power + wire) all the time as I had some indicators that it may be more reliable without it. (Although it was not consistent and would work with both on the bus sometimes).

I think the bug is a bit more nuanced than just the number of devices. I have noticed that once I get aqualinkD started and stable, the system will run indefinitely without a single bounce until I restart aqualinkD or the host computer. Seems like some sort of timing on initialization or maybe the RS is caching some sort of information that eventually clears.
Also, one of the additional symptoms is the reported RPM and Watts from the RS goes haywire when SWG problem manifests.
Could a workaround be AqualinkD sending a heartbeat to the Pentair pump, which the RS will detect?

@christiaanbrand , I did some more testing this weekend. My previous set up was having my SWG (Hayward Goldline/Aquarite) behind the pump relay powered by the pump circuit breaker while the pump itself is directly connected to the breaker. The SWG would power on/off when the Jandy control triggered the relay to shut off the pump (although the pump controller remained powered).
Looking at both the SWG and Pentair VSF manuals, when controlled via automation, it is documented that both the SWG and the Pentair connect directly to the main and let the automation control them both. Not having other ideas, I re-wired the power to the SWG so that the entire unit does not power down.

I ran some tests and although the first time I started getting bounces every couple of minutes, flipping from AqualinkTouch to OneTouch (0x30 -> 0x40) and restarting stabilized it. I have noticed that sometimes switching between extended programming protocols and restarting AqualinkD can help, although its success is not consistent.

I have also noticed that sometimes powering on the breakers in order can help stabilize it. For example, power on the Jandy breaker, let it go through its boot sequence (which takes about 15s), and then power on the pump/SWG.

My SWG is always powered on, so I kinda already had the setup you described. I've been having success by just running without extended mode now for a couple of months. That works fine.

@christiaanbrand @niharmehta

I have just updated AqualinkD to version 2.3.3 and think this may have been fixed. If you pull the latest version and add the line below to aqualinkd.conf, see if that helps.
rs485_frame_delay = 4
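
The thread does not spell out what rs485_frame_delay does internally. Assuming it is a minimum pause, in milliseconds, between frames that AqualinkD writes to the bus, the idea can be sketched in Python as below. `FramePacer` is a hypothetical class for illustration only and is not how AqualinkD (which is written in C) implements it.

```python
import time


class FramePacer:
    """Enforce a minimum pause between outgoing frames (illustrative only)."""

    def __init__(self, delay_ms: float, clock=time.monotonic, sleep=time.sleep):
        self.delay_s = delay_ms / 1000.0  # assumption: the unit is milliseconds
        self.clock = clock
        self.sleep = sleep
        self.last_sent = None  # no frame sent yet

    def wait_turn(self) -> float:
        """Sleep if the last frame was too recent; return seconds slept."""
        slept = 0.0
        if self.last_sent is not None:
            remaining = self.delay_s - (self.clock() - self.last_sent)
            if remaining > 0:
                self.sleep(remaining)
                slept = remaining
        self.last_sent = self.clock()
        return slept


pacer = FramePacer(delay_ms=4)
for frame in (b"first", b"second"):
    pacer.wait_turn()
    # The write of `frame` to the serial port would go here.
```

Spacing frames out this way trades a little latency for a quieter bus, which would fit the theory that the panel needs room to reach the pump in time.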

Thanks. Will give this a shot and report back.

Apologies for the delay. I think that fixed the issue for me :) I haven't seen it re-occur since I set that new value.

I'd suggest you set that as the default going forward maybe?

I guess I still have a similar issue with my SECOND pentair pump. Only when running aqualinkd, the pump will go offline the whole time. Whether I’m running in extended mode or not, or whether I have the frame_delay specified or not. This only applies to the second pump. Pump 1 is now rock solid when specifying frame_delay.

Have you played with the default setting of 4? If not, can you try increasing it by steps of 4 until the problem goes away, or you hit ~30.
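
For reference, stepping from 4 in increments of 4 up to roughly 30 gives the following values to try. This is a trivial Python sketch; `frame_delay_candidates` is a hypothetical helper name.

```python
def frame_delay_candidates(start: int = 4, step: int = 4, limit: int = 30) -> list:
    """List the rs485_frame_delay values to try, smallest first."""
    return list(range(start, limit + 1, step))


print(frame_delay_candidates())  # [4, 8, 12, 16, 20, 24, 28]
```

Restart AqualinkD after each change and give the pump time to show whether it still drops offline before moving to the next value.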