[Optimization] Look into if it's worth switching some/most of watchdog to use ec2 networking metrics
Cameronsplaze opened this issue · 3 comments
Is your feature request related to a problem? Please describe.
Right now, the watchdog works by running SSM commands on the host w/ lambda, and pushing the results to custom metrics. If you look in the console, there are already EC2 metrics you can use and filter by autoscaling group. (CloudWatch Metrics => EC2 => By Auto Scaling Group => <ASG Name>. Then check NetworkIn and NetworkOut. We can add these two together w/ metric math too).
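As a rough illustration of the metric math idea, here is a minimal sketch that builds the query structure CloudWatch's GetMetricData API expects, summing NetworkIn and NetworkOut for one Auto Scaling Group. The function name, the query IDs (`net_in`, `net_out`, `net_total`), the label, and the 5-minute default period are assumptions for illustration, not anything taken from this repo.

```python
def build_network_total_queries(asg_name: str, period_seconds: int = 300) -> list:
    """Build MetricDataQueries that add NetworkIn and NetworkOut for one ASG.

    Hypothetical helper: names and defaults are illustrative only.
    """

    def ec2_metric(query_id: str, metric_name: str) -> dict:
        # One raw EC2 metric, filtered by the Auto Scaling Group dimension.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "AutoScalingGroupName", "Value": asg_name}
                    ],
                },
                "Period": period_seconds,
                "Stat": "Sum",
            },
            # Only the combined expression is returned, not the raw inputs.
            "ReturnData": False,
        }

    return [
        ec2_metric("net_in", "NetworkIn"),
        ec2_metric("net_out", "NetworkOut"),
        {
            "Id": "net_total",
            "Expression": "net_in + net_out",  # metric math: sum both directions
            "Label": "NetworkTotal",
            "ReturnData": True,
        },
    ]


queries = build_network_total_queries("my-asg-name")  # placeholder ASG name
```

The resulting list could be passed as `MetricDataQueries` to boto3's `cloudwatch.get_metric_data(...)` along with `StartTime` and `EndTime`; that call is left out here so the sketch runs without AWS credentials.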
I don't think we can get rid of the lambda, so is it worth it? (I don't want the container to spin down if someone is ssh'd in, but I'd be surprised if ssh showed much traffic at all. Worth looking into though.)
- If we can get rid of the lambda, then this simplifies the architecture and it's easier to justify. And maybe if they're ssh'd in but not doing anything (even sftp would spike the traffic), then spinning down is okay?
- If we CAN'T get rid of the lambda, there's still an argument for making it simpler and moving out the non-ssh checks.
To get further, I need to spin up minecraft/valheim instances, and see what the metrics look like both with/without players connected. Test the following:
- Normal connection w/ Minecraft/Valheim. (Both w/ players, then idle)
- Check trying to connect to a container, no players. Is there enough traffic to keep the container up? (Maybe the container is done updating, so no network traffic on its end, but is unpacking/installing the update. You don't want it to go down in the middle of this).
- Same, but use hotspot for bad connection. (How much is packet count affected?)
- Test idle SSH.
- Test SSH, with SFTP.
- Test SSH, just doing ls/cd/cat/etc in home, then inside EFS mount.
  - Is EFS used in this metric? Is S3 with #10?
Describe the solution you'd like
If this makes the architecture simpler/cheaper, do it.
Describe alternatives you've considered
The current way the architecture is now.
Additional context
N/A
For Minecraft, metric info:
- SSH'd into Host, doing nothing, sum of traffic over 5 minutes:
  - Traffic out: 711k
  - Traffic in: 247k
  - Total: 958k
  - Packets out: 502
  - Packets in: 449
  - Total: 951
For Valheim, it's a lot more unstable, so these are the lowest values:
- SSH'd into Host, doing nothing, sum of traffic over 5 minutes:
  - Traffic out: 143k
  - Traffic in: 56k
  - Total: 200k
  - Packets out: 379
  - Packets in: 309
  - Total: 688
Note, if this DOESN'T change, look at the Watchdog Errors (ContainerManager-*-Stack) alarm. You can't do 3 in a row; the container will reset and push one green status on re-try.
Decided traffic for SSH isn't worth it. You can just connect to the container the "normal" way at the same time. The architecture becomes MUCH simpler (removing lambda stuff), and more flexible (can do tcp and udp at same time out of the box) by making this switch.
The trick is to only look at traffic going INTO the container. If you add IN and OUT traffic together, it's too noisy. If the container sends metrics somewhere, for example, it can do that whenever, and trip the OUT threshold. By only watching IN, you only see people connecting, or the container downloading (updating) something. In either case, you don't want to shut down.
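To make the inbound-only idea concrete, here is a minimal sketch of an idle check that looks only at NetworkIn sums and ignores outbound traffic entirely. The function name, the threshold, and the number of periods are all hypothetical; real values would have to come from measurements like the ones above.

```python
def is_idle(network_in_sums: list, threshold_bytes: float, periods_required: int) -> bool:
    """Return True if the latest `periods_required` NetworkIn sums are all below the threshold.

    Hypothetical helper. `network_in_sums` is ordered oldest to newest, one
    value per metric period. Outbound traffic is deliberately not an input.
    """
    if periods_required < 1:
        raise ValueError("periods_required must be at least 1")
    if len(network_in_sums) < periods_required:
        # Not enough data yet, so assume the container is still in use.
        return False
    recent = network_in_sums[-periods_required:]
    return all(value < threshold_bytes for value in recent)


# Placeholder numbers, not measured values:
# an early burst of inbound traffic followed by two quiet periods.
samples = [500_000, 1_000, 2_000]
quiet = is_idle(samples, threshold_bytes=10_000, periods_required=2)
```

Requiring several quiet periods in a row, rather than one, is an assumption meant to avoid shutting down during a short lull such as an update that has finished downloading but is still installing.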