[New Feature]: Auto-restart of all PCM services needed for OPS
Opened this issue · 2 comments
riverma commented
Checked for duplicates
Yes - I've already checked
Alternatives considered
Yes - and alternatives don't suffice
Related problems
We experienced downtime a few weeks ago. One observation was that although the AWS EC2 instances restarted automatically, the full stack of PCM services on machines such as GRQ did not. This led to more than 10 hours of downtime before personnel detected the issue.
Describe the feature request
We should ensure that all PCM services essential for daily operations restart automatically after any VM reboot or process exit (up to a maximum number of times).
riverma commented
Suggestions on implementation:
- Identify all essential PCM services needed for OPS by consulting OPS and PCM teams
- Ensure all services are wrapped as systemd services
- Enable auto-restart policies as needed
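
To make the last two suggestions concrete, here is a minimal sketch of what a unit file with a bounded auto-restart policy could look like, written as a small Python helper that renders the unit text. The service name `pcm-example`, the binary path, and the default limits are hypothetical placeholders, not actual PCM values. The systemd directives themselves (`Restart=`, `RestartSec=`, `StartLimitIntervalSec=`, `StartLimitBurst=`, `WantedBy=`) are standard, but the right settings for each service would still need to be agreed with the OPS and PCM teams.

```python
from textwrap import dedent


def render_unit(name: str, exec_start: str, max_starts: int = 5,
                window_sec: int = 600, restart_delay_sec: int = 10) -> str:
    """Render a systemd unit file with a bounded auto-restart policy.

    All arguments are illustrative assumptions; real values would come
    from the OPS and PCM teams.
    """
    return dedent(f"""\
        [Unit]
        Description={name} (hypothetical PCM service)
        After=network-online.target
        Wants=network-online.target
        # Stop trying once the unit has been started max_starts times
        # within window_sec seconds.
        StartLimitIntervalSec={window_sec}
        StartLimitBurst={max_starts}

        [Service]
        ExecStart={exec_start}
        # Restart whenever the process exits, cleanly or not.
        Restart=always
        RestartSec={restart_delay_sec}

        [Install]
        # Start at boot once the unit is enabled, which covers VM reboots.
        WantedBy=multi-user.target
        """)


# Hypothetical service name and binary path, for illustration only.
print(render_unit("pcm-example", "/usr/local/bin/pcm-example"))
```

The rendered text would typically be saved as `/etc/systemd/system/pcm-example.service`, followed by `systemctl daemon-reload` and `systemctl enable --now pcm-example.service`. `Restart=always` is used here because the request mentions any process exit; `Restart=on-failure` may be a better fit for services that are expected to exit cleanly.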