nasa/opera-sds-pcm

[New Feature]: Auto-restart of all PCM services needed for OPS

Opened this issue · 2 comments

Checked for duplicates

Yes - I've already checked

Alternatives considered

Yes - and alternatives don't suffice

Related problems

We encountered some downtime a few weeks ago. One observation was that although the AWS EC2 instances auto-restarted, the full stack of PCM services on machines like GRQ did not. This led to more than 10 hours of downtime before personnel detected the issue.

Describe the feature request

We should ensure that all PCM services essential for daily operations restart automatically upon any VM reboot or process exit (up to a maximum number of times).

Suggestions on implementation:

  1. Identify all essential PCM services needed for OPS by consulting OPS and PCM teams
  2. Ensure all services are wrapped as systemd services
  3. Enable auto-restart policies as needed
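
To make steps 2 and 3 concrete, here is a minimal sketch of what a systemd unit with a bounded auto-restart policy could look like, generated from Python so it can be templated per service. The service name `pcm-example` and the `ExecStart` path are hypothetical placeholders, not actual PCM services, and the restart limits are illustrative defaults that OPS and PCM would need to agree on. The directives themselves (`Restart=`, `RestartSec=`, `StartLimitBurst=`, `StartLimitIntervalSec=`, `WantedBy=`) are standard systemd options.

```python
def render_unit(name, exec_start, max_starts=5, window_sec=600, restart_sec=10):
    """Render a systemd unit file with a bounded auto-restart policy.

    All argument values are assumptions for illustration; real service
    names, commands, and limits must come from the OPS and PCM teams.
    """
    lines = [
        "[Unit]",
        f"Description={name} (hypothetical PCM service)",
        "After=network-online.target",
        "Wants=network-online.target",
        # Rate limit: allow at most max_starts starts within window_sec
        # seconds; once exceeded, systemd stops trying to restart the unit.
        f"StartLimitIntervalSec={window_sec}",
        f"StartLimitBurst={max_starts}",
        "",
        "[Service]",
        f"ExecStart={exec_start}",
        # Restart on unclean exit (non-zero exit code, signal, timeout).
        "Restart=on-failure",
        # Wait between restart attempts to avoid a tight crash loop.
        f"RestartSec={restart_sec}",
        "",
        "[Install]",
        # Starts the unit at boot once it is enabled via `systemctl enable`.
        "WantedBy=multi-user.target",
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical service name and command, for illustration only.
    print(render_unit("pcm-example", "/opt/pcm/bin/example-service"))
```

The rendered text would typically be saved as `/etc/systemd/system/<name>.service`, followed by `systemctl daemon-reload` and `systemctl enable --now <name>`. `Restart=on-failure` only restarts on unclean exits; `Restart=always` may be a better fit for services that should never stop, which is a choice to settle during triage.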

CC @hhlee445 - let’s triage this. I can help with step 1, and it would be great to work with PCM on steps 2 and 3.