uselagoon/lagoon-ssh-portal

Unable to SSH into specified service

rocketeerbkw opened this issue · 3 comments

Attempting to SSH into a non-cli service for an environment throws an error. Tracing the logs, the error is `"error":"couldn't get executor: couldn't scale deployment: context deadline exceeded"`.

I initially assumed the service was just taking a while to start, but this isn't the case: the error is still thrown when the service has already been up and ready for 40 minutes. The ssh-portal connection fails even while ssh-core works fine.

After some indeterminate amount of time the ssh-portal appears to start working again, so the issue is intermittent.

The command was `lagoon ssh -p <project-name> -e <env name> -s <non-cli service>`. The error message for that command is what told me it was using ssh-portal instead of ssh-core.

We did some digging and believe the issue may occur when cronjob pods of the requested service still exist or are in an error state, but we can't confirm this in the codebase.

  • Tried to SSH into a specific service and received the error above
  • Checked k8s and saw that the service's pod was running and the deployment replicas were valid
  • Tried SSH again and still received the error
  • Checked the pod list again and noticed a failed cronjob pod of the same service had been started (see the kubectl sketch after this list)
    • pod status was Failed
  • Removed the failed cronjob pod
  • Tried SSH into the service again, this time successfully
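
For anyone debugging the same symptom, one way to spot this state is to list pods by the Deployment's label selector and look for leftovers that aren't Running. The namespace, label key, and pod name below are hypothetical placeholders, not Lagoon's actual labels:

```sh
# List every pod matching the service's Deployment selector; a leftover
# cronjob pod with overlapping labels will show up here in phase Failed.
kubectl -n <namespace> get pods -l app=<service> --show-labels

# Remove the failed cronjob pod so only the Deployment's pods match.
kubectl -n <namespace> delete pod <failed-cronjob-pod>
```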

Based on the errors, we believe the [ensureScaled](https://github.com/uselagoon/lagoon-ssh-portal/blob/6d517c03f8d1ad5dad39112ab83ec7cdecfca152/internal/k8s/exec.go#L121-L141) function is likely where this error comes from, but we don't see how it could be triggered by a cronjob pod of the same service merely existing, since that pod isn't associated with the deployment.
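As a guess at the mechanism (a minimal sketch, not the actual ensureScaled code; the function and names below are hypothetical): if the wait step after scaling lists pods by the Deployment's label selector and blocks until every match is Running, then a Failed cronjob pod carrying the same labels would prevent the loop from ever succeeding, and it would exit only when the context deadline fires, with exactly the error seen in the logs.

```go
package k8ssketch

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForDeploymentPods is a hypothetical stand-in for the wait step inside
// ensureScaled. It lists pods by the Deployment's label selector and blocks
// until every matching pod is Running. A Failed pod left behind by a CronJob
// with overlapping labels never reaches Running, so the loop only exits when
// the context deadline expires, surfacing as
// "couldn't scale deployment: context deadline exceeded".
func waitForDeploymentPods(ctx context.Context, c kubernetes.Interface,
	namespace, selector string) error {
	for {
		pods, err := c.CoreV1().Pods(namespace).List(ctx,
			metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return fmt.Errorf("couldn't list pods: %v", err)
		}
		ready := len(pods.Items) > 0
		for _, p := range pods.Items {
			// A leftover cronjob pod in phase Failed matches the selector
			// and trips this check on every iteration.
			if p.Status.Phase != corev1.PodRunning {
				ready = false
			}
		}
		if ready {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("couldn't scale deployment: %v", ctx.Err())
		case <-time.After(time.Second):
		}
	}
}
```

Under that assumption, the cronjob pod doesn't need to be "associated" with the Deployment at all; sharing its labels is enough to be swept up by the selector-based pod listing.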

smlx commented

As discussed elsewhere: this was caused by a CronJob controller creating pods that matched the selector of the Deployment. Those pods were crashing, which confused ssh-portal about the state of the Deployment.

This is generally not a supported configuration in k8s, as per the docs, so the fix is to avoid creating CronJobs whose pod labels overlap with a Deployment's selector.
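
For illustration (the names and labels are hypothetical, not Lagoon's actual manifests): if the Deployment selects on `app: myservice`, the CronJob's pod template should carry a distinct label set so its crashed pods can never be mistaken for the Deployment's pods.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: myservice-cron          # hypothetical name
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            # Distinct from the Deployment's selector (e.g. "app: myservice"),
            # so these pods never match selector-based pod listings.
            app: myservice-cron
        spec:
          restartPolicy: Never
          containers:
            - name: task
              image: myservice:latest
```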