uselagoon/lagoon-ssh-portal

Unable to SSH into specified service

rocketeerbkw opened this issue · 3 comments

Attempting to SSH into a non-cli service for an environment throws an error. Tracing the logs, the error is `"error":"couldn't get executor: couldn't scale deployment: context deadline exceeded"`.

I initially assumed the service was just taking a while to start, but this isn't the case: the error is still thrown when the service has already been up and ready for 40 minutes. The ssh-portal connection fails even while ssh-core works fine.

After some indeterminate amount of time the ssh-portal appears to start working again, so the issue is intermittent.

The command was `lagoon ssh -p <project-name> -e <env name> -s <non-cli service>`. The error message for that command is what told me it was using ssh-portal instead of ssh-core.

We did some digging and believe the issue may occur when cronjob pods of the requested service still exist or are in an error state, but we can't confirm this in the codebase.

  • Tried to SSH into a specific service and received the error above
  • Checked k8s and saw that the service's pod was running and the deployment replicas were valid
  • Tried SSH again and still received the error
  • Checked the pod list again and noticed a failed cronjob pod of the same service had been started (see the kubectl sketch after this list)
    • pod status was Failed
  • Removed the failed cronjob pod
  • Tried SSH into the service again, this time successfully
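
For anyone debugging the same symptom, one way to spot this state is to list pods by the Deployment's label selector and look for leftovers that aren't Running. The namespace, label key, and pod name below are hypothetical placeholders, not Lagoon's actual labels:

```sh
# List every pod matching the service's Deployment selector; a leftover
# cronjob pod with overlapping labels will show up here in phase Failed.
kubectl -n <namespace> get pods -l app=<service> --show-labels

# Remove the failed cronjob pod so only the Deployment's pods match.
kubectl -n <namespace> delete pod <failed-cronjob-pod>
```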

Based on the errors, we believe the [ensureScaled](https://github.com/uselagoon/lagoon-ssh-portal/blob/6d517c03f8d1ad5dad39112ab83ec7cdecfca152/internal/k8s/exec.go#L121-L141) function is likely where this error comes from, but we don't see how it could be triggered by a cronjob pod of the same service merely existing, since that pod isn't associated with the deployment.
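As a guess at the mechanism (a minimal sketch, not the actual ensureScaled code; the function and names below are hypothetical): if the wait step after scaling lists pods by the Deployment's label selector and blocks until every match is Running, then a Failed cronjob pod carrying the same labels would prevent the loop from ever succeeding, and it would exit only when the context deadline fires, with exactly the error seen in the logs.

```go
package k8ssketch

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// waitForDeploymentPods is a hypothetical stand-in for the wait step inside
// ensureScaled. It lists pods by the Deployment's label selector and blocks
// until every matching pod is Running. A Failed pod left behind by a CronJob
// with overlapping labels never reaches Running, so the loop only exits when
// the context deadline expires, surfacing as
// "couldn't scale deployment: context deadline exceeded".
func waitForDeploymentPods(ctx context.Context, c kubernetes.Interface,
	namespace, selector string) error {
	for {
		pods, err := c.CoreV1().Pods(namespace).List(ctx,
			metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return fmt.Errorf("couldn't list pods: %v", err)
		}
		ready := len(pods.Items) > 0
		for _, p := range pods.Items {
			// A leftover cronjob pod in phase Failed matches the selector
			// and trips this check on every iteration.
			if p.Status.Phase != corev1.PodRunning {
				ready = false
			}
		}
		if ready {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("couldn't scale deployment: %v", ctx.Err())
		case <-time.After(time.Second):
		}
	}
}
```

Under that assumption, the cronjob pod doesn't need to be "associated" with the Deployment at all; sharing its labels is enough to be swept up by the selector-based pod listing.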

smlx commented

As discussed elsewhere: this was caused by a CronJob controller creating pods that matched the selector of the Deployment. Those pods were crashing, which confused ssh-portal about the state of the Deployment.

This is generally not a supported configuration in k8s, as per the docs, so the fix is to avoid creating CronJobs whose pod labels overlap with a Deployment's selector.
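
For illustration (the names and labels are hypothetical, not Lagoon's actual manifests): if the Deployment selects on `app: myservice`, the CronJob's pod template should carry a distinct label set so its crashed pods can never be mistaken for the Deployment's pods.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: myservice-cron          # hypothetical name
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            # Distinct from the Deployment's selector (e.g. "app: myservice"),
            # so these pods never match selector-based pod listings.
            app: myservice-cron
        spec:
          restartPolicy: Never
          containers:
            - name: task
              image: myservice:latest
```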