openSUSE/cepces

Multiple enrollment servers not tried

realroywalker opened this issue · 4 comments

I'm setting up cepces against an MS PKI environment with 2 intermediate CAs, and have a 2 node failover cluster with a CEP instance and 2 CES instances (one for each intermediate CA).
Both the CES URLs are published to the msPKI-Enrollment-Servers attribute in AD.

This is all being done to ensure that CEPCES still functions if a CA is down or under maintenance. I have tested this, and I'm able to turn my initial intermediate CA off and everything still works from a Windows client (using CEP).

However, I've found that the cepces client on Linux (from this repo) only tries the first CES URL that gets returned from CEP.
I turned on debug logs for the cepces client using the logging.conf, and I can see that the responses from CEP contain details for both of my CES instances, but there doesn't appear to be any attempt to make use of the second one.

Is it possible to get cepces to retry the request using the next CES URL when an error is encountered? This would add much more resilience to the whole setup.

Could you share a redacted copy of your logs? It would help me pinpoint where to fix this.

Thanks for the quick response.
I have just spent some time putting together a small test setup that replicates what I'm doing, as it's hard to get logs from the main system I'm working on.

I've attached 2 log files (created with the level set to debug). One is a normal request which is serviced when both CAs are online/available (both-cas.log) - in this instance I get a certificate issued with no problem.
The second log is exactly the same request (just with a new ID), but in this case my first intermediate CA (SubCA01) has been taken offline. The second intermediate CA (SubCA02) is still operational and can service requests - however, it seems that only SubCA01 is tried.

If I interpret the log files correctly, the CEP server does appear to report that there are multiple CES services available when cepces requests the policy. I hope the log files help!

subca01-offline.log
both-cas.log

@dmulder do you require any further logs or anything else to assist with this issue? My dev environment is expiring soon, so I can run any tests before it does if that helps.

I can see that we're just picking the first valid endpoint and ignoring the rest (./cepces/core.py:190).
The debug output is a little confusing. I don't think cepces is parsing or handling the message response type (RequestSecurityTokenCertificateEnrollmentWSDetailFault), and perhaps is just ignoring it. So I think the issue is twofold: we don't recognize that something went wrong, and we don't fall back to the backup CA when something goes wrong.
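
To illustrate the two halves of the problem, here is a minimal sketch of what detecting a fault and falling back to the next endpoint could look like. It is not the cepces implementation: `EnrollmentFault`, `raise_on_fault`, `enroll_with_failover`, and the `send_request` callable are hypothetical names made up for this example, and it assumes the CES replies with a SOAP 1.2 envelope whose fault sits in a standard `Fault` element.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, List, Tuple

# Assumption: CES responses are SOAP 1.2 envelopes.
SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"


class EnrollmentFault(Exception):
    """Hypothetical error raised when a response carries a SOAP fault."""


def raise_on_fault(response_xml: str) -> str:
    """Return the response unchanged, or raise if it contains a SOAP Fault."""
    root = ET.fromstring(response_xml)
    fault = root.find(f".//{{{SOAP_NS}}}Fault")
    if fault is not None:
        text = fault.find(f".//{{{SOAP_NS}}}Text")
        reason = text.text if text is not None and text.text else "unknown fault"
        raise EnrollmentFault(reason)
    return response_xml


def enroll_with_failover(
    endpoints: Iterable[str],
    send_request: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each CES URL in order; return (url, response) for the first success.

    `send_request` is a stand-in for whatever actually posts the enrollment
    request to a URL and returns the raw response body.
    """
    errors: List[Tuple[str, Exception]] = []
    for url in endpoints:
        try:
            return url, raise_on_fault(send_request(url))
        except (EnrollmentFault, OSError) as exc:
            # Remember why this endpoint failed, then move on to the next one.
            errors.append((url, exc))
    details = "; ".join(f"{url}: {exc}" for url, exc in errors)
    raise RuntimeError(f"all enrollment endpoints failed ({details})")
```

The point of the sketch is only the control flow: a fault in the response is turned into an exception instead of being ignored, and that exception (or a connection error) moves the client on to the next CES URL rather than ending the attempt.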