opensearch-project/opensearch-ci

Jenkins Multi Master support

jordarlu opened this issue · 11 comments

Is your feature request related to a problem? Please describe

The existing Jenkins CI infrastructure serves as the exclusive system for executing a diverse range of critical tasks, including Gradle checks for Pull Requests (PR), release processes, benchmark tests, and various other functions.

Recently, Jenkins performance has degraded on several occasions, possibly due to an escalating workload or other factors. This ultimately brought down the Jenkins master node and caused Jenkins service downtime. Details about the most recent incident and the steps taken to restore Jenkins to functionality can be found at opensearch-project/opensearch-build#4130.

We need a long-term solution that will be capable of handling the growing workload to prevent future instances of Jenkins failure.

Describe the solution you'd like

At a high level, the proposal is to split Jenkins into multiple Jenkins masters, each handling one set (category) of workloads and isolated from the other masters and their workloads.

Describe alternatives you've considered

In addition to the proposal mentioned above, we are open to any other proposals and ideas from the community to make Jenkins even better. Please feel free to comment and describe your suggestions.

Additional context

This issue serves as the main issue for implementing Jenkins Multi Master support.

As we progress, we will consistently add and update comments, discussions, designs, and relevant issues and PRs to keep track of all activities.

I still have a question about this, since each master will handle only a portion of the workflows.

Let's say the master for build workflows goes offline: will another master be able to pick up the workflow, or do we have to wait for the original one to come back online?

Thanks.

@jordarlu Thank you for taking this up. I am wondering if we can just add one more master node in the existing code with settings similar to the existing one, except for name and labels, and then register it as a new target group under the existing load balancer. We would then route the traffic based on URL path, e.g., if it is ci.opensearch.org then route to the existing master, and if it is ci.opensearch.org/performance then route to the new master.
@gaiksaya @prudhvigodithi @peterzhuamazon
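
For illustration only, the path-based routing idea above could be sketched roughly like this. It is not the project's load balancer configuration: the target names `existing-master` and `performance-master` are hypothetical placeholders, and a real load balancer would evaluate equivalent rules itself. The sketch only shows the intended decision: a request whose path starts with /performance goes to the new master, everything else goes to the existing one.

```python
from urllib.parse import urlparse

# Hypothetical target names; the thread does not name the target groups.
DEFAULT_TARGET = "existing-master"
ROUTES = [
    ("/performance", "performance-master"),
]


def pick_target(url: str) -> str:
    """Return the target for a full URL (scheme included) by longest path-prefix match."""
    path = urlparse(url).path or "/"
    best_target = DEFAULT_TARGET
    best_len = -1
    for prefix, target in ROUTES:
        # Match the prefix itself or anything below it, but not "/performancex".
        matches = path == prefix or path.startswith(prefix + "/")
        if matches and len(prefix) > best_len:
            best_target, best_len = target, len(prefix)
    return best_target


if __name__ == "__main__":
    for url in (
        "https://ci.opensearch.org/",
        "https://ci.opensearch.org/performance",
        "https://ci.opensearch.org/performance/job/example",
    ):
        print(url, "->", pick_target(url))
```

Adding another category would then only mean adding another (prefix, target) pair to `ROUTES`, assuming each category gets its own path.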

I still have a question about this, since each master will handle only a portion of the workflows.

Let's say the master for build workflows goes offline: will another master be able to pick up the workflow, or do we have to wait for the original one to come back online?

Thanks.

Hopefully, once we split Jenkins so that each instance handles one category of jobs (for example, one Jenkins for 'build', another for 'gradle-check', and another for 'benchmark'), we won't face this master-down issue anymore (if the root cause of the master going down was the workload). But having HA on the master is a good point.

I still have a question about this, since each master will handle only a portion of the workflows.

Let's say the master for build workflows goes offline: will another master be able to pick up the workflow, or do we have to wait for the original one to come back online?

Thanks.

I would suggest keeping both masters mutually exclusive of each other and using them to distribute our jobs based on their functionality.

Hey!

Just wondering: did we research whether having 2 masters will cause split-brain issues? Some time back I read about this on a Jenkins forum. It is worth researching a bit and experimenting with a local setup before we move to implementation. AFAIK Jenkins is not supposed to have more than one master, but I might be wrong and the technology might have evolved since I last read about it, so please do confirm.

@jordarlu Thank you for taking this up. I am wondering if we can just add one more master node in the existing code with settings similar to the existing one, except for name and labels, and then register it as a new target group under the existing load balancer. We would then route the traffic based on URL path, e.g., if it is ci.opensearch.org then route to the existing master, and if it is ci.opensearch.org/performance then route to the new master. @gaiksaya @prudhvigodithi @peterzhuamazon

:) Wonderful! That is also the direction I understand we are moving toward. As the end result, we may end up having https://build.ci.opensearch.org/build/ for 'build', https://build.ci.opensearch.org/benchmark/ for 'benchmark', and https://build.ci.opensearch.org/gradlecheck/ for 'Gradle Check' (just to name a few as examples; we will certainly discuss how we want to categorize them). Thanks for the good suggestion.

Hey!

Just wondering: did we research whether having 2 masters will cause split-brain issues? Some time back I read about this on a Jenkins forum. It is worth researching a bit and experimenting with a local setup before we move to implementation. AFAIK Jenkins is not supposed to have more than one master, but I might be wrong and the technology might have evolved since I last read about it, so please do confirm.

Understood, thanks for bringing this up, @gaiksaya. Let me do more research on that. The original idea was to distribute the load across separate Jenkins masters (based on the assumption that last month's master downtimes were caused by the increasing workload) while continuing to use the same access FQDN. But if we can find a way to do HA on the master (without causing the issue you mentioned), I believe that will be even better. I appreciate you considering the possible downsides of HA and sharing your experience.

Jenkins does not support multi-master with active-active load distribution; I assume they offer some load balancing in the enterprise version (https://www.cloudbees.com/capabilities/continuous-integration). However, we have two options here.

  1. Active-passive Jenkins masters with HAProxy (load balancer) in front.
  2. Separate Jenkins masters for each set of builds.

I would go for option 2, as it has many advantages such as Jenkins job-level isolation, easy upgrades, a smaller blast radius, easier management, and more. https://welltempereddeveloper.com/ci/cd/2019/04/08/jenkins-ha-multiple-masters.html
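
To make option 1 a bit more concrete, here is a minimal sketch of the active-passive decision a load balancer would make. It is only an illustration under assumptions: the master names and the health-check callback are hypothetical, and in practice the load balancer (for example HAProxy with a server marked as `backup`) performs this check itself rather than custom code.

```python
from typing import Callable, Optional, Sequence


def pick_active(
    masters: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy master in priority order, or None if all are down.

    masters[0] is the active node; later entries are passive standbys that
    only receive traffic when every earlier entry fails its health check.
    """
    for name in masters:
        if is_healthy(name):
            return name
    return None


if __name__ == "__main__":
    # Hypothetical names and health states, for illustration only.
    health = {"jenkins-active": False, "jenkins-passive": True}
    print(pick_active(["jenkins-active", "jenkins-passive"], lambda n: health[n]))
```

Unlike active-active, only one master serves traffic at any time, which is why this setup sidesteps the load-distribution limitation mentioned above.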

Jenkins does not support multi-master with active-active load distribution; I assume they offer some load balancing in the enterprise version (https://www.cloudbees.com/capabilities/continuous-integration). However, we have two options here.

  1. Active-passive Jenkins masters with HAProxy (load balancer) in front.
  2. Separate Jenkins masters for each set of builds.

I would go for option 2, as it has many advantages such as Jenkins job-level isolation, easy upgrades, a smaller blast radius, easier management, and more. https://welltempereddeveloper.com/ci/cd/2019/04/08/jenkins-ha-multiple-masters.html

Thanks for the insight, @prudhvigodithi. Should we explore both of the options you mentioned above, since they do not interfere with each other? While we separate the Jenkins masters per category of workload, could we still have 'sort of' HA on each master to prevent a single point of failure?

Sure Jeff. Once the Jenkins masters are split, we can take that up as a new enhancement to add an active/passive mechanism. It should be easy since the underlying data store is EFS.

I am closing this issue because we are moving to creating multiple Jenkins instances instead of splitting the master node, which hopefully avoids confusion between the two. I will also create a new issue to track the multiple Jenkins instances feature.