Retrospective for October 2020 releases
adamfarley opened this issue · 37 comments
Topics for the retrospective should include:
- How rebuilds were required on docker platforms due to the introduction of the ActiveNodeTimeout feature the previous week.
- How the test jobs weren't launched within the parallel groovy runTests method (openjdk_build_pipeline.groovy) if the openjdk-jenkins-helper library wasn't loaded prior to the build (or, perhaps, at least outside the parallel code section).
- How the test jobs being unable to launch somehow didn't cause build failure.
- Why the Windows 64bit build here failed after complaining: "warning: failed to remove openj9/test/functional: Directory not empty". Raise issue or dismiss?
- Why were the nightly builds left running during the release (e.g. JDK15)? Could they be paused during release week until we're sure all of the pipelines are complete?
- Should all calls in the community be open? Is there scope for limited-access calls (beyond the TSC), such as 1-2-1 calls? Is it fair for these to occur when the call has been mentioned in advance, in a public channel?
Why were the nightly builds left running during the release (e.g. JDK15)? Could they be paused during release week until we're sure all of the pipelines are complete?
Isn't mentioned in https://github.com/AdoptOpenJDK/openjdk-build/blob/master/RELEASING.md as far as I can see. So we all forgot about it. We need checklists.
Why were the nightly builds left running during the release (e.g. JDK15)? Could they be paused during release week until we're sure all of the pipelines are complete?
Isn't mentioned in https://github.com/AdoptOpenJDK/openjdk-build/blob/master/RELEASING.md as far as I can see. So we all forgot about it. We need checklists.
Agreed. Or automation. Or an automated checklist. Let's discuss during the retrospective meeting.
We should only run the main 3 platforms first for both OpenJ9 and Hotspot then run pipelines for the secondary platforms.
We need to ensure we have enough hardware to cover a full release with weekly tests for all platforms.
Why were the nightly builds left running during the release (e.g. JDK15)? Could they be paused during release week until we're sure all of the pipelines are complete?
Isn't mentioned in https://github.com/AdoptOpenJDK/openjdk-build/blob/master/RELEASING.md as far as I can see. So we all forgot about it. We need checklists.
I switched off /testing/ of the nightlies via the default checkboxes in the openjdkxx-pipeline jobs (since that's the bit that's generally disruptive) but it seemed to get re-enabled somehow, maybe by a pipeline regeneration? I queried what the best way to do it this time would be in https://adoptopenjdk.slack.com/archives/C09NW3L2J/p1602789550128600
The entire build can be stopped by adjusting the triggerSchedule in pipelines/jobs/configurations/jdk*.groovy, or, to switch off the tests, the lines in the jdk*_pipeline_config.groovy files need to be modified to have false in the test fields.
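For reference, a rough sketch of what those two edits might look like (the field names come from the comment above, but the exact file contents, schedule value, and config-map shape shown here are assumptions):

```groovy
// Sketch only - the exact contents of these files may differ.
// In pipelines/jobs/configurations/jdk11u.groovy, clearing the schedule
// stops the nightly trigger entirely (cron value shown is an invented example):
triggerSchedule = ''   // was e.g. '0 17 * * 1,3,5'; cleared during release week

// ...and in jdk11u_pipeline_config.groovy, setting the test field to false
// switches off test runs for a given platform entry (map shape is an assumption):
x64LinuxConfig = [
    os  : 'linux',
    arch: 'x64',
    test: false        // was a list of test targets, or true
]
```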
We need to ensure we have enough hardware to cover a full release with weekly tests for all platforms.
If we're going down that route we should implement a formal platform tier proposal (which could lead to interesting discussions, but I'm guessing you're thinking about x86 win/mac/linux as primaries for now?). Being devil's advocate, is there a specific problem you see that means those should be kicked off first? Obviously the others aren't competing for the same resources (unless we push the OpenJ9 XL ones out of "primary")
Retrospective item: I feel a lot of discussions over the last 18 hours seem to have happened outside the #release channel in slack. We need to make sure the current status of release-related activity is recorded in one place (including initiation of any calls) to make sure we're all up to date and pulling in the same direction.
Despite commenting out the default weekly map there were still instances in the jdk_pipeline_config.groovy files which stacked up weekly tests on platforms that didn't have enough hardware to support the run (e.g. Java 11 aarch64).
The default map is comprehensive and we should use that going forward and simply get our infra support up.
A secondary concern is that we should be more explicit that we're using a default weekly map in the jdk_pipeline_config.groovy - the naive engineer may get confused on seeing an empty map in most cases.
"Handover situations": If builds go on for several days for whatever reasons, it is not necessarily the case that the same person will be handling a given release. We need to make such handovers easier, rather than trying to figure out the state from numerous slack messages in various channels. A more focussed/managed release checklist with status? (@smlambert I know you've mentioned this previously)
re: #181 (comment) - yes @andrew-m-leonard, see #178 for a WIP checklist that is intended to make it more obvious what has already occurred and by whom.
Issue: Job generation doesn't appear to be reliably thread-safe, especially the concurrent test job generation we do at the end of a build.
Evidence: Groovy's struggle to load the same library in multiple concurrent threads (runTests() in openjdk_build_pipeline.groovy), and the non-fatal "No suitable checks publisher found" issue that springs up in many test runs [(Slack thread)](https://adoptopenjdk.slack.com/archives/CLCFNV2JG/p1603464619103400)
Potential solution: If there's a way to launch jobs in a non-blocking way, we could loop over the job-generation step for each test job we want to run after a build (in a single thread), and then "check" for job results in a second loop. Once we have "results" for each test job we generated, the second loop breaks out and we continue.
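A minimal sketch of that two-loop pattern, where launchJob() and pollResult() are hypothetical stand-ins for whatever non-blocking launch and status-check mechanism is available (testJobs is likewise an assumed list of job names):

```groovy
// Illustrative sketch only: launchJob() and pollResult() are hypothetical
// helpers, not real Jenkins API calls.
def pending = [:]
testJobs.each { jobName ->
    pending[jobName] = launchJob(jobName)   // serial generation; returns a handle immediately
}

def results = [:]
while (results.size() < pending.size()) {   // second loop: poll until every job has a result
    pending.each { jobName, handle ->
        if (!results.containsKey(jobName)) {
            def r = pollResult(handle)      // assumed to return null while the job is running
            if (r != null) {
                results[jobName] = r
            }
        }
    }
    sleep(30)                               // avoid a busy-wait between polls
}
```

The key point is that all job generation happens in a single thread, so the (apparently non-thread-safe) generation step is never entered concurrently.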
re: #181 (comment) - what are you intending to solve? Is it meant to address the question: How the test jobs being unable to launch somehow didn't cause build failure ?
If so, perhaps some background:
- we originally reported and failed build pipelines on child job failures (including test jobs launched from main pipeline)
- build pipelines were refactored/rewritten
- since we can never get through a build pipeline without something failing (thus focus on fixing issues found in triage), we made a conscious choice to ignore test failures to allow build pipeline to complete.
- if we want to change the earlier conscious decision, then we could choose to do something other than simply print the failure
But maybe I misunderstand what your comment is targeting...
Why the Windows 64bit build here failed after complaining: "warning: failed to remove openj9/test/functional: Directory not empty".
This is a known, long-standing, problematic issue that appears to have triggered the raising of many infra issues in the past, where Jenkins jobs are unable to clean out the previous workspace (or their own at the end of their run) and other jobs fail with AccessDeniedExceptions.
All of these issues relate to the same core issue:
adoptium/infrastructure#1573
adoptium/infrastructure#1396
adoptium/infrastructure#1527
adoptium/infrastructure#1419
adoptium/infrastructure#1410
adoptium/infrastructure#1394
adoptium/infrastructure#1379
adoptium/infrastructure#1376
adoptium/infrastructure#1339
adoptium/infrastructure#1328
adoptium/infrastructure#1310
adoptium/infrastructure#1086
adoptium/infrastructure#962
adoptium/infrastructure#810
adoptium/infrastructure#784
adoptium/infrastructure#736
adoptium/infrastructure#706
adoptium/infrastructure#477
adoptium/infrastructure#417
adoptium/infrastructure#23
We should find a way to address the issue with more than the temporary approach of rebooting a machine to clear out old workspaces, as we will continually be plagued by it until a more proactive solution is applied.
re: #181 (comment) - what are you intending to solve?
The problems in "Evidence", which could perhaps be renamed to "Symptoms". Now we have two issues that could be traced back to us trying to use concurrency and build generation together. I was spitballing a simplistic way for us to achieve multiple concurrent jobs, while generating them in a serial manner (possibly avoiding the non-thread-safe(?) build generation).
The test jobs failing to run is a symptom. The fact that their failure didn't cause the build to fall over is either a non-issue or a separate issue.
re: #181 (comment) - We should find a way to address the issue with more than the temporary approach of rebooting a machine to clear out old workspaces, as we will continually be plagued by it until a more proactive solution is applied.
Seems reasonable. I recall a while back there was a discussion over nuking the workspace at the start of every run, by default. Do you remember why we opted not to?
Can you find the discussion and indicate what is meant by 'nuking'? There appears to be a great many comments in the open/closed infra issues listed above.
From a test pipeline perspective, the best nuking we could do means calling cleanWs(). We used to do so at the start of each test run.
Then, due to pressure not to take up space on machines, we moved it to the end of every run, adoptium/aqa-tests#314.
We could call cleanWs() both at start and end of each run (taking a small hit on adding some execution minutes), but the core issue is that the cleanWs() call sometimes fails to work when run on Windows machines no matter when or how frequently you call it.
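For illustration, calling the Workspace Cleanup plugin's cleanWs() step at both ends of a run might look like the sketch below (the node label and the body are placeholders, not the actual pipeline code):

```groovy
// Sketch only: clean the workspace before and after the run.
// Assumes the Jenkins Workspace Cleanup plugin (which provides cleanWs) is installed.
node('windows') {
    cleanWs()              // clear anything left over from a previous run
    try {
        // ... checkout sources and run the tests ...
    } finally {
        cleanWs()          // clear our own workspace, even if the run failed
    }
}
```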
All of this is perhaps a non-issue if we spin up fresh machines on the fly, but we are not really there (and not sure if that is in our infra goals or not).
Shenandoah was not enabled for the JDK11u release - fix in adoptium/temurin-build#2177
Seems reasonable. I recall a while back there was a discussion over nuking the workspace at the start of every run, by default. Do you remember why we opted not to?
If the directories are somehow locked in a way that it cannot be deleted, that won't achieve anything.
We should find a way to address the issue with more than the temporary approach of rebooting a machine to clear out old workspaces, as we will continually be plagued by it until a more proactive solution is applied.
I think @Willsparker has dealt with more of these recently (so may have an idea of how to fix properly, but let's go into that in a separate infra issue) and has been able to diagnose some of the locked workspaces, but I agree it is probably our most common recurring issue and we need to understand and resolve it, and try to write some automated mitigation going forward.
apt installers for 8u272 suffers a gap in update time which affects end users - infra#1647
Getting into the realm of solutions here already, but something I've been doing in the infrastructure repo and I think we should roll out to at least the build repo.
- Wherever possible the person who creates any PR should merge it (where they have the authority to do so)
- If they do not have authority, agree with someone with authority when it will go in
Both of these support the following:
- The person who created the PR is then responsible for making sure it has the desired effect, either by a full test build run, or paying good attention to the next nightly
I think this would make the merging process less error prone and avoid "fire and forget" PRs going in without being verified, which we seem to have had quite a lot of in recent months. I'm loath to add an extra "verify" step to the workflow but maybe we do need to say that something shouldn't be moved to "Done" until something has been confirmed.
Also I suggest we split out the "promote release" issues into HotSpot and OpenJ9 ones for various reasons:
- To ensure there is no confusion amongst the comments as to which one is being referred to
- To allow the issues to be closed (and if necessary re-opened) separately
- To ensure that TSC approvals to ship are more clear historically (We should formalise the format of such an approval)
FYI, adoptium/infrastructure#1573 is where I'm looking at the Windows workspace based issues. The main issue is leftover java.exe
processes stopping Jenkins from deleting workspaces.
Proposal: "Dry-run" Release
How about on the monday before the tuesday release, we do a "dry-run" Release run-through, without the obvious "Publish" at the end ?
re: #181 (comment) - Can you find the discussion and indicate what is meant by 'nuking'?
I think I meant just running cleanWs() at the start of each run, though as Stewart says:
re: #181 comment - If the directories are somehow locked in a way that it cannot be deleted, that won't achieve anything.
So perhaps one way forward is to run cleanWs() at the start and end of each run, as Shelley suggests, and to answer every instance of issues like "locked folder" with a fix in cleanWs() that makes it more effective.
comment #181 - Proposal: "Dry-run" Release
How about on the monday before the tuesday release, we do a "dry-run" Release run-through, without the obvious "Publish" at the end ?
Seems reasonable. We should also aim to cut down build/test/etc repo contributions during the "dry-run & release" period, so we can avoid new issues sneaking in after the dry-run but before the release.
How about on the monday before the tuesday release, we do a "dry-run" Release run-through, without the obvious "Publish" at the end ?
We could also do it as soon as we enforce build repo lockdown, which varies but is usually on the Thursday/Friday before release. That way nothing else should be going in. Of course it depends how quickly we think we can fix things if they are faulty :-)
Docker release of arm32v7 had not appeared by today (10th november) despite being shipped about 14 days ago
a high level suggestion from an observer: a possible mitigation for these kinds of issues would be to let Adopt build RC builds too. OpenJDK has RC builds at least a month before release. If Adopt built them too (as if they were a release), potential issues could be noticed much earlier and likely solved before release. This might cause a boring release week though ;)
@mbien Haha a boring release week sounds like bliss! I think we will end up doing some sort of pre-release trial. A month is possibly a bit too far for us because a lot can happen in the month before GA when we're on a three month release cycle (and it's quite rare that a code issue from openjdk trips us up) :-)
Thanks for the input
Issue with macos packaging missing JREs: AdoptOpenJDK/homebrew-openjdk#495 (comment)
Multiple issues relating to the 11.0.9.1+1 version string which we had to address both in the build repository and the API:
- Issue: No group with name pre
- build PR (AL): Add version support for
- build PR (AF): Pass patch number into installer job
- build PR (AL): Compute semver from openjdk patch and build
- build PR (AL): Allow for null patch for sorting
- install PR (AF): Allow five version string components
- api PR (JO): Semver incompatibility fix
- api PR (JO): Possible solution to java patch issue
Summary of everything above (a.k.a. an easy-to-use agenda for the meeting to be held on Monday at 1400 GMT/UTC). The initials of the person who raised it in the conversations above are in []
One-off things (likely don't need much discussion)
- [AF] ActiveNodeTimeout introduction caused docker-based builds to fail
- [AF] Test jobs sequencing issue with openjdk-jenkins-helper not being loaded
- [AF] Directory not empty issues on Windows
Issues:
- [**] Nightly builds were not fully stopped during release (SXA to document based on slack discussion)
- [MV] Weekly runs should also be disabled during the release
- [SA] Discussions on releases happening outside #release makes it hard to keep track of release activity
- [AF] Job generation not thread safe. "No suitable checks publisher found" warnings.
- [SA] apt installers for 8u272 suffer a gap in update time which affects end users (doc so releasers are aware?)
- [SA] Shenandoah not enabled for the JDK11u release
- [SA] When things are merged, ensure someone is responsible for verification to avoid breakage
- [SA] Docker arm32 release took a long time to get published
- [SA] Macos packaging missing JREs
- [SA] Build repo lockdown had some "leakage" which broke Solaris/SPARC (& others?)
- [SA] 11.0.9.1+1: Various issues with the patch number in build and API
- [SA] Releasing doc missing info on updating tags
Questions:
- [AF] Should all calls in the community be open? Is there scope for limited-access calls (beyond the TSC)
- [MV] Should we prioritise platforms e.g. run Windows/x64, Linux/x64, Macos/x64 pipelines first?
- [AL] How can we make handovers easy from one team member to another during releasing if required?
- [SA] Should we have separate HotSpot/OpenJ9 release issues to avoid platform confusion and allow closing each separately?
- [AF] Should we have a "dry-run" release without publish e.g. on the Monday before OpenJDK's release date
- [MB] Should we do "RC" builds as upstream OpenJDK does
- [AA] Can we have a better visible release status like https://gist.github.com/aahlenst/bbb8ca9c87353e0c8928633961047340? With all the different branches/release dates (think ARM on 8), it's super hard to track.
References:
Meeting Results:
(Note: See the next comment for a concise list of Actions)
One-off things (likely don't need much discussion)
- [AF] ActiveNodeTimeout introduction caused docker-based builds to fail
  To be addressed with discussion over “release warmup” later.
- [AF] Test jobs sequencing issue with openjdk-jenkins-helper not being loaded
  Ditto.
- [AF] Directory not empty issues on Windows
  Already fixed, and job modified to clear up after itself, so shouldn’t happen again.
- [AF] Job generation not thread safe. "No suitable checks publisher found" warnings.
  AF to raise build issue to resolve.
Issues:
- [**] Nightly builds were not fully stopped during release (SXA to document based on slack discussion)
  Skipped as per note.
- [MV] Weekly runs should also be disabled during the release
  Not enabled yet. Will be discussed later on.
- [SA] Discussions on releases happening outside #release makes it hard to keep track of release activity
  Summaries and key links to be placed in #release. Discuss in #build or #test if you want, but be sure to link threads, URLs, etc.
- [SA] apt installers for 8u272 suffer a gap in update time which affects end users (doc so releasers are aware?)
  George & Stewart volunteer to write an issue to develop documentation.
- [SA] Shenandoah not enabled for the JDK11u release
  Adding shenandoahTest as part of a set of smoke tests (adoptium/aqa-tests#2067)
- [SA] When things are merged, ensure someone is responsible for verification to avoid breakage
  Yes. Improvements to the PR tester pending, to ensure this happens automatically for some PRs.
- [SA] Docker arm32 release took a long time to get published
  Release tagging in aarch src repos can be slow, and this slows the release. Dino recommends having a script to tag these repos automatically. Bharath is said to be working on this.
- [SA] Macos packaging missing JREs
  This is the one George is working on. Installer repo PR tester has been updated to detect this in the future.
- [SA] Build repo lockdown had some "leakage" which broke Solaris/SPARC (& others?)
  Stewart to announce this on Slack.
- [SA] 11.0.9.1+1: Various issues with the patch number in build and API, Windows installer version numbers and sorting
  Andrew to make an issue to discuss a solution to this.
- [SA] Releasing doc missing info on updating tags
  An issue will be raised to resolve this.
Questions:
- [AF] Should all calls in the community be open? Is there scope for limited-access calls (beyond the TSC)
  If possible, yes. Advise starting all calls in public channels. If that doesn’t happen, then at least post a summary in public channels.
- [MV] Should we prioritise platforms e.g. run Windows/x64, Linux/x64, Macos/x64 pipelines first?
  Proposal to separate top-level build pipeline runs per major release into “important platforms” and “other platforms” (one top-level execution each). Issue to be raised to discuss this proposal.
  Related: #186 and adoptium/aqa-tests#2037
  Stewart will raise an issue for this discussion in the build repo.
- [AL] How can we make handovers easy from one team member to another during releasing if required?
  Reducing the number of manual steps will make handover easier (#178)
- [SA] Should we have separate HotSpot/OpenJ9 release issues to avoid platform confusion and allow closing each separately?
  Yes. George volunteers to create a PR to effect these changes to the template.
- [AF] Should we have a "dry-run" without publish e.g. on the Monday before OpenJDK's release date?
  Probably. AF to raise an issue to debate this. Wide-ranging, so TSC repo. To include identifying scope of variable control (e.g. code freezes for repos), identifying buy-in, etc.
- [MB] Should we do "RC" builds as upstream OpenJDK does?
  Ditto.
- [AA] Can we have a better visible release status like https://gist.github.com/aahlenst/bbb8ca9c87353e0c8928633961047340? With all the different branches/release dates (think ARM on 8), it's super hard to track.
  Yes. George volunteers to host a call to discuss this further.
- [SL] Is there a benefit to creating an actual release checklist? Are there release steps that we can automate? Sample Release Checklist to improve Release Automation
  This was determined to be covered by other items earlier in the retrospective.
References:
Releasing document in the build repository
WIP release checklist document based on the releasing doc
Actions list:
Adam Farley:
- Raise build issue to resolve the "test job generation not thread safe" issue. "No suitable checks publisher found" warnings. Issue raised.
- Raise TSC issue to debate "dry-run without publish" (e.g. on the Monday before OpenJDK's release date)?
To include identifying scope of variable control (e.g. code freezes for repos), making the change to the release docs, methods, etc. Issue raised.
George & Stewart
- Raise issue to develop documentation for this problem: apt installers for 8u272 suffer a gap in update time which affects end users.
George Adams
- Create PR to effect this change to a template: Separate HotSpot/OpenJ9 release issues.
- Host a call to discuss having a better visible release status like https://gist.github.com/aahlenst/bbb8ca9c87353e0c8928633961047340
(With all the different branches/release dates (think ARM on 8), it's super hard to track.)
Stewart Addison
- Make Slack announcement regarding this: Build repo lockdown had some "leakage" which broke Solaris/SPARC (& others?)
- Raise build issue to discuss platform prioritization (e.g. run Windows/x64, Linux/x64, Macos/x64 pipelines first).
Proposal to separate top-level build pipeline runs per major release into “important platforms” and “other platforms” (one top-level execution each).
Related: #186 and adoptium/aqa-tests#2037
Andrew Leonard
- Raise issue to discuss solution for various issues with the patch number in build and API Windows installer version numbers and sorting (re: 11.0.9.1+1).
Since the actions for this will be chased independently by their respective owners, this issue will now be closed.
Thank you everyone for participating.
@adamfarley I feel this probably shouldn't be closed until we have issues covering them, otherwise the work looks complete as there are no outstanding issues for several of these with owners
Will reopen if you think that will encourage folks to follow up.
My thought was that it'd be easier to close this and simply copy the actions into January's retrospective issue, reviewing the results then.
I think the right way forward is to reopen it as you suggest, and to copy the actions once people have had a chance to update them with links to their issues.
Note: Any unresolved actions have been folded into the next retrospective for review. Link.
If any have been unintentionally missed, feel free to add them.