OpenLiberty/open-liberty

Update Resolver so that it can better handle tolerates

Closed this issue · 32 comments

Currently the resolver (when using resolve to install a single feature) creates a single set of prereqs. For features that have tolerates we could (easily) end up in a situation where the list of features are not compatible with each other as resolve was not designed with tolerates in mind and does not know the list of features a user has in their server.xml file (they could use multiple server.xml files too each with different dependencies, so installing one prereq chain is unlikely to work for both servers).

Quote from email chain: "It used to be that you could install any feature and get all of its dependencies and be sure it would start with any other feature, but that changed when we added tolerates. It got way worse with EE9 because we started using tolerates a lot more so that we could have one version of a management-type feature (like admin center) which works with both EE8 and EE9. I don't think it's even possible for a user to manually install all possible dependencies of a feature since we have public features which tolerate multiple versions of a private feature and we don't document the existence of the private features."

Proposal

To fix this we could instead install every feature in the prereq list instead of just 1 set of prereqs. This will end up installing more features, quite possibly more than required, but without knowing the server.xml we can't tell which exact features are required.

This work could either be done in the kernel resolver (which already does a lot of the resolver work) or in the repository resolver layer (which does a lot less work but does know about install lists which is something we could take advantage of).

Check the existing docs and guides to see if we need to update them regarding this change

Case 1 (most common scenario).

User installs features independently and starts the server

./installUtility install jakartaee-9.1 localConnector-1.0 microProfile-5.0

OR

./featureUtility installfeature jakartaee-9.1 localConnector-1.0 microProfile-5.0

server.xml configuration

<server description="new server">

   <!-- Enable features -->
   <featureManager>
        <feature>jakartaee-9.1</feature>
        <feature>localConnector-1.0</feature>
        <feature>microProfile-5.0</feature>
  </featureManager>

then start server:

./server start

….
[2022-08-03, 11:46:15:133 EDT] 00000039 com.ibm.ws.kernel.feature.internal.FeatureManager            E CWWKF0044E: The jakartaee-9.1 and io.openliberty.distributedMapInternal-1.0 features cannot be loaded at the same time. The jakartaee-9.1 feature of Jakarta EE 9 is incompatible with the io.openliberty.distributedMapInternal-1.0 feature of Java EE 6. The jakartaee-9.1 and jakartaee-9.1 configured features include an incompatible combination of features. Your configuration is not supported. Update the configuration to use features that support either the Jakarta EE or Java EE programming models, but not both.
[2022-08-03, 11:46:15:144 EDT] 00000039 com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "java.lang.IllegalArgumentException: Unable to load conflicting versions of features "com.ibm.websphere.appserver.eeCompatible-9.0" and "com.ibm.websphere.appserver.eeCompatible-6.0".  The feature dependency chains that led to the conflict are: io.openliberty.jakartaee-9.1 -> io.openliberty.mail-2.0 -> com.ibm.websphere.appserver.eeCompatible-9.0 and io.openliberty.jakartaee-9.1 -> com.ibm.websphere.appserver.restConnector-2.0 -> com.ibm.websphere.appserver.restHandler-1.0 -> com.ibm.wsspi.appserver.webBundleSecurity-1.0 -> com.ibm.websphere.appserver.distributedMap-1.0 -> io.openliberty.distributedMapInternal-1.0 -> com.ibm.websphere.appserver.eeCompatible-6.0 com.ibm.ws.kernel.feature.internal.FeatureManager reportErrors" at ffdc_22.08.03_11.46.15.0.log
[2022-08-03, 11:46:15:154 EDT] 00000039 com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "java.lang.IllegalArgumentException: Unable to load conflicting versions of features "com.ibm.websphere.appserver.servlet-3.1" and "com.ibm.websphere.appserver.servlet-5.0".  The feature dependency chains that led to the conflict are: io.openliberty.jakartaee-9.1 -> com.ibm.websphere.appserver.restConnector-2.0 -> com.ibm.websphere.appserver.restHandler-1.0 -> com.ibm.wsspi.appserver.webBundleSecurity-1.0 -> com.ibm.wsspi.appserver.webBundle-1.0 -> io.openliberty.webBundle.internal.ee-6.0 -> com.ibm.websphere.appserver.servlet-3.1 and io.openliberty.jakartaee-9.1 -> io.openliberty.batch-2.0 -> com.ibm.websphere.appserver.servlet-5.0 com.ibm.ws.kernel.feature.internal.FeatureManager reportErrors" at ffdc_22.08.03_11.46.15.1.log
[2022-08-03, 11:46:15:157 EDT] 00000039 com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "java.lang.IllegalArgumentException: Unable to load conflicting versions of features "com.ibm.websphere.appserver.transaction-1.1" and "com.ibm.websphere.appserver.transaction-2.0".  The feature dependency chains that led to the conflict are: io.openliberty.jakartaee-9.1 -> io.openliberty.messagingClient-3.0 -> io.openliberty.messaging.internal-3.0 -> io.openliberty.connectors-2.0.internal -> com.ibm.websphere.appserver.connectionManagement-1.0 -> io.openliberty.connectionManager1.0.internal.ee-6.0 -> com.ibm.websphere.appserver.transaction-1.1 and io.openliberty.jakartaee-9.1 -> io.openliberty.messagingClient-3.0 -> io.openliberty.messaging.internal-3.0 -> io.openliberty.connectors-2.0.internal -> com.ibm.websphere.appserver.transaction-2.0 com.ibm.ws.kernel.feature.internal.FeatureManager reportErrors" at ffdc_22.08.03_11.46.15.2.log
[2022-08-03, 11:46:15:160 EDT] 00000039 com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "java.lang.IllegalArgumentException: Unable to load conflicting versions of features "io.openliberty.servlet.api-3.0" and "io.openliberty.servlet.api-5.0".  The feature dependency chains that led to the conflict are: io.openliberty.jakartaee-9.1 -> com.ibm.websphere.appserver.restConnector-2.0 -> com.ibm.websphere.appserver.restHandler-1.0 -> com.ibm.wsspi.appserver.webBundleSecurity-1.0 -> com.ibm.websphere.appserver.distributedMap-1.0 -> io.openliberty.distributedMapInternal-1.0 -> io.openliberty.servlet.api-3.0 and io.openliberty.jakartaee-9.1 -> io.openliberty.webProfile-9.1 -> io.openliberty.cdi-3.0 -> io.openliberty.servlet.api-5.0 com.ibm.ws.kernel.feature.internal.FeatureManager reportErrors" at ffdc_22.08.03_11.46.15.3.log

Case 2.

installUtility download <features> uses resolve() api. When users download features to their local repository then try to install features from that local repository, it might fail due to missing dependencies. Issue https://github.ibm.com/websphere/WS-CD-Open/issues/24204

Case 3 .

Currently Install kernel calls resolve() api twice to make sure auto features are installed. If we call just once, then the following install scenario fails: Issue https://github.ibm.com/websphere/WS-CD-Open/issues/27325

  1. Install cdi-2.0 and ejbLite-3.2 using the resolve() api
  2. The auto feature com.ibm.websphere.appserver.cdi2.0-ejb3.2 should be installed.

Quote from email:

When you call resolve(), the resolver will return
Any of the requested features and their dependencies which are not installed
Any autofeatures which would be enabled by the combination of the features already installed and the features to be installed

When you call resolveAsSet(), the resolver will return
All not installed features which are needed to start the server with the requested features enabled
Any autofeatures which should be enabled when the server is started with the requested features

This means:
The install code should only ever need to call either resolve() or resolveAsSet() once - any needed autofeatures will be included in the result
If you somehow manage to get into a situation where all the dependencies of an autofeature are installed but the autofeature itself isn't, any call to resolve() will return the autofeature

There was agreement that the installer will now attempt to install every prereq "compatible" feature that will result in a larger image as it will install feature versions possibly not used, but the benefit is the user won't run into incompatible error messages...

Can't look at this until Telemetry feature is done.

For case 1 & 2, it sounds like we want to change RepositoryResolver so that it finds all tolerated children of the requested features. Is anyone assigned from the install side to update their tests?

For case 3, do we have an example where this fails? I would like to know what the state of the server is, what is passed into the repository resolver and what it returns from resolve().

Lastly, do we need to be concerned with the case where the user installs some features using resolveAsSet() and then later adds some with resolve()? E.g. resolveAsSet(EE10 + adminCenter-1.0), then resolve(EE8). There would definitely be extra private features for adminCenter which we need to install if you want adminCenter to work with EE8 in this example.

To be sure that everything will start together, we need to fetch any features tolerated by the already installed features which aren't already installed, even if they're not used by any of the requested features.

I'm not sure if we have any base images which could cause this scenario without the user explicitly doing an install as set, followed by an install.

@bensonlatibm FYI comment above regarding required install tests covering these problem scenarios.

@jjiwooLim , can you add details for the tests, and add any other feedback for Andrew's questions?

@Azquelt, I think we do want to handle the case where a user installs some features with resolveAsSet() followed by resolve()

For Case 1&2, I created an issue #24315.

Case 3 example

Failure Scenario 1:
1.install cdi-2.0 using resolve()
2.install ejbLite-3.2 using resolveAsSet() (server.xml)
doesn’t install com.ibm.websphere.appserver.cdi2.0-ejb3.2.mf. server starts without any warnings or errors.

Failure Scenario 2:
1.install cdi-2.0 using resolveAsSet()
2.install ejbLite-3.2 using resolveAsSet() —> doesn’t install com.ibm.websphere.appserver.cdi2.0-ejb3.2.mf. It should have installed because both cdi-2.0 and ejbLite is installed.
3.install both cdi-2.0 and ejbLite-3.2 using resolveAsSet() —> installs com.ibm.websphere.appserver.cdi2.0-ejb3.2.mf

Success Scenario :
install cdi-2.0 using resolve()
install ejbLite-3.2 using resolve()
installs com.ibm.websphere.appserver.cdi2.0-ejb3.2.mf

@jjiwooLim This is great, thank you :)

Sounds like the changes needed are:

  • resolve() should return all tolerated dependencies of the requested features and all tolerated dependencies of all installed features
  • resolveAsSet() should return all autofeatures which would be enabled by any combination of the installed features + the resolved new features. (I.e. X is already installed, we resolveAsSet (A, B), we should get an autofeature enabled by A+X, even though X is not in the set to resolve)

I've got a potential set of changes in #24638

There was one scenario I wasn't sure about around installing tolerated dependencies of an autofeature which aren't required for resolveAsSet which can be seen in this commit: 163a7c6

There are a number of auto-features in open liberty which have dependencies with tolerates:

com.ibm.websphere.appserver.batchSecurity-1.0.feature
com.ibm.websphere.appserver.jaxrsAppSecurity-2.0.feature
com.ibm.websphere.appserver.mpRestClient1.0-ssl1.0.feature
com.ibm.websphere.appserver.webAppSecurity-1.0.feature
io.openliberty.adminCenter1.0.javaee.feature
io.openliberty.autoPasswordUtilities1.0.javaee.feature
io.openliberty.batchManagement1.0.javaee.feature
io.openliberty.batchManagement1.0-messaging3.0.feature
io.openliberty.batchSecurity-2.0.feature
io.openliberty.data1.0-jdbc.feature
io.openliberty.jwtSso1.0.javaee.feature
io.openliberty.oauth2.0.javaee.feature
io.openliberty.openidConnectClient1.0.javaee.feature
io.openliberty.openidConnectServer1.0.javaee.feature
io.openliberty.socialLogin1.0.javaee.feature
io.openliberty.webCache1.0.javaee.feature
io.openliberty.wsSecurity1.1.javaee.feature

Failure Scenario 2:
1.install cdi-2.0 using resolveAsSet()
2.install ejbLite-3.2 using resolveAsSet() —> doesn’t install com.ibm.websphere.appserver.cdi2.0-ejb3.2.mf. It should have installed because both cdi-2.0 and ejbLite is installed.
3.install both cdi-2.0 and ejbLite-3.2 using resolveAsSet() —> installs com.ibm.websphere.appserver.cdi2.0-ejb3.2.mf

@jjiwooLim @bensonlatibm Thinking about it further, I don't think this is a valid case at all. When you resolveAsSet(), you should get exactly the set of features required to allow that set of features to start and nothing more.

Furthermore, actually making this work fully is really difficult. As noted above, autofeatures can have tolerated dependencies. In general, we can only resolve tolerated dependencies correctly when resolving as a set. As soon as you start wanting to pull in more features that aren't needed by the set, we don't know which tolerated dependency to include.

The only plausible solution to this is to pull in all tolerated versions of a dependency (and all tolerated versions of their dependencies etc.) which risks bloating the size of the install.

We need to the understand and consider the impact of this change to the build plugins and dev mode, especially when it comes to performance and user experience. (We are looking for optimal path for installs and speed with feature resolver here.)

Cc @scottkurz @cherylking @TrevCraw

To follow-up on YK's comment, we just opened an issue #24592 in the same general area because performance is so slow, when using Liberty Maven Plugin, when the features are all already installed. Not sure if it's helpful here to point out but FYI.

I've added a comment to issue #24592 detailing what I know about the stacktrace which @scottkurz posted.

This issue might be tangentially relevant to this one, I think install currently tries to resolve features twice because they weren't happy with the autofeatures returned by the resolver, if this change allows them to resolve only once that might speed things up a bit.

We'd need the corresponding install changes to go with my resolver changes to be able to test whether there's a difference.

@yeekangc To clarify my last comment, I don't think this change will make performance any worse and it may make it a bit better (maybe 2x speedup).

This issue was about ensuring that we pick the correct features to install. Other than ensuring we don't regress, I don't want to look at performance in this issue.

I'm much more concerned about problems where users install features on the command line and then their server doesn't start. I would really appreciate input from you and @scottkurz on my previous comment - I believe it was Scott who raised this scenario as an issue originally in OpenLiberty/ci.maven#747 (with the steps to reproduce clarified by Alvin in this comment).

Thanks, Andrew. I am of the opinion that we shouldn't install more than what is needed especially for dev mode-related scenarios.

I will let @scottkurz chime in further regarding what he reported initially. Feels like we should double check if and where we are using resolve() in the build plugins.

I talked to @scottkurz about OpenLiberty/ci.maven#747.

It looks like in that issue that the original problem was "I start with cdi-2.0, then I add ejbLite-3.2 and the autofeature doesn't get installed". I'd agree that this scenario was a bug. Talking with Scott, he thinks the development tools should be passing through all the server.xml features, so when ejbLite-3.2 is added, we would end up doing resolveAsSet(List.of("cdi-2.0", "ejbLite-3.2")) which would return the autofeature.

However, in the comments of that issue, the problem got turned into "I start with cdi-2.0, I remove cdi-2.0 and add ejbLite-3.2 and the autofeature doesn't get installed". I don't think this is a bug and it throws up the problems with autofeatures having tolerated dependencies which I went into in a previous comment.

@jjiwooLim @bensonlatibm Do you agree that we can drop the requirement for the following scenario where we expect resolveAsSet() to return autofeatures which are satisfied by the installed features but wouldn't be needed when starting a server with just the requested features?

Failure Scenario 2:
1.install cdi-2.0 using resolveAsSet()
2.install ejbLite-3.2 using resolveAsSet() —> doesn’t install com.ibm.websphere.appserver.cdi2.0-ejb3.2.mf. It should have installed because both cdi-2.0 and ejbLite is installed.

If we don't drop this requirement, we need to refine it to state what we expect to happen when an autofeature has dependencies with tolerates.

I'm on vacation next week, so I've gone ahead and made this change in my PR #24638, along with the review feedback from @pnickoll.

We can still change this part again if a different decision about the desired behaviour here is made, but I wanted to make sure the install team have something they can test with while I'm away.

I think the next action here is for @jjiwooLim and @bensonlatibm to take my resolver changes and run the install tests against them (after any required test updates to account for the new expected behaviour).

I agree that we can drop failure scenario 2, and I will adjust our FAT test accordingly. With the changes you made, I have removed the additional resolve() API call on our install side and tested a few test cases manually, all of which completed as expected. I believe this change will also improve performance.

Regarding the implementation of FAT tests for issue #24315, I plan to create the PR by the end of this week. I am waiting to merge a PR that refactors our FAT.

The changes in #24638 are complete, the review changes have been squashed and I've rebased it on the latest release branch. A personal build is running now - all the resolver tests should pass but I expect some of the install tests will fail.

@jjiwooLim You can now take these changes and add whatever changes are needed for install and the install tests. Then we should be finally ready to merge (pending feature reviews, thank you Tom for adding the no-design-approval-request).

Let me know if you find any problems or unexpected changes in behaviour, or if you have any questions.

@Azquelt I removed the additional resolve() call and added tests for #24315. It's ready to be merged. Thank you.

@jjiwooLim Thanks, I'll have a look. @tevans78 reminded me that this epic should probably be approved before we merge anything since we're just changing the behaviour and it's not going through a beta or anything. I'll follow up on that tomorrow.

This Epic is really a bug fix for the problem that a user can install features from the command line but end up not being able to use them together because the required tolerated dependencies have not been installed.

For this reason, the code changes have been done as we would for a bug fix - changing the code directly rather than creating a parallel flow with a guard.

I propose that we merge this change after the next release is cut so that there is time to revert it if there are issues.

Remaining process for this epic:


GA

A feature is ready to GA after it is Feature Complete and has obtained all necessary Focal Point Approvals.

Feature Complete

  • Feature implementation and tests completed.
    • All PRs are merged.
    • All epic and child issues are closed.
    • All stop ship issues are completed.
  • Legal: all necessary approvals granted. (No new dependencies)
  • Translation: All messages translated or sent for translation for upcoming release (No new messages)
  • GA development complete and feature ready for inclusion in a GA release
    • Add label target:ga and the appropriate target:YY00X (where YY00X is the targeted GA version).
    • Inclusion in a release requires the completion of all Focal Point Approvals.

Focal Point Approvals (Complete by Feature Complete Date)

These occur only after GA of this feature is requested (by adding a target:ga label). GA of this feature may not occur until all approvals are obtained.

All Features

  • APIs/Externals Externals have been reviewed or N/A. (OpenLiberty/externals-approvers)
    • Approver adds label focalApproved:externals
  • Demo Demo is scheduled for an upcoming EOI or N/A. (OpenLiberty/demo-approvers)
    • Add comment @OpenLiberty/demo-approvers Demo scheduled for EOI [Iteration Number] to this issue.
    • Approver adds label focalApproved:demo.
  • FAT All Tests complete and running successfully in SOE or N/A. (OpenLiberty/fat-approvers)
    • Approver adds label focalApproved:fat.
  • Globalization Translation and TVT are complete or N/A. (OpenLiberty/globalization-approvers)
    • Approver adds label focalApproved:globalization.

GA Blog (Complete by Feature Complete Date)

  • GA Blog issue created and populated using the Open Liberty GA release blog post template.
    • Add a link to the GA Blog issue in the Documents section.
    • Note: This is for inclusion into the overall release blog post. If, in addition, you'd also like to create a dedicated blog post about your feature, then follow the "Standalone Feature Blog Post" instructions under the Other Deliverables section.

Post GA

@OpenLiberty/globalization-approvers can we have globalization approval please? There are no message changes.

@OpenLiberty/externals-approvers No APIs added or changed in this feature. Is there anything else that needs to be considered for externals approval?

@OpenLiberty/demo-approvers Demo scheduled for EOI 23.10

@OpenLiberty/fat-approvers The feature test summary is #25262.

Please let me know if you're happy with it and if there's anything else you need from us other than a clean SOE run.

ayoho commented

@Azquelt Everything is looking good. All I'm waiting for is #25106 to make it through an SOE. Are there any other PRs I need to be aware of for code related to this feature?

@ayoho No, that's the only one.

The code for this was delivered in #23106, but then reverted in #25377 after an unrelated unit test picked up a performance issue.

After we cut 23.0.0.6, I hope to redeliver the changes with additional fixes in #25446 for 23.0.0.7.

The changes for this feature have been redelivered. I'll be keeping a watch to see when they're in the release branch so we know when to expect it to be included in an SOE run.

Changes were reverted again and redelivered again in #25510 on 12 July (new features added for EE11 showed up additional performance issues).

Targeting 23.0.0.8 and waiting for an SOE run which includes the change.

Opened a GA blog issue for this: #25926

@Azquelt @pnickoll please close this issue. Thanks!