aws/random-cut-forest-by-aws

Add a CD pipeline that releases this project to maven

dblock opened this issue · 15 comments

Currently this project is built on a developer machine, and occasionally pushed to maven.

  1. Add a CD process that publishes a -SNAPSHOT to a snapshot maven repo with every commit.
  2. Add a CD process that publishes a release to maven central on demand.

This will close #301

Discussed with DB, he suggested:

  • Add a CD process that publishes a -SNAPSHOT to a snapshot maven repo with every commit.
  • Add a CD process that publishes a release to maven central on demand.

AD references X.Y-SNAPSHOT in its repo during development (picks up latest code), ensuring that the changes in RCF are constantly tested and bugs caught early.

Should we use similar way of OpenSearch to publish RCF to maven? I think it will be good if OpenSearch build workflow can cover RCF.
If we shouldn't include RCF in OpenSearch build workflow, we should confirm current RCF manual publish workflow and if we have dedicated sonatype space for RCF or any other common space we can use.

RCF is not part of opensearch-project, so I don't think it should use any OpenSearch project infrastructure unless you want to move the repo into that organization

I would follow https://docs.github.com/en/actions/publishing-packages/publishing-java-packages-with-gradle

Thanks @dblock , do you think we should follow the same practice of OpenSearch: publish pre-release RCF to sonatype repo first. Once test done, we can publish to public maven(maven central).

Thanks @dblock , do you think we should follow the same practice of OpenSearch: publish pre-release RCF to sonatype repo first. Once test done, we can publish to public maven(maven central).

Yes, that's what most projects do, right? Although on a second thought what kind of testing would one do with the pre-release? Is staging really needed?

From AD experience, RCF added TRCF and bumped version to 2.0, then we use local jar file to test TRCF in AD and found some bugs, then RCF fixed these bugs and back to AD. We did several iteration to reach a stable version. I think we should follow the same practice of OpenSearch and publish RCF to sonatype repo first. Once the new RCF version tested in AD and other places like MLCommmons, we can publish to public maven.

@ylwu-amzn My 0.02c is that this feels like the library is using AD testing to find bugs and in the future unnecessarily holds this library back. Thus, that's probably not a good long term strategy.

The story above could have been "AD was using 2.0-SNAPSHOT, found and reported a bunch of bugs which were fixed + tests, then eventually random-cut-forest-by-aws 2.0 was released, then more bugs were found that were fixed + tests, then 2.1 was released".

Yes, the story much like SQL and MLCommons plugin, we publish MLCommons 1.3-SNAPSHOT first, then SQL will consume this 1.3-SNAPSHOT to develop new PPL ML command and test. The difference is we have same release date for all plugins and OpenSearch core. So we can hold all -SNAPSHOT on sonatype first, once the IT passed and OpenSearch bundle test done, we can release OpenSearch and publish to public Maven.

For RCF, as this is not part of OpenSearch, maybe we can follow different release timeline and decouple the testing. RCF should test and make sure all new features work as expected first, then they can choose to publish as -SNAPSHOT or final public version. We don't need to test RCF new features in AD or MLCommons. If AD or MLCommons side found any bug, we can report to RCF and ask for a fix and new version, just like using any other external lib.

But I'm totally fine if RCF team think some feature/change is too big and we'd better test -SNAPSHOT version in AD or MLCommons first.

I think random-cut-forest-by-aws should be publishing -SNAPSHOT builds with every commit so that those projects that depend on it can do early testing. Everything else seems like overhead.

https://github.com/aws/aws-xray-sdk-java/blob/master/.github/workflows/release-build.yml

That's a good example. @amitgalitz , I think we can use the same way. @dblock do we need security review?

Thank you! Really good find! This is pretty similar to this guide. https://docs.github.com/en/actions/publishing-packages/publishing-java-packages-with-maven The main blocking point right now was related to the security and how best to store/fetch the current credentials. Also this is an action that is manually triggered to publish to maven central which addresses one of the goals. I'll make sure to see if its as simple as changing the trigger event and some other condition to also add a workflow to publish a -SNAPSHOT on merge.

Some quick updates:

  1. we've done a review with security team about putting credential onto github, and got approved.
  2. there were some discussion between doing this with maven vs converting the deployment script to gradle to follow the same practice as all other opensearch plugins.
  3. we decided to implement it in maven first, meanwhile discuss with the group on the feasibility/benefit of gradle option
  4. we'll publish a PR for the maven solution soon.

PR 334 merged